An alternative definition of stochastic rounding is:

where x is some real value, RA stands for round-away-from-zero, RZ is a rounding mode round toward-zero, and P belongs to [0, 1) is a random number from a uniform distribution. Notice that in the equation, x−RZ(x) is the distance between x and a floating-point number on the left of it (on the negative side of the axis it would be right of it), and |RA(x)−RZ(x)| is the distance between the two floating-point values that surround x.

Part I: Testing the Binary32 Stochastic Rounding

Two implementations of (1) are available in the C program file provided for this assignment, which is assignment1.c. These are float SR(double x) and float SR alternative(double x), and as shown by the definitions, they take as inputs binary64 numbers and produce binary32 numbers, which are stochastically rounded. Special cases such as overflow are not covered here for simplicity.

The code can be compiled by running gcc -o assignment1 assignment1.c -lm in a linux terminal, and run with ./assignment1. The flag -lm may be required on some compilers to use math.h functions. For this part of the assignment, write your C code in the int main() method, and complete the following tests.

1. Pick one of the functions that implement Eqn. (1), and decribe in your report how each line of the function implements the definition of stochastic rounding.

2. Test one of the functions in order to show that it correctly implements Eqn. (1). Pick a value that is representable in binary64 more accurately than in binary32, for example π (use M PI) which is predefined in C, or 1/3, and round it 5 million times using one of the SR functions provided.

3. Count the number of times rounding was up and down, and use these counters to confirm that the observed probabilities of rounding up and down are close to expected. To calculate the expected probabilities round the binary64 value up and down (for example using nextafterf() function available in C) to obtain the two neighbouring binary32 values, and compute the distances of each of them to the binary64 value, and divide by the distance to each other. You may also find the standard C function fabs() useful here.

4. Furthermore, calculate the average value from all the rounded binary32 answers. Plot the absolute error of the average of all the rounded binary32 values compared with the original binary64 as the number of roundings i increases (sample at fixed steps such as on every 1000th step). Have i on the x-axis and errors on the y-axis. You can print out the data into the terminal using printf(), write it into a file, and plot the diagrams using for example gnuplot (and importing images into your report), pgfplots directly in LATEX(importing data into latex from a .txt file), or any other tool of your choice. Analyse how it changes as more roundings are performed. How does it change in relation to the unrounded binary64 value?

Marks will be given for the correctness of the code using one of the SR function implementations and the code that allowed to compute expected and observed rounding probabilities and gather the data. To obtain more marks, your code should be concise, with descriptive variable names and informative comments, lines of code no longer than 79 columns, and no commented out (testing) code should appear in the submitted version unless it is a useful code for printing out some data for example.

Part II: Harmonic series

Stagnation occurs in floating-point arithmetics when for some FP values a and b, where a >> b, RN(a + b) = a. It can occur in any application that uses FP numbers, but is most easily shown with a toy problem called harmonic series, introduced in week 3. It is defined as:

and in infinite precision is known to be a divergent series, but converges to some value in limited precision computer arithmetics. To analyse this on computers, we use a truncated version of the series:

In this exercise we will be analysing what happens to this series in different arithmetics, rounding modes, or summation methods. You may find [5] to be a useful research paper to have a look at for this part.

In the same code file as was used for Part I, implement the harmonic series and analyse it. A recursive summation method with binary32 arithmetic and round to nearest mode (default) in the addition and division operations, is already given as a starting point.

Use recursive summation method with the binary32 arithmetic, but simulate stochastic rounding. To do this, compute division and addition in binary64 arithmetic, and round the result using one of the provided SR functions, on each iteration, to update the binary32 result fharmonic sr.

Use the compensated summation with the binary32 arithmetic and round to nearest mode. This will require the fastTwoSum function.

Use recursive summation method with the binary64 arithmetic and round to nearest (default). Here store the result in binary64 rather than binary32 as previously. This acts as a reference for comparison since it is expected to be the most accurate result.

Run the harmonic series for N = 500000000 (five hundred million) iterations. Report at what i step the basic binary32 implementation with the recursive summation stagnates. Additionally, compute the absolute errors of the three binary32 harmonic series values compared with the binary64 one, and report which one of the three has the lowest error and which one the highest.

Plot the absolute errors (take absolute values of absolute errors to make all errors positive) of each of the three harmonic series at regular points (every 1000000th iteration) from i = 1 to i = 500000000 (five hundred million). The x-axis should contain i, while the y-axis the errors of the harmonic series. Set both axes to logarithmic scale. Once again, it is best to printout the data into a file and use any plotting software to plot it by reading from the text file.

Part III: Other Applications of SR

Demonstrate another algorithm or application where SR can improve performance, and evaluate the effectiveness of SR, for example by analysing performance with the four arithmetics used in Part II. Consider research papers, such as [1, 2, 3, 5], and other research sources for inspiration, and document in your report the method, results and analysis following the format of Part II. The awarded marks for Part III are the same as Parts I & II, so you should aim to invest similar effort for this section.

References

[1] M. P. Connolly, N. J. Higham, and T. Mary, Stochastic rounding and its probabilistic backward error analysis, SIAM J. Sci. Comput., 43 (2021), pp. A566–A585.

[2] M. Croci, M. Fasi, N. J. Higham, T. Mary, and M. Mikaitis, Stochastic rounding: Implementation, error analysis, and applications, MIMS EPrint 2021.17, Oct. 2021. Revised Jan. 2022. To appear in R. Soc. Open Sci.

[3] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, Deep learning with limited numerical precision, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, July 2015, PMLR, pp. 1737–1746.

[4] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754- 2008), Institute of Electrical and Electronics Engineers, Piscataway, NJ, USA, July 2019.

[5] D. Malone, To what does the harmonic series converge?, Irish Math. Soc. Bull., (2013), pp. 59–66.

Implementation

```
union DoubleToInt
{
double dVal;
uint64_t iVal;
};
/*
A function that rounds a binary64 value to a binary32 value
stochastically. Implemented by treating FP number representations
as integer values.
*/
float SR(double x)
{
union DoubleToInt temp;
temp.dVal = x;
uint32_t r = rand() & 0x1FFFFFFF;
temp.iVal += r;
temp.iVal = temp.iVal & 0xFFFFFFFFE0000000;
return (float)temp.dVal;
}
/*
A function that rounds a binary64 value to a binary32 value
stochastically. Implemented by working at the FP arithmetic level rather
than manipulating the binary representation of FP numbers directly.
*/
float SR_alternative(double x)
{
float closest = (float)x;
float down, up;
if (closest > x)
{
down = nextafterf(closest, -INFINITY);
up = closest;
}
else
{
up = nextafterf(closest, +INFINITY);
down = closest;
}
double r = (double)rand() / RAND_MAX;
double prob_up = fabs(x - down) / fabs(up - down);
if (r < prob_up)
return up;
else
return down;
}
const long int K = 5000000;
double avg;
int main()
{
/* --------------------------------- */
/* PART 1 */
/* --------------------------------- */
// An arbitrary value for rounding.
double sample = M_PI;
// Calculate the neighbouring binary32 values.
float closest = (float)sample;
float down, up;
if (closest > sample)
{
down = nextafterf(closest, -INFINITY);
up = closest;
}
else
{
// TODO
}
// Round many times, and calculate the average values as well as count
// the numbers of times rounding was up/down.
int countup, countdown = 0;
double avg;
for (int i = 1; i <= K; i++)
{
// TODO
}
// Print out some useful stats.
printf("Value being rounded: %.60f \n", sample);
printf("Binary32 value before: %.60f \n", down);
printf("Binary32 value after: %.60f \n", up);
printf("Closest binary32: %.60f \n", closest);
// Print out the average of all rounded values
// TODO
// Check that SR function is correct by comparing the probabilities
// of rounding up/down, and the expected probability. Print them out
// below.
// TODO
/* --------------------------------- */
/* PART 2 */
/* --------------------------------- */
long int N = 500000000;
float fharmonic = 0;
float fharmonic_sr = 0;
float fharmonic_comp = 0;
double dharmonic = 0;
// Error term in the compensated summation.
float t = 0;
for (int i = 1; i <= N; i++)
{
// Recursive sum, binary32 RN
fharmonic += (float)1 / i;
// Other summation methods, TODO.
}
printf("Values of the harmonic series after %ld iterations \n", N);
printf("Recursive summation, binary32: %.30f \n", fharmonic);
printf("Recursive summation with SR, binary32: %.30f \n", fharmonic_sr);
printf("Compensated summation, binary32: %.30f \n", fharmonic_comp);
printf("Recursive summation, binary64: %.30f \n", dharmonic);
/* --------------------------------- */
/* PART 3 */
/* --------------------------------- */
// TODO
}
```

If you need solution of above problem or any other C programming related help then you can comment in below comment section or send your requirement details at:

realcode4you@gmail.com

## Comments