Looking into floating point calculations

To frame this post with a bit of context. Say you wanted to calculate a value where the formula have a infinite term. We’ll choose the exponential function:

Say you wanted to calculate exp(y) to a specific decimal precision p.

The full code we’ll be working with is the following:

bool exp(int x, int precision, double& expVal)
{
   double totalPrecision = pow(10, precision);
   double deltaDiff = (double)1 / totalPrecision;

   // Validate input.
   if (x <= 0)
   {
      return 1;
   }

   double total = 1.0;
	
   // keep track of the previous value to get 
   // O(n) complexity.
   double expo = 1;
   double fact = 1;
   double totalPre = 1;
   double diff = x + 2.0;

   int i = 1;
   while (deltaDiff < diff)
   {
      // Calculation for exponential and factorial.
      expo *= x;
      fact *= i;

      total += expo / fact;

      // Calculate the difference to determine 
      // if we have enough precision.
      diff = total - totalPre;

      // Return value if we have converged on 
      // enough precision.
      if (deltaDiff > diff)
      {
         expVal = total;
         return true;
      }

      totalPre = total;
      ++i;
   }

   // Failure. Too much precision.
   return false;
}

The key part of the code is really this part where we calculate the new value, subtract the previous value, and then test to see if the different is smaller than the precision we are looking forecasting

// Calulate the difference to determine 
// if we have enough precision.
diff = total - totalPre;

// Return value if we have converged
// on enough precision.
if (deltaDiff > diff)
{
   expVal = total;
   return true;
}

It all looks fairly good so far. However the fly in the ointment is that past a certain point, the difference between the new total and the previous is 0. Even if the desired precision is not reached.

If we set z=3 and 20 decimal points of precision, around the 33rd loop, the diff = 0. The value we’re adding to the diff is small:
1.3406779175591597601160505856455e-14

If you recall floating point numbers are stored in a mantissa and exponent. Basically mantissa * 2^exponent. Since the mantissa is 52 bits that gives us 15 decimal of precision. Anything past that and we’re toast. Therefore with this simple algorithm we have to cap the precision to a total of 15 decimals.

But is there more to consider? Yes there is. As we add numbers with large precision, we constantly truncate those below the 15 decimals of precision. This leads to an error which is truncated. For example, if we only had 5 decimals of precision. If we added x = 1.2345 + 2000 = 2001.2. We lost 0.0345. These can add up over time. An algorithm tracking these errors is the Kahan algorithm.

And that’s about as far as I’ll go on floating point precision calculations for this post. As you can see, it goes much further down the rabbit hole.

Leave a Reply

Your email address will not be published. Required fields are marked *