When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly and then rounded to the nearest floating-point number. Operations computed exactly and then rounded will be called exactly rounded. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding. So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought rounds halfway cases up, so that 12.5 rounds to 13. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half.

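One way of obtaining this 50% behavior is to require that the rounded result have an even least significant digit. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [] offer the following reason for preferring round to even. Let x and y be floating-point numbers, and define $x_0 = x$, $x_1 = (x_0 \ominus y) \oplus y$, ..., $x_n = (x_{n-1} \ominus y) \oplus y$. Their theorem states that if $\oplus$ and $\ominus$ are exactly rounded using round to even, then either $x_n = x$ for all n or $x_n = x_1$ for all $n \ge 1$. To clarify this result, consider $\beta = 10$, $p = 3$ and let x = 1.00, y = -.555. When rounding up, the sequence becomes $x_0 \ominus y = 1.56$, $x_1 = 1.56 \ominus .555 = 1.01$, $x_1 \ominus y = 1.01 \oplus .555 = 1.57$, and each successive value of $x_n$ increases by .01. Under round to even, $x_n$ is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language.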
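The drift under round up, and its absence under round to even, can be reproduced with a short sketch. It assumes Python's `decimal` module as a stand-in for a $\beta = 10$, $p = 3$ machine and uses x = 1.00, y = -.555 as illustrative inputs; the function name `iterate` is hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def iterate(rounding, steps, x=Decimal("1.00"), y=Decimal("-0.555")):
    """Apply x <- (x - y) + y `steps` times, rounding every operation to 3 digits."""
    ctx = Context(prec=3, rounding=rounding)
    for _ in range(steps):
        x = ctx.add(ctx.subtract(x, y), y)
    return x

drift_up = iterate(ROUND_HALF_UP, 5)      # halfway cases always round up
drift_even = iterate(ROUND_HALF_EVEN, 5)  # halfway cases round to the even digit
print(drift_up, drift_even)
```

With round up the value creeps upward by .01 per iteration, while round to even returns to the starting value every time.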

The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic. The key to multiplication in this system is representing a product x y as a sum, where each summand has the same precision as x and y.

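This can be done by splitting x and y. Writing $x = x_h + x_l$ and $y = y_h + y_l$, the exact product is $xy = x_h y_h + x_h y_l + x_l y_h + x_l y_l$. If x and y have p-bit significands, the summands will also have p-bit significands provided that $x_l$, $x_h$, $y_h$, $y_l$ can be represented using $\lfloor p/2 \rfloor$ bits. When p is even, it is easy to find a splitting: the number $x_0.x_1 \ldots x_{p-1}$ can be written as the sum of $x_0.x_1 \ldots x_{p/2-1}$ and $0.0 \ldots 0x_{p/2} \ldots x_{p-1}$. When p is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. For example, if $\beta = 2$, $p = 5$, and x = .10111, x can be split as $x_h = .11$ and $x_l = -.00001$. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [], but it requires more than a single guard digit. With $k = \lceil p/2 \rceil$ and $m = \beta^k + 1$, and assuming exactly rounded operations, x can be split as $x_h = (m \otimes x) \ominus (m \otimes x \ominus x)$, $x_l = x \ominus x_h$.

To see how splitting helps, let $\beta = 10$, $p = 4$, b = 3.476, a = 3.463, and c = 3.479. Then $b^2 - ac$ rounded to the nearest floating-point number is .03480, while $b \otimes b = 12.08$, $a \otimes c = 12.05$, and so the computed value of $b^2 - ac$ is .03. This is an error of 480 ulps. Writing b = 3.5 - .024, a = 3.5 - .037, and c = 3.5 - .021, $b^2$ becomes $3.5^2 - 2 \times 3.5 \times .024 + .024^2$. Each summand is exact, so $b^2 = 12.25 - .168 + .000576$, where the sum is left unevaluated at this point. Similarly, $ac = 3.5^2 - (3.5 \times .037 + 3.5 \times .021) + .037 \times .021 = 12.25 - .2030 + .000777$. Finally, subtracting these two series term by term gives an estimate for $b^2 - ac$ of $0 \oplus .0350 \ominus .000201 = .03480$, which is identical to the exactly rounded result.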
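As a rough sketch of the splitting idea in binary, the following assumes IEEE double precision (p = 53, so k = 27 and m = 2^27 + 1) with round to nearest, and ignores overflow and underflow. The names `split` and `two_product` are illustrative, not taken from the text.

```python
from fractions import Fraction

SPLITTER = 2.0**27 + 1.0  # m = beta**k + 1 with beta = 2 and k = ceil(53 / 2)

def split(x):
    """Split x into a high and a low part, each fitting in about half the significand."""
    t = SPLITTER * x
    hi = t - (t - x)
    lo = x - hi
    return hi, lo

def two_product(x, y):
    """Return (p, err) such that p is the rounded product and p + err is x * y exactly."""
    p = x * y
    xh, xl = split(x)
    yh, yl = split(y)
    # Each partial product below is exact because its factors are short.
    err = ((xh * yh - p) + xh * yl + xl * yh) + xl * yl
    return p, err

p, err = two_product(0.1, 0.3)
print(p, err)
```

The pair `(p, err)` is an unevaluated sum of two doubles that together carry the full product, which is the building block the array-of-floats representation relies on.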

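As a final example of exact rounding, consider dividing m by 10. The result is a floating-point number that will in general not be equal to m/10. When $\beta = 2$, multiplying $m \oslash 10$ by 10 will restore m, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true: when $\beta = 2$, if m and n are integers with $|m| < 2^{p-1}$ and n has the special form $n = 2^i + 2^j$, then $(m \oslash n) \otimes n = m$, provided floating-point operations are exactly rounded.

We are now in a position to answer the question, Does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense that they have a small relative error. The section Cancellation discussed several algorithms that require guard digits to produce correct results in this sense. If the inputs to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting.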
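The division identity can be spot-checked on a machine with exactly rounded binary arithmetic. This sketch assumes Python floats are IEEE doubles (p = 53) and tries divisors of the form $2^i + 2^j$; the helper name `survives_round_trip` is made up for illustration.

```python
def survives_round_trip(m, n):
    """True when dividing m by n and multiplying back recovers m exactly."""
    return (m / n) * n == m

# 3 = 2 + 1, 5 = 4 + 1, 10 = 8 + 2 all have the form 2**i + 2**j.
checked = all(
    survives_round_trip(m, n)
    for n in (3, 5, 10)
    for m in range(1, 20001)
)
print(checked)
```

A check like this is only evidence, not a proof; the guarantee comes from the exact-rounding argument, not from the sample.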

The reason is that the benign cancellation x - y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value. There are two different IEEE standards for floating-point computation.

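IEEE 754 is a binary standard that requires $\beta = 2$, with p = 24 for single precision and p = 53 for double precision. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either $\beta = 2$ or $\beta = 10$ and, unlike 754, does not specify how floating-point numbers are encoded into bits. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision. The term IEEE standard will be used when discussing properties common to both standards. This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included.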

It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. Base ten is how humans exchange and think about numbers. There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. The section Relative Error and Ulps mentioned one reason: the results of error analyses are much tighter when $\beta$ is 2 because a rounding error of .5 ulp wobbles by a factor of $\beta$ when computed as a relative error, and error analyses are almost always simpler when based on relative error.

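A related reason has to do with the effective precision for large bases. Consider $\beta = 16$, p = 1 compared to $\beta = 2$, p = 4. Both systems have 4 bits of significand. Consider the computation of 15/8. When $\beta = 2$, 15 is represented as $1.111 \times 2^3$, and 15/8 as $1.111 \times 2^0$. So 15/8 is exact. However, when $\beta = 16$, 15 is represented as $F \times 16^0$, where F is the hexadecimal digit for 15. But 15/8 is represented as $1 \times 16^0$, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p - 3 rather than 4p binary bits. Since large values of $\beta$ have these problems, why did IBM choose $\beta = 16$ for its System/370? Only IBM knows for sure, but there are two possible reasons.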

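The first is increased exponent range. Single precision on the System/370 has p = 6 hexadecimal digits. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. To get a similar exponent range when $\beta = 2$ would require 9 bits of exponent, leaving only 22 bits for the significand. The second possible reason has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. With a large base, fewer pairs of operands have different exponents, so fewer shifts are needed.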

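Another advantage of using $\beta = 2$ is that there is a way to gain an extra bit of significance. Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of space representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats that this requires a special convention for 0. The method given there was that an exponent of $e_{min} - 1$ and a significand of all zeros represents not $1.0 \times 2^{e_{min}-1}$, but rather 0.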

IEEE single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In IEEE , single and double precision correspond roughly to what most floating-point hardware provides.

Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits. The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that "Implementations should support the extended format corresponding to the widest basic format supported."

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally.

By displaying only 10 of the 13 digits, the calculator appears to the user as a "black box" that computes exponentials, cosines, etc. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with.

It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator. Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see the section Binary to Decimal Conversion).
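The 9-digit claim can be spot-checked with a small sketch. It assumes Python's `struct` module as a way to round a double to IEEE single precision, and the helper names `to_single` and `round_trips` are hypothetical. The decimal string is parsed in double precision and then rounded to single, which is not the conversion algorithm described in this section, only a convenient check of the round trip.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_trips(x, digits):
    """Print the single nearest to x with `digits` significant digits and read it back."""
    s = to_single(x)
    text = f"{s:.{digits - 1}e}"
    return to_single(float(text)) == s

samples = [0.1, 1.0 / 3.0, 2.0 / 3.0, 3.14159265, 1.0e-10, 6.02e23, 16777217.0]
all_ok = all(round_trips(x, 9) for x in samples)
print(all_ok)
```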

When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one.

First read in the 9 decimal digits as an integer N, ignoring the decimal point. Next find the appropriate power $10^P$ necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute $10^{|P|}$. This will be exact if $|P| \le 13$, because $10^{13} = 2^{13} 5^{13}$, and $5^{13} < 2^{32}$. Finally multiply (or divide if P < 0) N and $10^{|P|}$. If this last operation is done exactly, then the closest binary number is recovered. The section Binary to Decimal Conversion shows how to do the last multiply (or divide) exactly. Thus for $|P| \le 13$, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded).

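If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17-digit decimal number and back would require the double-extended format.

Since the exponent can be positive or negative, some method must be chosen to represent its sign. There are two common methods of representing signed numbers, sign/magnitude and two's complement. Sign/magnitude is the system used for the sign of the significand in the IEEE formats: one bit is used to hold the sign, and the rest of the bits represent the magnitude of the number. The two's complement representation is often used in integer arithmetic. In this scheme, a number in the range $[-2^{p-1}, 2^{p-1} - 1]$ is represented by the smallest nonnegative number that is congruent to it modulo $2^p$.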

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if $\bar{k}$ is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is $\bar{k} - 127$. This is often called the unbiased exponent to distinguish it from the biased exponent $\bar{k}$. Single precision has $e_{max} = 127$ and $e_{min} = -126$. The reason for having $|e_{min}| < e_{max}$ is so that the reciprocal of the smallest number will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow.
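A minimal sketch of the single precision layout (1 sign bit, 8 biased exponent bits, 23 stored significand bits plus a hidden bit) is shown below. It assumes Python's `struct` module for access to the bit pattern; the function names are illustrative, and `value_from_fields` handles only normalized numbers.

```python
import struct

def decode_single(x):
    """Return (sign, biased_exponent, fraction) for x rounded to single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction

def value_from_fields(sign, biased_exponent, fraction):
    """Rebuild a normalized value: hidden bit 1, exponent = biased exponent - 127."""
    significand = 1.0 + fraction / 2.0**23
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)

print(decode_single(1.0))      # biased exponent 127 means unbiased exponent 0
print(decode_single(0.15625))  # 1.25 * 2**-3
```

Zero decodes to a biased exponent of 0, which corresponds to the reserved exponent $e_{min} - 1 = -127$ rather than to a normalized number.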

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number using round to even. The section Guard Digits pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different.

That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg ].

Thus the standard can be implemented efficiently. One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point.

Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified.

Brown [] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations , which require features not present on all hardware.

Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.

There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [] have proposed adding inner product to the list of operations that are precisely specified.

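They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example, sums are a special case of inner products, and the sum $((2 \times 10^{-30} + 10^{30}) - 10^{30}) - 10^{-30}$ is exactly equal to $10^{-30}$, but on a machine with IEEE arithmetic the computed result will be $-10^{-30}$. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch ]. All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary.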
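How badly a plain left-to-right sum can go wrong is easy to demonstrate. The sketch below uses an illustrative four-term sum, assumes Python floats are IEEE doubles, and compares the naive result with the exact rational sum and with `math.fsum`, which returns the exact sum rounded once.

```python
import math
from fractions import Fraction

terms = [2e-30, 1e30, -1e30, -1e-30]

naive = 0.0
for t in terms:
    naive += t  # every partial sum is rounded to double precision

exact = sum(Fraction(t) for t in terms)  # exact rational arithmetic
rounded_once = math.fsum(terms)          # exact sum, rounded a single time

print(naive, float(exact), rounded_once)
```

The naive loop gets both the magnitude and the sign of the tiny result wrong, because the small first term is absorbed when the large one is added.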

The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen ]. The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma.

To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically.
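The table maker's dilemma for exp(1.626) can be watched directly by computing the value at increasing precision. This sketch assumes Python's `decimal` module, whose `exp` is documented as correctly rounded for the requested precision; the chosen precisions are arbitrary.

```python
from decimal import Context, Decimal

x = Decimal("1.626")

# exp(x) correctly rounded to `digits` significant decimal digits.
approximations = {
    digits: Context(prec=digits).exp(x) for digits in (4, 5, 6, 7, 8, 12)
}

for digits, value in approximations.items():
    print(digits, value)
```

Only at higher precision does it become visible that the true value lies just below 5.0835, so the 4-digit table entry must round down.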

But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.

On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands. The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities.

Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. On a machine where every bit pattern represents a valid number, the return value of square root must be some floating-point number. In IEEE arithmetic, a NaN is returned in this situation. Traditionally, the computation of 0/0 or $\sqrt{-1}$ has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search.

That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain.

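The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or $\sqrt{-1}$, and the computation would halt, unnecessarily aborting the zero finding process. This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and $\sqrt{-1}$ produce NaN, rather than halting. Then when zero(f) probes outside the domain of f, the code for f will return NaN, and the zero finder can continue. That is, zero(f) is not "punished" for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of f is return (-b + sqrt(d))/(2*a). If d < 0, then f should return a NaN. Since d < 0, sqrt(d) is a NaN, and -b + sqrt(d) will be a NaN, if the sum of a NaN and any other number is a NaN. Similarly if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.

Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals.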
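A toy version of a NaN-tolerant zero finder is sketched below. Because Python's `math.sqrt` raises an exception for negative arguments instead of returning NaN, the helper `nan_sqrt` is an assumption introduced here to mimic IEEE default behavior; `zero`, its probe list, and the bisection loop are likewise illustrative rather than a method prescribed by the text.

```python
import math

def nan_sqrt(x):
    """Square root that returns NaN outside its domain, as IEEE arithmetic would."""
    return math.sqrt(x) if x >= 0 else math.nan

def f(x):
    return nan_sqrt(x) - 2.0  # defined only for x >= 0, zero at x = 4

def zero(func, probes):
    """Scan probe points, skipping NaN results, then bisect the first sign change."""
    prev_x = prev_y = None
    for x in probes:
        y = func(x)
        if math.isnan(y):          # probe fell outside the domain: just move on
            prev_x = prev_y = None
            continue
        if y == 0.0:
            return x
        if prev_y is not None and (prev_y < 0) != (y < 0):
            lo, hi, f_lo = prev_x, x, prev_y
            for _ in range(100):
                mid = (lo + hi) / 2
                f_mid = func(mid)
                if f_mid == 0.0:
                    return mid
                if (f_mid < 0) == (f_lo < 0):
                    lo, f_lo = mid, f_mid
                else:
                    hi = mid
            return (lo + hi) / 2
        prev_x, prev_y = x, y
    return math.nan

root = zero(f, [-8.0, -3.0, 1.0, 9.0])
print(root)
```

The first two probes lie outside the domain of `f`, yet the search continues and finds the zero without the caller ever specifying an interval on which `f` is defined.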

The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent $e_{max} + 1$ and nonzero significands. Implementations are free to put system-dependent information into the significand.

Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated.

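Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.

Just as NaNs provide a way to continue a computation when expressions like 0/0 or $\sqrt{-1}$ are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number. As an example, consider computing $\sqrt{x^2 + y^2}$ when $\beta = 10$, p = 3, and $e_{max} = 98$. If $x = 3 \times 10^{70}$ and $y = 4 \times 10^{70}$, then $x^2$ will overflow, and be replaced by $9.99 \times 10^{98}$. Similarly $y^2$ and $x^2 + y^2$ will each overflow in turn, and be replaced by $9.99 \times 10^{98}$. So the final result will be $\sqrt{9.99 \times 10^{98}} = 3.16 \times 10^{49}$, which is drastically wrong: the correct answer is $5 \times 10^{70}$. In IEEE arithmetic, the result of $x^2$ is $\infty$, as is $y^2$, $x^2 + y^2$ and $\sqrt{x^2 + y^2}$. So the final result is $\infty$, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.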
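The same effect appears in double precision with larger inputs. This sketch assumes Python floats are IEEE doubles and uses illustrative values; `math.hypot` is shown only as an example of a routine that avoids the intermediate overflow.

```python
import math

x, y = 3.0e200, 4.0e200

squared_sum = x * x + y * y    # each square overflows to infinity
naive = math.sqrt(squared_sum) # so the naive formula reports infinity
scaled = math.hypot(x, y)      # library routine that avoids the overflow

print(naive, scaled)
```

The infinite result is clearly a signal that something overflowed, whereas a saturated finite value would look like a plausible but badly wrong answer.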

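The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: $1/0 = \infty$, $-1/0 = -\infty$. You can distinguish between getting $\infty$ because of overflow and getting $\infty$ because of division by zero by checking the status flags (which will be discussed in detail in section Flags). The overflow flag will be set in the first case, the division by zero flag in the second. The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number x and take the limit as $x \to \infty$. Thus $3/\infty = 0$, because $\lim_{x \to \infty} 3/x = 0$. Similarly, $4 - \infty = -\infty$, and $\sqrt{\infty} = \infty$. When the limit doesn't exist, the result is a NaN, so $\infty/\infty$ will be a NaN.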

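When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN. Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function $x/(x^2 + 1)$. This is a bad formula, because not only will it overflow when x is large, but infinity arithmetic will give the wrong answer because it will yield 0, rather than a number near 1/x. However, $x/(x^2 + 1)$ can be rewritten as $1/(x + x^{-1})$. This improved expression will not overflow prematurely and, because of infinity arithmetic, will have the correct value when x = 0: $1/(0 + 0^{-1}) = 1/(0 + \infty) = 1/\infty = 0$.

Zero is represented by the exponent $e_{min} - 1$ and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and -0. The IEEE standard defines comparison so that +0 = -0, rather than -0 < +0. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3·(+0) = +0, and +0/-3 = -0. If zero did not have a sign, then the relation 1/(1/x) = x would fail to hold when $x = \pm\infty$. Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log.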

Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return $-\infty$. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number. Probably the most interesting use of signed zero occurs in complex arithmetic.
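The sign rules for zero, and the way the sign selects a side of the square root branch cut in complex arithmetic, can be observed with a few lines. The sketch assumes Python floats follow IEEE 754 and that `cmath.sqrt` honors the sign of a zero imaginary part, as CPython's implementation does; the variable names are illustrative.

```python
import cmath
import math

pos, neg = 0.0, -0.0

zeros_compare_equal = (pos == neg)                   # +0 and -0 compare equal
sign_of_product = math.copysign(1.0, 3.0 * neg)      # 3 * (-0) keeps the sign: -0
sign_of_quotient = math.copysign(1.0, pos / -3.0)    # +0 / -3 is -0

# On the negative real axis, the sign of the zero imaginary part picks the branch.
above_cut = cmath.sqrt(complex(-1.0, 0.0))
below_cut = cmath.sqrt(complex(-1.0, -0.0))

print(zeros_compare_equal, sign_of_product, sign_of_quotient, above_cut, below_cut)
```

`math.copysign` is used to inspect the sign because an ordinary comparison cannot tell the two zeros apart.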

To take a simple example, consider the equation $\sqrt{1/z} = 1/\sqrt{z}$. This is certainly true when $z \ge 0$. If z = -1, the obvious computation gives $\sqrt{1/(-1)} = \sqrt{-1} = i$ and $1/\sqrt{-1} = 1/i = -i$. Thus $\sqrt{1/z} \ne 1/\sqrt{z}$. The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration.

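Signed zero provides a perfect way to resolve this problem. Numbers of the form $x + i(+0)$ have one sign and numbers of the form $x + i(-0)$ on the other side of the branch cut have the other sign. In fact, the natural formulas for computing $\sqrt{z}$ will give these results. Back to $\sqrt{1/z} = 1/\sqrt{z}$. If $z = -1 = -1 + i0$, then $1/z = 1/(-1 + i0) = -1 + i(-0)$, and so $\sqrt{1/z} = \sqrt{-1 + i(-0)} = -i$, while $1/\sqrt{z} = 1/i = -i$. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan []. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation $x = y \Leftrightarrow 1/x = 1/y$, which is false when x = +0 and y = -0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.

How important is it to preserve the property $x = y \Leftrightarrow x - y = 0$ (10)?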

It is very easy to imagine writing the code fragment if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend.

Similarly, knowing that (10), $x = y \Leftrightarrow x - y = 0$, is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything. The IEEE standard uses denormalized numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.

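The exponent $e_{min}$ is used to represent denormals. More formally, if the bits in the significand field are $b_1, b_2, \ldots, b_{p-1}$, and the value of the exponent is e, then when $e > e_{min} - 1$, the number being represented is $1.b_1 b_2 \ldots b_{p-1} \times 2^e$, whereas when $e = e_{min} - 1$, the number being represented is $0.b_1 b_2 \ldots b_{p-1} \times 2^{e+1}$. The +1 in the exponent is needed because denormals have an exponent of $e_{min}$, not $e_{min} - 1$. As an example, with $\beta = 10$, p = 3 and $e_{min} = -98$, the numbers $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ differ, yet $x - y = .06 \times 10^{-97}$ is too small to be represented as a normalized number and would have to be flushed to zero. With denormals, x - y does not flush to zero but is instead represented by the denormalized number $.6 \times 10^{-98}$. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.

The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers.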
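Gradual underflow can be observed in double precision. The sketch assumes a platform where Python floats are IEEE doubles with denormals enabled (not flushed to zero), and picks two illustrative values just above the smallest normalized number.

```python
import math
import sys

tiny = sys.float_info.min  # smallest positive normalized double, 2**-1022
x = 1.25 * tiny
y = tiny

diff = x - y               # 2**-1024: below the normalized range, so a denormal

print(x != y, diff != 0.0, diff)
```

Because the difference is representable as a denormal, `x != y` and `x - y != 0` agree, which is exactly property (10).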

The "gulf" is filled in, and when the result of a calculation is less than $1.0 \times \beta^{e_{min}}$, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of $\beta$. Without denormals, the spacing abruptly changes from $\beta^{-p+1}\beta^{e_{min}}$ to $\beta^{e_{min}}$, which is a factor of $\beta^{p-1}$, rather than the orderly change by a factor of $\beta$. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

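Large relative errors can happen even without cancellation, as the following example shows [Demmel ]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula

$$\frac{a + ib}{c + id} = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}$$

suffers from the problem that if either component of the denominator c + id is large enough, the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith's formula:

$$\frac{a + ib}{c + id} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + i\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |d| < |c| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + i\,\dfrac{-a + b(c/d)}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases} \qquad (11)$$

Applying Smith's formula to $(2 \cdot 10^{-98} + i10^{-98})/(4 \cdot 10^{-98} + i(2 \cdot 10^{-98}))$ gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to $1.0 \times \beta^{e_{min}}$.

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do.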
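Smith's formula for complex division, referred to above, can be sketched as follows and compared with the textbook formula on operands whose squares overflow. The function names and the test operands are illustrative assumptions; results are returned as (real, imaginary) tuples of Python floats.

```python
import math

def naive_div(a, b, c, d):
    """(a + ib) / (c + id) via the textbook formula; c*c + d*d may overflow."""
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom

def smith_div(a, b, c, d):
    """(a + ib) / (c + id) via Smith's formula, scaling by the larger of |c|, |d|."""
    if abs(d) < abs(c):
        r = d / c
        denom = c + d * r
        return (a + b * r) / denom, (b - a * r) / denom
    r = c / d
    denom = d + c * r
    return (b + a * r) / denom, (-a + b * r) / denom

big = (1e300, 1e300, 2e300, 2e300)   # (1e300 + i 1e300) / (2e300 + i 2e300) = 0.5
print(naive_div(*big), smith_div(*big))
```

The naive version produces NaN here because both numerator and denominator overflow, while the scaled version stays in range.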

When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are "sticky" in that once set, they remain set until explicitly cleared. Sometimes continuing execution in the face of exception conditions is not appropriate.

The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined. The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact.

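There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers situations such as 0/0, $\infty - \infty$, and the square root of a negative number, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions for an invalid operation. The inexact exception is raised when the result of a floating-point operation is not exact. In the $\beta = 10$, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 · 4.3 = 15.05), and raises an inexact exception. Binary to Decimal Conversion discusses an algorithm that uses the inexact exception. There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive.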
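Sticky status flags can be explored with Python's `decimal` module, which is a decimal arithmetic implementation with its own flags and traps rather than an IEEE 754 binary unit; it is used here only as an accessible stand-in, with a 3-digit context chosen for illustration.

```python
from decimal import Context, Decimal, Inexact

ctx = Context(prec=3)  # 3-digit decimal arithmetic; Inexact is not trapped by default

exact_product = ctx.multiply(Decimal("3.5"), Decimal("4.2"))    # 14.7, exact
flag_after_exact = ctx.flags[Inexact]

inexact_product = ctx.multiply(Decimal("3.5"), Decimal("4.3"))  # 15.05 rounds to 15.0
flag_after_inexact = ctx.flags[Inexact]

ctx.multiply(Decimal("2"), Decimal("3"))                        # exact, flag stays set
flag_still_set = ctx.flags[Inexact]

ctx.clear_flags()                                               # explicit clear
flag_after_clear = ctx.flags[Inexact]

print(flag_after_exact, flag_after_inexact, flag_still_set, flag_after_clear)
```

The flag is raised by the first inexact operation and stays raised through later exact operations until it is cleared explicitly.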

This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled. One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process.

There is a more interesting use for trap handlers that comes up when computing products such as $\prod_{i=1}^{n} x_i$ that could potentially overflow. One solution is to use logarithms, and compute $\exp(\sum \log x_i)$ instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression $\prod x_i$, even if there is no overflow. There is another solution using trap handlers, called over/underflow counting, that avoids both of these problems. The idea is as follows.

There is a global counter initialized to zero. Whenever the partial product $p_k = \prod_{i=1}^{k} x_i$ overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. Similarly, if $p_k$ underflows, the counter would be decremented, and the negative exponent would get wrapped around into a positive one. When all the multiplications are done, if the counter is zero then the final product is $p_n$. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost.
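Python does not expose floating-point trap handlers, so the following sketch substitutes a different technique with the same goal: it keeps the running product as a significand and a separate integer exponent using `math.frexp`, renormalizing after every step. This is a software emulation under stated assumptions, not the trap-based over/underflow counting scheme itself, and the function name is hypothetical.

```python
import math

def scaled_product(values):
    """Return (m, e) with the product equal to m * 2**e, avoiding overflow/underflow."""
    m, e = 1.0, 0
    for v in values:
        vm, ve = math.frexp(v)   # v == vm * 2**ve with 0.5 <= |vm| < 1
        m *= vm
        e += ve
        m, shift = math.frexp(m) # renormalize so m never leaves [0.5, 1)
        e += shift
    return m, e

values = [1e200, 1e200, 1e-200, 1e-200]

naive = 1.0
for v in values:
    naive *= v                   # overflows to infinity after the second factor

m, e = scaled_product(values)
print(naive, math.ldexp(m, e))
```

Unlike the trap-based scheme, this version pays the renormalization cost on every multiplication, even when nothing would have overflowed.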

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by $2^\alpha$, and then rounded to the relevant precision. For underflow, the result is multiplied by $2^\alpha$. The exponent $\alpha$ is 192 for single precision and 1536 for double precision. For example, a single precision result of $1.45 \times 2^{130}$ that overflows is wrapped around to $1.45 \times 2^{-62}$.

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary decimal conversion) each operation is computed exactly and then rounded.

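By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward $+\infty$, and round toward $-\infty$. One application of rounding modes occurs in interval arithmetic (another is mentioned in Binary to Decimal Conversion). When using interval arithmetic, the sum of two numbers x and y is an interval $[\underline{z}, \overline{z}]$, where $\underline{z}$ is $x \oplus y$ rounded toward $-\infty$, and $\overline{z}$ is $x \oplus y$ rounded toward $+\infty$. The exact result of the addition is contained within the interval $[\underline{z}, \overline{z}]$. Without rounding modes, interval arithmetic is usually implemented by computing $\underline{z} = (x \oplus y)(1 - \epsilon)$ and $\overline{z} = (x \oplus y)(1 + \epsilon)$, where $\epsilon$ is machine epsilon. This results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. If two intervals $[\underline{x}, \overline{x}]$ and $[\underline{y}, \overline{y}]$ are added, the result is $[\underline{z}, \overline{z}]$, where $\underline{z}$ is $\underline{x} \oplus \underline{y}$ with the rounding mode set to round toward $-\infty$, and $\overline{z}$ is $\overline{x} \oplus \overline{y}$ with the rounding mode set to round toward $+\infty$. When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation.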
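Interval addition with directed rounding can be sketched with Python's `decimal` module, whose contexts offer round toward $-\infty$ (`ROUND_FLOOR`) and round toward $+\infty$ (`ROUND_CEILING`). The 3-digit precision, the tuple representation of intervals, and the name `interval_add` are assumptions made for illustration.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

DOWN = Context(prec=3, rounding=ROUND_FLOOR)    # round toward -infinity
UP = Context(prec=3, rounding=ROUND_CEILING)    # round toward +infinity

def interval_add(x, y):
    """Add intervals given as (low, high) pairs, rounding outward."""
    (x_lo, x_hi), (y_lo, y_hi) = x, y
    return DOWN.add(x_lo, y_lo), UP.add(x_hi, y_hi)

a = (Decimal("1.00"), Decimal("1.00"))
b = (Decimal("0.00555"), Decimal("0.00555"))
lo, hi = interval_add(a, b)
print(lo, hi)
```

The exact sum 1.00555 cannot be held in 3 digits, but it is guaranteed to lie between the two returned endpoints.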

This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.

The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use. A more sophisticated example is discussed in the section Binary to Decimal Conversion.

Consider writing a subroutine to compute $x^n$, where n is an integer.

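When n > 0, a simple routine, PositivePower(x, n), computes the power by repeated squaring. If n < 0, then a more accurate way to compute $x^n$ is not to call PositivePower(1/x, -n) but rather 1/PositivePower(x, -n), because the first expression multiplies n quantities each of which has a rounding error from the division (i.e., 1/x). In the second expression these are exact (i.e., x), and the final division commits just one additional rounding error. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, -n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if $x^{-n}$ underflows, then $x^n$ will either overflow or be in range. But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits. It then computes 1/PositivePower(x, -n). If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits. If one of the status bits is set, it restores the flags and redoes the calculation using PositivePower(1/x, -n), which causes the correct exceptions to occur.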
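A repeated-squaring routine of the kind discussed above might look like the following sketch. The names `positive_power` and `power` are illustrative, and the flag saving and restoring described in the text is not shown, because Python floats do not expose the IEEE status flags.

```python
def positive_power(x, n):
    """Compute x**n for an integer n > 0 by repeated squaring."""
    while n % 2 == 0:      # strip factors of two from the exponent
        x = x * x
        n //= 2
    u = x
    while True:
        n //= 2
        if n == 0:
            return u
        x = x * x
        if n % 2 == 1:     # this bit of the exponent is set: fold x into the result
            u = u * x

def power(x, n):
    """Compute x**n for a nonzero integer n, dividing once for negative n."""
    if n > 0:
        return positive_power(x, n)
    return 1.0 / positive_power(x, -n)   # one rounding error from the final division

print(power(2.0, 10), power(2.0, -3))
```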

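Another example of the use of flags occurs when computing arccos via the formula $\arccos x = 2 \arctan \sqrt{(1 - x)/(1 + x)}$. If $\arctan(\infty)$ evaluates to $\pi/2$, then arccos(-1) will correctly evaluate to $2 \cdot \arctan(\infty) = \pi$, because of infinity arithmetic. However, there is a small snag, because the computation of (1 - x)/(1 + x) will cause the divide by zero exception flag to be set, even though arccos(-1) is not exceptional. The solution to this problem is straightforward. Simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation.

The design of almost every aspect of a computer system requires knowledge about floating-point.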

Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider a BASIC program that assigns q = 3.0/7.0 and then tests whether q = 3.0/7.0, printing "Equal" if the test succeeds and "Not Equal" otherwise. When compiled and run using Borland's Turbo Basic on an IBM PC, the program prints Not Equal!

This example will be analyzed in the next section.

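Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E. This is hardly a cure-all because it raises as many questions as it answers. What should the value of E be? If x < 0 and y > 0 are within E, should they really be considered to be equal, even though they have different signs? Furthermore, the relation defined by this rule, $a \sim b \Leftrightarrow |a - b| < E$, is not an equivalence relation, because $a \sim b$ and $b \sim c$ does not imply that $a \sim c$.

It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. One example occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$.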

As discussed in the section Proof of Theorem 4, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of $b^2 - 4ac$ in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved. The computation of $b^2 - 4ac$ in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result.

In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.

If an instruction that combines two single precision operands to produce a double precision product was only useful for the quadratic formula, it wouldn't be worth adding to an instruction set. However, this instruction has many other uses.

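Consider the problem of solving a system of linear equations, written in matrix form as $Ax = b$. Suppose that a solution $x^{(1)}$ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute

$$\xi = Ax^{(1)} - b \tag{12}$$

and then solve the system

$$Ay = \xi \tag{13}$$

Note that if $x^{(1)}$ is an exact solution, then $\xi$ is the zero vector, as is $y$. In general, the computation of $\xi$ and $y$ will incur rounding error, so $Ay \approx \xi \approx A(x^{(1)} - x) = Ax^{(1)} - b$, where $x$ is the (unknown) true solution. Then $y \approx x^{(1)} - x$, so an improved estimate for the solution is

$$x^{(2)} = x^{(1)} - y \tag{14}$$

The three steps (12), (13), and (14) can be repeated, replacing $x^{(1)}$ with $x^{(2)}$, and $x^{(2)}$ with $x^{(3)}$.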

For more information, see [Golub and Van Loan ].
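The three steps of iterative improvement can be sketched for a 2x2 system. Everything here is an illustrative assumption: the solver uses Cramer's rule rather than Gaussian elimination, single precision is simulated with a hypothetical `to_single` helper, and the matrix is an arbitrary well-conditioned example. The residual is the only quantity formed in double precision.

```python
import struct

def to_single(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve2_single(A, rhs):
    """Solve a 2x2 system by Cramer's rule, rounding each step to single."""
    (a, b), (c, d) = A
    det = to_single(to_single(a * d) - to_single(b * c))
    x0 = to_single(to_single(to_single(rhs[0] * d) - to_single(b * rhs[1])) / det)
    x1 = to_single(to_single(to_single(a * rhs[1]) - to_single(rhs[0] * c)) / det)
    return [x0, x1]

def residual_double(A, x, rhs):
    """Step (12): A*x - rhs, with products and sums kept in double precision."""
    return [A[i][0] * x[0] + A[i][1] * x[1] - rhs[i] for i in range(2)]

def improve(A, rhs, x):
    """One round of iterative improvement on a single-precision solution x."""
    xi = [to_single(r) for r in residual_double(A, x, rhs)]   # step (12)
    y = solve2_single(A, xi)                                  # step (13)
    return [to_single(x[0] - y[0]), to_single(x[1] - y[1])]   # step (14)

A = [[4.0, 1.0], [2.0, 3.0]]
rhs = [1.0, 2.0]          # the exact solution is (0.1, 0.6)
x1 = solve2_single(A, rhs)
x2 = improve(A, rhs, x1)
print(x1, x2)
```

For a system this well conditioned the first solve is already close to the best single precision answer, so the correction is tiny. The structure of the loop is the point of the sketch, not the size of the gain.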

When performing iterative improvement, $\xi$ is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus iterative improvement is not very useful unless $\xi = Ax^{(1)} - b$ is computed in double precision. Once again, this is a case of computing the product of two single precision numbers ($A$ and $x^{(1)}$), where the full double precision result is needed.

To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section.

The interaction of compilers and floating-point is discussed in Farnum [], and much of the discussion in this section is taken from that paper.

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs.

While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error.

If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.

Remarkably enough, some languages don't clearly specify that if x is a floating-point variable (with say a value of 3.0/10.0), then every occurrence of (say) 10.0*x must have the same value. For example Ada, which is based on Brown's model, seems to imply that floating-point arithmetic only has to satisfy Brown's axioms, and thus expressions can have one of many possible values. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined.

In the IEEE model, we can prove that (3.0/10.0)*10.0 evaluates to 3. In Brown's model, we cannot.

Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point.
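As a concrete instance of a statement that is provable under IEEE arithmetic, the following sketch divides 3 by 10 and multiplies back. It assumes Python floats, which are IEEE double precision with round to nearest; the variable names are illustrative.

```python
x = 3.0 / 10.0        # not exactly 0.3, but the correctly rounded quotient
product = x * 10.0    # the rounding of this product lands exactly on 3.0

print(product == 3.0)
```

Because each operation has a single well-defined result, this outcome is the same on every conforming implementation.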

Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. The importance of preserving parentheses cannot be overemphasized. The algorithms presented in theorems 3, 4 and 6 all depend on it. A language definition that does not require parentheses to be honored is useless for floating-point calculations.
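A short sketch of why parentheses matter, using illustrative values chosen so that one grouping cancels the large terms first and the other absorbs the small term into a large one.

```python
x = 1e30
y = -1e30
z = 1.0

left = (x + y) + z    # x + y cancels exactly, then z is added: 1.0
right = x + (y + z)   # z is lost when added to y, then the sum cancels: 0.0

print(left, right)
```

A compiler that regroups either expression into the other changes the answer from 1 to 0.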

Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y, is the product performed in single or double precision? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks.

First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants. In the expression 0.1*x, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. The programmer will have to hunt down and change every floating-point constant.

The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided. There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie ]. This leads to anomalies like the example at the beginning of this section. The expression 3.0/7.0 is computed in double precision, but if q is a single precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision differs from its stored value in single precision, and the comparison fails. This suggests that computing every expression in the highest precision available is not a good rule.

Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial.

One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in the section Optimizers). If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable. A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression.
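The loss described above can be sketched by accumulating an inner product in double precision twice: once with each product rounded to single precision first, once with the full product. Single precision is simulated with a hypothetical `to_single` helper, and the vectors are illustrative values whose products nearly cancel.

```python
import struct

def to_single(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_single_products(xs, ys):
    """Double-precision accumulator, but each product is rounded to single."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += to_single(x * y)
    return total

def dot_double_products(xs, ys):
    """Double-precision accumulator fed with full double-precision products."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# Every element is exactly representable in single precision.
xs = [1.0 + 2.0 ** -13, 1.0]
ys = [1.0 + 2.0 ** -13, -(1.0 + 2.0 ** -12)]

print(dot_single_products(xs, ys))
print(dot_double_products(xs, ys))
```

With these inputs the first version returns zero, because the only bit that distinguishes the two products was discarded before the accumulator saw it.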

However, this rule is too simplistic to cover all cases cleanly. A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree. Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. Farnum [] presents evidence that this algorithm is not difficult to implement. The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded.
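The two-pass rule can be sketched on a small expression tree. This is one reading of the rule, under the assumption that the second pass raises each operation to the maximum of its tentative precision and the precision its parent expects. The class and function names are hypothetical.

```python
SINGLE, DOUBLE = 1, 2  # hypothetical precision ranks; larger means more precise

class Leaf:
    """A variable or constant with a declared precision."""
    def __init__(self, name, prec):
        self.name = name
        self.prec = prec

class Op:
    """A binary operation whose precision is filled in by the two passes."""
    def __init__(self, symbol, left, right):
        self.symbol = symbol
        self.left = left
        self.right = right
        self.prec = None

def tentative_pass(node):
    """Leaves to root: each operation gets the max precision of its operands."""
    if isinstance(node, Leaf):
        return node.prec
    node.prec = max(tentative_pass(node.left), tentative_pass(node.right))
    return node.prec

def promote_pass(node, expected):
    """Root to leaves: raise each operation to what its parent expects."""
    if isinstance(node, Leaf):
        return
    node.prec = max(node.prec, expected)
    promote_pass(node.left, node.prec)
    promote_pass(node.right, node.prec)

def assign_precisions(root):
    promote_pass(root, tentative_pass(root))

# ds + x*y with ds double and x, y single: the multiply is promoted to double.
mul = Op("*", Leaf("x", SINGLE), Leaf("y", SINGLE))
add = Op("+", Leaf("ds", DOUBLE), mul)
assign_precisions(add)

# 3.0/7.0 with single constants stays in single precision.
div = Op("/", Leaf("3.0", SINGLE), Leaf("7.0", SINGLE))
assign_precisions(div)
```

The multiply's final precision depends on the addition above it, which is exactly the context dependence named as the rule's disadvantage.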

This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1, which are not exactly representable in binary.

Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen ]. One definition might be to use the method shown in section Infinity. For example, to determine the value of $a^b$, consider non-constant analytic functions $f$ and $g$ with the property that $f(x) \to a$ and $g(x) \to b$ as $x \to 0$.

If $f(x)^{g(x)}$ always approaches the same limit, then this should be the value of $a^b$. In the case of $1.0^\infty$, when $f(x) = 1$ and $g(x) = 1/x$ the limit approaches 1, but when $f(x) = 1 - x$ and $g(x) = 1/x$ the limit is $e^{-1}$, so $1.0^\infty$ should be a NaN.

The IEEE standard, however, says nothing about how its features are to be accessed from a programming language. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine.

However, other aspects of the standard are not so easily implemented as subroutines. For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example. Constants to represent $\pm\infty$ could be supplied by a subroutine. But that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.

A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful. As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine.
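The read, modify, restore pattern can be sketched in Python. CPython's standard library does not expose the rounding mode of binary floating-point, so this sketch uses the rounding mode of the `decimal` module's context as a stand-in for IEEE state. The function name is hypothetical.

```python
import decimal

def divide_rounding_down(a, b):
    """Divide with rounding toward minus infinity, then restore the old mode."""
    ctx = decimal.getcontext()
    saved = ctx.rounding                   # read the old state
    ctx.rounding = decimal.ROUND_FLOOR     # write the new state
    try:
        return decimal.Decimal(a) / decimal.Decimal(b)
    finally:
        ctx.rounding = saved               # restore on every exit, even on error

print(divide_rounding_down(2, 3))
print(decimal.Decimal(2) / decimal.Decimal(3))
```

The `try`/`finally` is what guarantees restoration when the division raises. Without it, each exit path would have to restore the mode by hand.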

Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson ].

There are a number of minor points that need to be considered when implementing the IEEE standard in a language. Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example, when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.

Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can't use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.
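A max operation that ignores NaNs, as suggested for the plotting example, might look like the sketch below. The function name is hypothetical, and returning NaN when no ordinary value is present is a design choice of this sketch rather than something the text prescribes.

```python
import math

def max_ignoring_nans(values):
    """Largest non-NaN value, or NaN if every value is NaN (or there are none)."""
    best = math.nan
    for v in values:
        if math.isnan(v):
            continue                       # skip missing or invalid data points
        if math.isnan(best) or v > best:
            best = v
    return best

print(max_ignoring_nans([1.0, math.nan, 3.0, 2.0]))
```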

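Compiler texts tend to ignore the subject of floating-point. For example Aho et al. mentions replacing x/2.0 with x*0.5, leading the reader to assume that x/10.0 should be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary.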

Although the text does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important. Consider the loop `eps = 1; do eps = 0.5*eps; while (eps + 1 > 1);`, which is designed to give an estimate for machine epsilon. If an optimizing compiler notices that eps + 1 > 1 holds exactly when eps > 0 for real numbers, the program will be changed completely. Avoiding this kind of "optimization" is so important that it is worth presenting one more very useful algorithm that is totally ruined by it.
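The damage done by such rewrites can be sketched directly. The first function is a loop of the usual form for estimating machine epsilon. The second is the same loop after the test `eps + 1 > 1` has been "simplified" to `eps > 0`, which is valid for real numbers but not for floating-point. Both function names are hypothetical, and Python floats (IEEE doubles) are assumed.

```python
def epsilon_as_written():
    """Halve eps until adding it to 1 no longer changes the sum."""
    eps = 1.0
    while True:
        eps = 0.5 * eps
        if not (eps + 1.0 > 1.0):
            return eps

def epsilon_after_rewrite():
    """The same loop with the test replaced by the 'equivalent' eps > 0."""
    eps = 1.0
    while True:
        eps = 0.5 * eps
        if not (eps > 0.0):
            return eps

print(epsilon_as_written())     # a value on the order of machine epsilon
print(epsilon_after_rewrite())  # runs through the subnormals down to zero

# Replacing a division by a multiplication by the reciprocal is similar:
x = 3.0
print(x / 10.0 == 0.1 * x)      # False, because 0.1 is not exact in binary
```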

Many problems, such as numerical integration and the numerical solution of differential equations, involve computing sums with many terms. Because each addition can potentially introduce an error as large as .5 ulp, a sum involving thousands of terms can have quite a bit of rounding error. A simple way to correct for this is to store the partial summand in a double precision variable and to perform each addition using double precision.

If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. However, there is a much more efficient method which dramatically improves the accuracy of sums, namely the Kahan summation formula. Comparing the error of simple summation with the error in the Kahan summation formula shows a dramatic improvement.

Each summand is perturbed by only $2\epsilon$, instead of perturbations as large as $n\epsilon$ in the simple formula. Details are in the section Errors In Summation.

These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables.
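A minimal sketch of Kahan summation next to simple summation, assuming Python floats (IEEE doubles). The naive loop is written out by hand rather than using the built-in `sum`, since some Python versions apply their own compensation there. The input is an illustrative case in which every small term is individually too small to change the running total.

```python
def naive_sum(xs):
    """Simple left-to-right summation."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

def kahan_sum(xs):
    """Compensated summation: c carries the low-order bits lost from s."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c          # apply the correction from the previous step
        t = s + y          # low-order bits of y may be lost here
        c = (t - s) - y    # recover what was lost; algebraically this is zero
        s = t
    return s

xs = [1.0] + [2.0 ** -53] * 4

print(naive_sum(xs))   # each tiny term is rounded away in turn
print(kahan_sum(xs))   # the tiny terms accumulate in c and reach the sum
```

An optimizer that applied the laws of real algebra would conclude that `c` is always zero and remove it, turning the second function into the first.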

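Another way that optimizers can change the semantics of floating-point code involves constants. In the expression 1.0E-40*x, there is an implicit decimal to binary conversion operation that converts the decimal number to a binary constant. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should be set if the expression is evaluated in single precision. Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts 1.0E-40 to binary at compile time would be changing the semantics of the program. However, constants like 27.5, which are exactly representable in the smallest available precision, can be safely converted at compile time.

Despite these examples, there are useful optimizations that can be done on floating-point code.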

First of all, there are algebraic identities that are valid for floating-point numbers. Some examples in IEEE arithmetic are $x + y = y + x$, $2 \times x = x + x$, $1 \times x = x$, and $0.5 \times x = x/2$. However, even these simple identities can fail on a few machines such as CDC and Cray supercomputers. Instruction scheduling and in-line procedure substitution are two other potentially useful optimizations.
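Identities of this safe kind can be spot-checked on an IEEE machine. The sketch below assumes Python floats and tests four such identities (commutativity of addition, doubling, multiplication by one, and halving) on a few illustrative values, including a huge number and the smallest subnormal. A spot check is of course not a proof.

```python
def identities_hold(x, y):
    """Check four identities that are exact under IEEE arithmetic (no NaNs)."""
    return (
        x + y == y + x
        and 2.0 * x == x + x
        and 1.0 * x == x
        and 0.5 * x == x / 2.0
    )

samples = [0.1, 3.0, -2.5, 1e308, 5e-324]
print(all(identities_hold(x, y) for x in samples for y in samples))
```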

Perhaps some compiler writers have in mind that floating-point numbers model real numbers and should obey the same laws that real numbers do.

The problem with real number semantics is that they are extremely expensive to implement. Every time two $n$ bit numbers are multiplied, the product will have $2n$ bits. An algorithm that involves thousands of operations (such as solving a linear system) will soon be operating on numbers with many significant bits, and be hopelessly slow. The implementation of library functions such as sin and cos is even more difficult, because the values of these transcendental functions aren't rational numbers.
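The growth in operand size is easy to observe with exact rational arithmetic. This sketch uses Python's `fractions.Fraction` as a stand-in for real number semantics and squares an illustrative starting value repeatedly, recording the bit length of the denominator after each step.

```python
from fractions import Fraction

x = Fraction(1, 3)
sizes = []
for _ in range(5):
    x = x * x                                  # exact product, nothing rounded
    sizes.append(x.denominator.bit_length())   # storage needed keeps growing

print(sizes)
```

Each squaring roughly doubles the number of bits, so even a handful of exact multiplications outgrows any fixed-size format.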

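Exact integer arithmetic is often provided by Lisp systems and is handy for some problems. However, exact floating-point arithmetic is rarely useful. The fact is that there are useful algorithms (like the Kahan summation formula) that exploit the fact that $(x + y) + z \neq x + (y + z)$, and work whenever the bound $a \oplus b = (a + b)(1 + \delta)$ holds (as well as similar bounds for subtraction, multiplication and division). Since these bounds hold for almost all commercial hardware, it would be foolish for numerical programmers to ignore such algorithms, and it would be irresponsible for compiler writers to destroy these algorithms by pretending that floating-point variables have real number semantics.

The topics discussed up to now have primarily concerned systems implications of accuracy and precision.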

Trap handlers also raise some interesting systems issues. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and the section Trap Handlers gave some applications of user defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands; otherwise, with the exactly rounded result.

Depending on the programming language being used, the trap handler might be able to access other variables in the program as well. For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination. The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines which have pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter.

Hardware support for identifying exactly which operation trapped may be necessary.

Another problem is illustrated by a program fragment such as `x = y*z; z = x*w; a = b + c; d = a/x`. Suppose the second multiply raises an exception, and the trap handler wants to use the value of a. On hardware that can do an add and multiply in parallel, an optimizer would probably move the addition operation ahead of the second multiply, so that the add can proceed in parallel with the first multiply. Thus when the second multiply traps, a = b + c has already been executed, potentially changing the value of a. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated.

This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly. Instead, the handler can be given the operands or result as an argument. But there are still problems. In a fragment such as `x = y*z; z = a + b`, the two instructions might well be executed in parallel. If the multiply traps, its argument z could already have been overwritten by the addition, especially since addition is usually faster than multiply. Computer systems that support the IEEE standard must provide some way to save the value of z, either in hardware or by having the compiler avoid such a situation in the first place.

Kahan has proposed using presubstitution instead of trap handlers to avoid these problems.

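In this method, the user specifies an exception and the value to be used as the result when the exception occurs. As an example, suppose that in code for computing $(\sin x)/x$, the user decides that $x = 0$ is so rare that it would improve performance to avoid a test for $x = 0$, and instead handle this case when a 0/0 trap occurs. Using IEEE trap handlers, the user would write a handler that returns a value of 1 and install it before computing $(\sin x)/x$. Using presubstitution, the user would specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs. The advantage of presubstitution is that it has a straightforward hardware implementation.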
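Presubstitution is a hardware proposal, but its contract can be imitated in software. The sketch below is only an emulation under that assumption: the hypothetical helper `divide_presub` takes the substitute value up front and returns it when the division would be the invalid operation 0/0.

```python
import math

def divide_presub(a, b, presub):
    """Return a / b, or the presubstituted value if the division is 0/0."""
    if a == 0.0 and b == 0.0:
        return presub          # the value was fixed before the exception arose
    return a / b

def sinc(x):
    """sin(x)/x, with 1 presubstituted for the invalid operation at x = 0."""
    return divide_presub(math.sin(x), x, presub=1.0)

print(sinc(0.0), sinc(0.5))
```

The substitute is a plain value rather than a callback, which mirrors the distinction drawn above: a trap handler may compute its result when the trap occurs, presubstitution may not.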

Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.