The present invention generally relates to a parallel computing system. More particularly, the present invention relates to adding a plurality of floating point numbers in the parallel computing system.
IEEE 754 describes floating point number arithmetic. Kahan, “IEEE Standard 754 for Binary Floating-Point Arithmetic,” May 31, 1996, UC Berkeley Lecture Notes on the Status of IEEE 754, wholly incorporated by reference as if set forth herein, describes IEEE Standard 754 in detail.
According to IEEE Standard 754, to perform floating point number arithmetic, some or all floating point numbers are converted to binary numbers. However, the floating point number arithmetic does not need to follow IEEE or any particular standard. Table 1 illustrates IEEE single precision floating point format.
TABLE 1: IEEE single precision floating point number format
Signed bit (S): 1 bit | Exponent field (E): 8 bits | Mantissa field (M): 23 bits
The “signed” bit indicates whether a floating point number is a positive (S=0) or negative (S=1) floating point number. For example, if the signed bit is 0, the floating point number is a positive floating point number. The “exponent” field (E) stores the power of two plus a bias of 127. For example, if a binary number is 10001.001001₂=1.0001001001₂×2⁴, then E becomes 127+4=131₁₀=1000_0011₂. The “mantissa” field (M) represents the fractional part of a floating point number.
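The field layout described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_single` is hypothetical, and the sketch relies on the standard `struct` module to obtain the 32-bit pattern of a value rounded to single precision.

```python
import struct

def decode_single(x: float) -> tuple[int, int, int]:
    """Split x (rounded to single precision) into its (S, E, M) bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1-bit signed field
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field, biased by 127
    mantissa = bits & 0x7FFFFF       # 23-bit mantissa (fractional part)
    return sign, exponent, mantissa
```

For the binary number 10001.001001 used in the example (17.140625 in decimal), `decode_single(17.140625)` yields an exponent field of 131, matching 127+4.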
For example, to add 2.5₁₀ and 4.75₁₀, 2.5₁₀ is first converted to 0x40200000 (in hexadecimal format) as follows:
  1. Convert 2₁₀ to the binary number 10₂, e.g., by using the binary division method.
  2. Convert 0.5₁₀ to the binary number 0.1₂, e.g., by using the multiplication method.
  3. Calculate the exponent and mantissa fields: 10.1₂ is normalized to 1.01₂×2¹. Then, the exponent field becomes 128₁₀, i.e., 127+1, which is equal to 1000_0000₂. The mantissa field becomes 010_0000_0000_0000_0000_0000₂. By combining the signed bit, the exponent field and the mantissa field, a user can obtain 0100_0000_0010_0000_0000_0000_0000_0000₂=0x40200000.
  4. Similarly, the user converts 4.75₁₀ to 0x40980000.
  5. Add 0x40200000 and 0x40980000 as follows:
     a. Determine values of the fields.
        i. 2.5₁₀
           S: 0
           E: 1000_0000₂
           M: 1.01₂
        ii. 4.75₁₀
           S: 0
           E: 1000_0001₂
           M: 1.0011₂
     b. Adjust the number with the smaller exponent to have the maximum exponent (i.e., the largest exponent value among the numbers; in this example, 1000_0001₂). In this example, 2.5₁₀ is adjusted to have 1000_0001₂ in the exponent field. Then, the mantissa field of 2.5₁₀ becomes 0.101₂.
     c. Add the mantissa fields of the numbers. In this example, add 0.101₂ and 1.0011₂ to obtain 1.1101₂.
     d. Then, append the exponent field. In this example, the result becomes 0100_0000_1110_1000_0000_0000_0000_0000₂.
     e. Convert the result to a decimal number. In this example, the exponent field of the result is 1000_0001₂=129₁₀. By subtracting 127₁₀ from 129₁₀, the user obtains 2₁₀.
     f. Thus, the result is represented by 1.1101₂×2²=111.01₂. 111₂ is equal to 7₁₀. 0.01₂ is equal to 0.25₁₀. Thus, the user obtains 7.25₁₀.
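The addition steps above (determine the fields, align to the maximum exponent, add the mantissas, append the exponent) can be sketched in Python as follows. This is a simplified illustration, not a full IEEE 754 adder: it assumes two positive, normalized single precision inputs, truncates bits shifted out during alignment instead of rounding, and ignores overflow, infinities and NaNs. The function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the 32-bit single precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a single precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def add_positive_singles(a_bits: int, b_bits: int) -> int:
    """Add two positive, normalized single precision numbers given as bit patterns."""
    # Determine the exponent fields.
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    # Restore the implicit leading 1 so each significand has 24 bits (1.xxx).
    ma = (a_bits & 0x7FFFFF) | 0x800000
    mb = (b_bits & 0x7FFFFF) | 0x800000
    # Adjust the number with the smaller exponent to the maximum exponent.
    e_max = max(ea, eb)
    ma >>= e_max - ea   # bits shifted out are truncated (assumption: no rounding)
    mb >>= e_max - eb
    # Add the aligned significands.
    total = ma + mb
    # Renormalize if the sum carried into bit 24 (i.e., the sum is 2.0 or more).
    if total >> 24:
        total >>= 1
        e_max += 1
    # Append the exponent field; drop the implicit leading 1 again.
    return (e_max << 23) | (total & 0x7FFFFF)
```

Calling `add_positive_singles(0x40200000, 0x40980000)` returns `0x40E80000`, which `bits_to_float` converts to 7.25, in agreement with the worked example.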
  
Although this example is based on single precision floating point numbers, the mechanism used in this example can be extended to double precision floating point numbers. A double precision floating point number is represented by 64 bits, i.e., 1 bit for the signed bit, 11 bits for the exponent field and 52 bits for the mantissa field.
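As a brief illustration of the 64-bit layout, the following Python sketch splits a double precision number into its three fields. The function name `decode_double` is hypothetical; the exponent bias of 1023 mentioned in the comment comes from IEEE 754 rather than from the text above.

```python
import struct

def decode_double(x: float) -> tuple[int, int, int]:
    """Split a double precision number into its (S, E, M) bit fields."""
    # Pack as a big-endian 64-bit float, then reinterpret as an unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1-bit signed field
    exponent = (bits >> 52) & 0x7FF      # 11-bit exponent field (IEEE 754 bias: 1023)
    mantissa = bits & ((1 << 52) - 1)    # 52-bit mantissa field
    return sign, exponent, mantissa
```

For 2.5, the exponent field is 1024 (i.e., 1023+1) and the mantissa field holds the fractional bits 01 followed by zeros.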
Traditionally, in a parallel computing system, floating point number additions in multiple computing node operations, e.g., via messaging, are done in part, e.g., by software. At each network hop, the additions require a processor to first receive multiple network packets associated with multiple messages involved in a reduction operation. Then, the processor adds up the floating point numbers included in the packets, and finally puts the results back into the network for processing at the next network hop. An example of a reduction operation is finding the sum of a plurality of floating point numbers contributed (i.e., provided) from a plurality of computing nodes. This software approach has a large overhead and cannot fully utilize the high network bandwidth (e.g., 2 GB/s) of the parallel computing system.
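The per-hop software reduction described above can be modeled with a short Python sketch. It is only a toy model under stated assumptions: the nodes are arranged in a binary tree stored as a list (node `i` has children `2*i+1` and `2*i+2`), each list entry is the value contributed by that node, and a recursive call stands in for receiving a packet from a child. The function name `reduce_at_hop` and the tree topology are hypothetical and are not taken from the text.

```python
def reduce_at_hop(node: int, values: list[float]) -> float:
    """Return the sum of the value at `node` and the partial sums of its children."""
    partial = values[node]  # this node's own contribution
    for child in (2 * node + 1, 2 * node + 2):
        if child < len(values):
            # Model of one hop: "receive" the child's partial sum, add it in
            # software, and later pass the combined result up to the parent.
            partial += reduce_at_hop(child, values)
    return partial

# Four nodes contribute one floating point number each; node 0 is the root.
total = reduce_at_hop(0, [1.5, 2.5, 4.75, 0.25])
```

Every hop in this model performs its addition on the node's processor, which is the source of the overhead described above.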
Therefore, it is desirable to perform the floating point number additions in a collective logic device to reduce the overhead and/or to fully utilize the network bandwidth.