An FPU (floating point unit) is a type of coprocessor embedded in a more general microprocessor that manipulates numbers more quickly than the basic microprocessor can. A coprocessor is a computer processor that assists the main processor by performing certain special functions, usually much faster than the main processor could perform them in software. The coprocessor often decodes instructions in parallel with the main processor and executes only those instructions intended for it. For example, an FPU coprocessor performs mathematical computations, particularly floating point operations. FPU coprocessors are also called numeric or math coprocessors. FPUs are often built into personal computers and servers that run specialized applications such as graphic image processing or display. In addition to math coprocessors, there can also be graphics coprocessors for manipulating graphic images.
An FPU coprocessor is designed to handle large, complex mathematical operations using floating point numbers. Floating point numbers are numbers that are carried out to a certain decimal position (such as 3.141598). In a digital system, floating point numbers are typically expressed in binary format, that is, in powers of 2. They are stored in three parts: the sign (plus or minus); the significand, or mantissa, which holds the fraction part of a sign-magnitude binary significand whose leading integer bit is hidden (implied rather than stored); and the exponent, or order of magnitude of the mantissa, which determines the place and direction to which the binary point floats.
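The three-part storage described above can be illustrated with a minimal sketch. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits), which the text does not name; the function names `decompose_float32` and `recompose_float32` are hypothetical and chosen only for illustration.

```python
import struct


def decompose_float32(value):
    """Split a number into the sign, biased exponent, and fraction fields.

    Assumes the IEEE 754 single-precision layout (an assumption of this sketch).
    """
    # Pack as a 32-bit float, then reinterpret the same bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                       # 1 bit: 0 = plus, 1 = minus
    biased_exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF              # 23 bits; the integer bit is not stored
    return sign, biased_exponent, fraction


def recompose_float32(sign, biased_exponent, fraction):
    """Rebuild the value of a normal number from its three fields.

    Valid only for normal numbers (biased exponent from 1 to 254).
    """
    significand = 1.0 + fraction / 2**23    # restore the hidden integer bit
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)


if __name__ == "__main__":
    fields = decompose_float32(-6.5)
    print(fields)
    print(recompose_float32(*fields))
```

Under these assumptions, -6.5 is held as a minus sign, a significand of 1.625, and an exponent of 2, since 1.625 × 2² = 6.5. Only the fraction 0.625 is stored; the leading 1 is the hidden integer bit.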
Since an FPU is used for highly complex, computation-intensive operations, its performance is closely tied to its throughput (i.e., the number of instructions processed in a given period of time) and speed. For many digital signal processing needs, such as an RF encoder/decoder, an audio/video compression encoder/decoder, or a cryptographic encoder/decoder, high-speed floating-point computations involving vector multiplication and addition operations are a critical design factor. Unfortunately, conventional FPUs fail to deliver the high vector processing speed required by high-performance digital signal processing systems. Some conventional vector FPUs use a pipelined architecture to implement vector multiplication and addition operations in order to improve throughput. However, even with a pipelined architecture, conventional vector FPUs do not deliver the processing speed demanded by high-performance digital signal processing systems because of their architectural limitations. For example, conventional FPUs, even if they are pipelined, execute the multiplication and addition operations in series in the pipeline. Because the multiplication and the addition execute sequentially, the pipeline latency in a conventional vector processor cannot be reduced below a certain point: the pipeline includes both multiplication and addition stages, and every result must pass through all of them.
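The latency floor of a serial multiply-then-add pipeline can be made concrete with an idealized timing model. This is a sketch under stated assumptions, not a description of any particular FPU: each stage takes one clock cycle, one new operand pair enters per cycle, and there are no stalls. The function names and the stage counts used in the example are hypothetical.

```python
def pipeline_cycles(num_elements, multiply_stages, add_stages):
    """Cycles needed to push num_elements multiply-add operations through
    a pipeline whose multiply stages are followed in series by add stages.

    Idealized model: one cycle per stage, one issue per cycle, no stalls.
    """
    if num_elements <= 0:
        return 0
    # The first result must traverse every multiply stage and every add stage.
    latency = multiply_stages + add_stages
    # Once the pipeline is full, one further result completes per cycle.
    return latency + (num_elements - 1)


def pipeline_throughput(num_elements, multiply_stages, add_stages):
    """Average results per cycle under the same idealized model."""
    cycles = pipeline_cycles(num_elements, multiply_stages, add_stages)
    return num_elements / cycles if cycles else 0.0


if __name__ == "__main__":
    # Hypothetical stage counts: 3 multiply stages, 2 add stages.
    for n in (1, 8, 1024):
        print(n, pipeline_cycles(n, 3, 2), round(pipeline_throughput(n, 3, 2), 3))
```

In this model, pipelining pushes throughput toward one result per cycle for long vectors, but the latency term stays at the sum of the multiply and add stage counts for as long as the two operations run in series.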
Further, conventional FPUs lack flexibility and are cost-inefficient. For example, a vector multiplication requires row-column multiplication of multi-dimensional input vector operands. This operation requires a large number of memory accesses of various kinds, including sequential read accesses and repeat read accesses. Conventional FPUs often lack an architecture flexible enough to handle these various types of memory accesses efficiently. Also, constructing a flexible-architecture FPU can be prohibitively expensive using conventional technology.
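The two access patterns can be seen in a minimal sketch of a row-column multiplication. It assumes one plausible reading of the text: a matrix operand whose elements are each read once in storage order (sequential reads), multiplied by a vector operand that is read again for every row (repeat reads). The function name `row_column_multiply` and the counters are hypothetical and serve only to make the access counts visible.

```python
def row_column_multiply(matrix, vector):
    """Multiply a matrix (list of rows) by a vector, counting operand reads.

    Returns (result, matrix_reads, vector_read_counts), where matrix_reads is
    the total number of matrix element reads and vector_read_counts[j] is how
    many times vector[j] was read.
    """
    matrix_reads = 0
    vector_read_counts = [0] * len(vector)
    result = []
    for row in matrix:
        if len(row) != len(vector):
            raise ValueError("row length must match vector length")
        accumulator = 0.0
        for j, element in enumerate(row):
            matrix_reads += 1            # sequential read: each element visited once
            vector_read_counts[j] += 1   # repeat read: same element again for this row
            accumulator += element * vector[j]
        result.append(accumulator)
    return result, matrix_reads, vector_read_counts


if __name__ == "__main__":
    print(row_column_multiply([[1, 2, 3], [4, 5, 6]], [1, 0, -1]))
```

For a matrix with R rows and C columns, this sketch performs R × C sequential reads of the matrix and reads each of the C vector elements R times, so a memory interface serving such an operation has to support both patterns.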
In view of the foregoing, it is highly desirable to provide a flexible, cost-efficient FPU. It is also desirable to provide a high-speed FPU whose throughput meets the data processing speed required by high-performance digital signal processing systems without sacrificing flexibility or cost-efficiency.