In the execution of programs on a computer, it is often desirable, and sometimes necessary, to compute power functions for data being processed. For instance, display devices, such as CRT monitors and LCD screens, exhibit a non-linear intensity-to-voltage response. A curve that characterizes this response corresponds roughly to a power function, e.g. $L = V^{2.5}$, where V is the input voltage and L is the output intensity. The monitor in this situation is therefore said to have a “gamma” of 2.5. To correct for the gamma of the display, it is a common practice to raise the input signal to a power which is the inverse of the gamma. Thus, in this example, a gamma-corrected input voltage $V' = V^{1/2.5}$ is computed to control the display device.
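By way of illustration only, the gamma correction described above can be sketched as follows. The sketch assumes that voltages and intensities are normalized to the range 0 to 1; the function names are hypothetical and are not taken from any particular system.

```python
GAMMA = 2.5  # display gamma used in the example above


def display_response(v: float, gamma: float = GAMMA) -> float:
    """Model the display: output intensity L = V ** gamma (V assumed in [0, 1])."""
    return v ** gamma


def gamma_correct(v: float, gamma: float = GAMMA) -> float:
    """Pre-compensate the input: V' = V ** (1 / gamma)."""
    return v ** (1.0 / gamma)


if __name__ == "__main__":
    # Driving the display with the corrected voltage should give an
    # intensity that is (approximately) linear in the original input.
    for v in (0.0, 0.25, 0.5, 1.0):
        corrected = gamma_correct(v)
        print(v, corrected, display_response(corrected))
```

Under these assumptions, applying the display response to the corrected voltage returns the original value to within floating-point rounding, which is the purpose of the correction.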
This gamma correction is computed for each pixel in the displayed image. In a high resolution display, a single image could consist of more than two million pixels. Thus, an appreciable portion of the computer's processing power is consumed by the calculation of power functions for the display of images.
Power functions are utilized in a variety of applications, in addition to gamma correction. In particular, multimedia applications employ power functions. For example, the decoding of audio files in the MPEG3 and MPEG4 formats requires the computation of power functions for quantization purposes. Similarly, a number of types of scientific computing employ power functions.
The computation of a power function is relatively expensive in terms of computer processing time. To reduce the overhead associated with the calculation of power functions, it has been a common practice to employ pre-calculated lookup tables, which allow the values to be retrieved quickly enough for high-throughput multimedia applications. However, certain limitations are associated with the use of lookup tables. First, lookup tables, by their nature, give limited-precision results, and sometimes introduce substantial error into the calculation. Consequently, a degradation of signal quality may occur.
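The precision limitation can be illustrated with a minimal sketch of a pre-calculated table. The table size of 256 entries, the 8-bit rounding of each entry, and the nearest-entry lookup are all assumptions chosen for illustration; actual tables may be larger or may interpolate between entries.

```python
GAMMA = 2.5
TABLE_SIZE = 256  # assumption: one entry per 8-bit input code

# Pre-calculated table: entry i holds (i / 255) ** (1 / GAMMA),
# rounded to the nearest 8-bit output code.
TABLE = [
    round(255 * (i / (TABLE_SIZE - 1)) ** (1.0 / GAMMA))
    for i in range(TABLE_SIZE)
]


def lookup_gamma_correct(v: float) -> float:
    """Gamma-correct v in [0, 1] by nearest-entry table lookup."""
    index = round(v * (TABLE_SIZE - 1))
    return TABLE[index] / 255.0


def exact_gamma_correct(v: float) -> float:
    """Reference value computed directly as a power function."""
    return v ** (1.0 / GAMMA)


def max_table_error(samples: int = 10001) -> float:
    """Largest absolute difference between table and direct results."""
    points = (k / (samples - 1) for k in range(samples))
    return max(abs(lookup_gamma_correct(v) - exact_gamma_correct(v)) for v in points)


if __name__ == "__main__":
    print("maximum absolute error:", max_table_error())
```

In this sketch the error is largest for inputs near zero, where the correction curve is steepest and neighboring table entries are farthest apart.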
Second, each time that a call is made to a lookup table, the retrieval of a calculated value may cause other important data to be flushed from the cache memory of the computer. The loss of this data from the cache memory may result in performance problems elsewhere in the application being executed.
Third, a lookup table of the size necessary to reliably support operations such as gamma correction cannot be readily implemented in a vector processing architecture, also known as a single-instruction, multiple-data (SIMD) architecture. FIG. 1 illustrates an example of such an architecture. A computer system 10 includes a scalar floating point engine 12 and a vector floating point engine 14. The scalar engine 12 performs operations on a single set of data at a time, and hence is capable of producing one output value per operation. By contrast, the vector engine 14 operates upon arrays of data, and is therefore capable of producing multiple output results at once. For example, the vector engine 14 may contain registers which are each 128 bits in length. If values are represented in a 32-bit format, each register is capable of containing a vector of four data values. The vector engine operates upon these four data values simultaneously, for example adding them to a vector of other data values in another register, to produce four output values at once.
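The register layout described above can be modeled in a short sketch. This is a software model only: the helper names are hypothetical, the little-endian byte order is an assumption, and the loop inside vector_add stands in for what vector hardware would perform on all four lanes in a single operation.

```python
import struct

LANES = 4  # a 128-bit register divided into 32-bit values


def pack_register(values):
    """Pack four floats into 16 bytes (128 bits) as 32-bit IEEE-754 values."""
    if len(values) != LANES:
        raise ValueError("a register holds exactly four 32-bit values")
    return struct.pack("<4f", *values)


def unpack_register(register):
    """Return the four 32-bit values held in a 16-byte register image."""
    return list(struct.unpack("<4f", register))


def vector_add(register_a, register_b):
    """Model one vector add: each lane of a is added to the same lane of b."""
    a = unpack_register(register_a)
    b = unpack_register(register_b)
    return pack_register([x + y for x, y in zip(a, b)])


if __name__ == "__main__":
    r1 = pack_register([1.0, 2.0, 3.0, 4.0])
    r2 = pack_register([10.0, 20.0, 30.0, 40.0])
    print(len(r1) * 8, "bits per register")
    print(unpack_register(vector_add(r1, r2)))  # [11.0, 22.0, 33.0, 44.0]
```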
A memory 16 is accessible by both the scalar and vector engines, and can be used to transfer data between them, as well as to other system components (not shown). For operations that cannot be carried out in a vectorized manner, the input data values are transferred from the vector engine 14 to the memory 16. These data values are serially retrieved from the memory by the scalar engine 12, which performs the requested operation on one element of the input data vector at a time. The results of these scalar operations are stored in the memory 16, where they can be retrieved by the vector engine 14 to perform further operations.
It can be seen that, each time an operation must be performed in the scalar engine, the overall efficiency of the processing system suffers. First, the number of operations required to process the set of data increases by a factor of N, where N is the number of data values contained in a vector, e.g. 4 in the example given above. The efficiency is further diminished by the read and write operations needed to transfer data between the vector engine 14 and the scalar engine 12, via the memory 16.
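The factor-of-N increase can be made concrete with a simple counting model. The model is an assumption for illustration: it counts one operation per vector in the vector engine and one operation per data value in the scalar engine, and it counts only the vector-side store and load for the transfers, ignoring the scalar engine's own reads and writes as well as any latency.

```python
def vector_op_count(num_values: int, lanes: int = 4) -> int:
    """Operations needed when the step runs entirely in the vector engine."""
    return -(-num_values // lanes)  # ceiling division: one op per vector


def scalar_op_count(num_values: int) -> int:
    """Operations needed when the step falls back to the scalar engine."""
    return num_values  # one op per data value


def vector_transfer_count(num_values: int, lanes: int = 4) -> int:
    """Vector-side transfers for the fallback: one store and one load per vector."""
    return 2 * vector_op_count(num_values, lanes)


if __name__ == "__main__":
    pixels = 2_000_000  # image size mentioned earlier in this section
    print("vector ops:", vector_op_count(pixels))
    print("scalar ops:", scalar_op_count(pixels))
    print("extra vector transfers:", vector_transfer_count(pixels))
```

For an image of two million pixels and four values per vector, this model gives 500,000 vector operations against 2,000,000 scalar operations, before any transfer cost is counted.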
Thus, it can be seen that a table lookup operation that is implemented in the scalar engine presents a significant bottleneck in the throughput rate for gamma correction and other operations that require a large number of power function calculations. It is desirable, therefore, to provide a technique for calculating power functions which eliminates the need to retrieve values from a large table of data. More specifically, it is desirable to provide such a technique which can be implemented within the vector processing engine, and thereby eliminate the inefficiencies associated with scalar operations.