Modern computer processor architectures typically rely on multiple functional units to execute instructions from a computer program. An instruction or issue unit typically retrieves instructions and dispatches, or issues, the instructions to one or more execution units to handle the instructions. A typical computer processor may include, for example, a load/store unit that handles retrieval and storage of data from and to a memory, and a fixed point execution unit, or arithmetic logic unit (ALU), to handle logical and arithmetic operations.
Whereas earlier processor architectures utilized a single ALU to handle all logical and arithmetic operations, demands for increased performance necessitated the development of superscalar architectures that utilize multiple execution units to handle different types of computations. Doing so enables multiple instructions to be routed to different execution units and executed in parallel, thereby increasing overall instruction throughput.
One of the most common types of operations that can be partitioned into a separate execution unit is floating point arithmetic. Floating point calculations involve performing mathematical computations using one or more floating point values. A floating point value is typically represented as a combination of an exponent and a significand. The significand, which may also be referred to as a fraction or mantissa, represents the digits in a floating point value with a predetermined precision, while the exponent represents the relative position of the binary point for the floating point value. A floating point execution unit typically includes separate exponent and significand paths, with a series of adders incorporated into the exponent path to calculate the exponent of a floating point result, and a combination of multiplier, alignment, normalization, rounding and adder circuitry incorporated into the significand path to calculate the significand of the floating point result.
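By way of illustration only, the exponent and significand representation described above may be sketched as follows. The sketch assumes the IEEE 754 single precision layout (1 sign bit, 8 biased exponent bits, 23 fraction bits), which the foregoing description does not mandate. The function names `decompose_float32` and `recompose_normal` are hypothetical.

```python
import struct

# Assumption: IEEE 754 single precision layout (1 sign bit, 8 exponent bits
# with a bias of 127, 23 fraction bits). Other formats differ only in widths.

def decompose_float32(value):
    """Round value to single precision and return (sign, biased_exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction

def recompose_normal(sign, biased_exponent, fraction):
    """Rebuild a normal value (not zero, subnormal, infinity, or NaN) from its fields."""
    significand = 1.0 + fraction / 2**23  # implicit leading 1 plus stored fraction
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)

if __name__ == "__main__":
    fields = decompose_float32(6.5)  # 6.5 = 1.625 * 2**2
    print(fields)
    print(recompose_normal(*fields))
```

In this sketch the exponent field selects the position of the binary point, while the fraction field carries the digits of the value at a fixed precision of 23 stored bits.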
Floating point execution units may be implemented as scalar execution units or vector execution units. Scalar execution units typically operate on scalar floating point values, while vector execution units operate on vectors comprising multiple scalar floating point values. Vector floating point execution units have become popular in many 3D graphics hardware designs because much of the data processed in 3D graphics processing is readily vectorizable (e.g., coordinates of objects in space are often represented using 3 or 4 floating point values).
When a separate floating point execution unit is utilized in a computer processor, other arithmetic and logical operations are typically handled in a smaller, less complex fixed point execution unit. Fixed point arithmetic, in contrast with floating point arithmetic, presumes a fixed binary point for each fixed point value. Fixed point arithmetic operations are typically performed more quickly and with less circuitry than floating point operations, with the tradeoff being reduced numerical precision. Floating point operations can also be compiled into multiple fixed point operations capable of being executed by a fixed point execution unit; however, a floating point execution unit often performs the same operations much more quickly and using fewer instructions, so the incorporation of a floating point execution unit into a processor often improves performance for many types of computationally-intensive workloads.
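A minimal sketch of fixed point arithmetic with a fixed binary point is set forth below. It assumes a format with 16 fraction bits, chosen purely for illustration, and all names (`FRACTION_BITS`, `to_fixed`, `fixed_mul`, etc.) are hypothetical. Python integers do not overflow, so the sketch does not model the limited register width of a hardware fixed point execution unit.

```python
# Assumption: 16 bits to the right of the binary point (a "Q16" style format).
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS

def to_fixed(x):
    """Convert a real number to the nearest fixed point integer encoding."""
    return int(round(x * SCALE))

def to_float(q):
    """Convert a fixed point encoding back to a Python float."""
    return q / SCALE

def fixed_add(a, b):
    """Addition needs no adjustment because both operands share one binary point."""
    return a + b

def fixed_mul(a, b):
    """The raw product has 2 * FRACTION_BITS fraction bits, so shift back down."""
    return (a * b) >> FRACTION_BITS

if __name__ == "__main__":
    a, b = to_fixed(1.5), to_fixed(2.25)
    print(to_float(fixed_add(a, b)))
    print(to_float(fixed_mul(a, b)))
    print(to_float(to_fixed(0.1)))  # not exactly 0.1: precision is limited to 1/SCALE
```

Addition and multiplication reduce to integer operations plus a shift, which is consistent with the smaller circuitry noted above, while a value such as 0.1 can only be approximated to within the fixed step size.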
Most high performance processors have therefore migrated to an architecture in which both fixed point and floating point execution units, and in some instances, both scalar and vector fixed point and/or floating point execution units, are incorporated into the same processor, thereby enabling a processor to optimally handle many different types of workloads. For other types of computer processors such as mobile processors, embedded processors, low power processors, etc., however, the inclusion of multiple execution units may be problematic, often increasing cost and requiring excessive circuitry and power consumption.
Nonetheless, a number of different types of calculations still present performance problems for conventional processors. For example, image recognition is fast becoming an important feature in many computer applications. Image recognition, however, often requires substantial processing power, and as a result, the implementation of high performance image recognition algorithms can be a challenge, particularly for mobile devices and other low power devices where power consumption and costs can be paramount concerns.
One operation commonly used in many image recognition algorithms is a packed sum of absolute differences operation. A sum of absolute differences algorithm may be used to measure the similarity between image blocks by taking the absolute difference between corresponding pixels in the two blocks being compared. The differences are then summed to create an indication of block similarity, with a smaller sum indicating more similar blocks.
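The sum of absolute differences calculation described above may be sketched as follows for two equally sized blocks of single-channel pixel values. The function name `sum_of_absolute_differences` and the list-of-rows block representation are assumptions made for illustration.

```python
def sum_of_absolute_differences(block_a, block_b):
    """Return the sum of |a - b| over corresponding pixels of two blocks.

    Each block is a list of rows, and each row is a list of integer pixel
    values. Both blocks must have identical dimensions.
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same number of pixels")
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
    return total

if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    candidate = [[12, 18], [30, 45]]
    print(sum_of_absolute_differences(reference, candidate))
```

Identical blocks yield a sum of zero, and the sum grows as the blocks diverge.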
The “packed” in a packed sum of absolute differences operation refers to how colors are stored in a packed format in memory. A common format is R8G8B8A8, which uses 32 bits per pixel, with 8 bits each for the red, green, blue, and alpha channels (the alpha channel typically serving as a transparency mask). Images are often loaded from memory in packed format, then converted to floating point so that high precision algorithms (e.g., filtering) can be performed, with the results converted back to a packed format and stored back to memory. Where a processor architecture supports a packed sum of absolute differences operation, however, the sum of absolute differences is calculated while the data is still in a packed format, thereby eliminating the need to first convert the data to floating point and then re-pack, often yielding a substantial performance improvement.
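The following sketch illustrates a sum of absolute differences computed directly on packed 32-bit R8G8B8A8 pixel words using only integer shifts and masks, without any conversion to floating point. It is a software model of the arithmetic only and is not intended to depict any particular hardware instruction. The function names are hypothetical, and placing the red channel in the most significant byte is an assumption; the resulting sum does not depend on the channel order.

```python
def packed_sad(word_a, word_b):
    """Sum of absolute differences over the four 8-bit channels of two 32-bit words."""
    total = 0
    # Assumed layout: R in bits 31-24, G in 23-16, B in 15-8, A in 7-0.
    for shift in (24, 16, 8, 0):
        channel_a = (word_a >> shift) & 0xFF
        channel_b = (word_b >> shift) & 0xFF
        total += abs(channel_a - channel_b)
    return total

def packed_sad_pixels(pixels_a, pixels_b):
    """Accumulate packed_sad over two equal-length sequences of packed pixels."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel sequences must have the same length")
    return sum(packed_sad(a, b) for a, b in zip(pixels_a, pixels_b))

if __name__ == "__main__":
    print(packed_sad(0x10203040, 0x12203F45))
    print(packed_sad_pixels([0x10203040, 0xFF000000], [0x12203F45, 0x00000000]))
```

Because each channel is handled as an 8-bit integer throughout, the pixel data never leaves its packed representation.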
Packed sum of absolute differences operations may be supported in a processor architecture using a dedicated vector fixed point instruction, which may be executed in a single pipeline pass using a vector fixed point execution unit. However, some processor architectures may only have a scalar fixed point and/or scalar floating point execution unit, or may only have a vector floating point execution unit. Alternatively, in some processor architectures, a vector fixed point execution unit may be included, but for performance reasons, a second vector fixed point execution unit may also be needed to ensure that dual instruction issue can be performed. Therefore, in all of these cases, additional circuit area is typically required to support packed sum of absolute differences instructions, which necessarily increases power consumption and chip cost, often precluding many processor designs from incorporating native support for packed sum of absolute differences operations.
A need therefore continues to exist in the art for an improved manner of efficiently and cost-effectively handling packed sum of absolute differences operations in a processor architecture.