As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
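The effect of dependencies on a pipeline can be illustrated with a minimal sketch. The model below is an assumption made for illustration only: a single in-order pipeline of fixed depth that accepts one instruction per cycle and has no result forwarding, so a dependent instruction cannot issue until its producer has left the pipeline. The function name `count_cycles` and the `deps` encoding are hypothetical.

```python
def count_cycles(deps, depth):
    """Model a single in-order pipeline with `depth` stages.

    deps[i] is the index of the earlier instruction whose result
    instruction i needs, or None if instruction i is independent.
    Returns (total_cycles, bubbles), where bubbles counts issue
    slots left empty while a dependent instruction waited.
    """
    issue_cycle = []
    next_free = 0  # earliest cycle at which the next instruction may issue
    for dep in deps:
        earliest = next_free
        if dep is not None:
            # Assumed: no forwarding, so wait until the producer has
            # passed through all `depth` stages.
            earliest = max(earliest, issue_cycle[dep] + depth)
        issue_cycle.append(earliest)
        next_free = earliest + 1
    total_cycles = issue_cycle[-1] + depth
    bubbles = issue_cycle[-1] + 1 - len(deps)
    return total_cycles, bubbles


# Four independent instructions versus a chain of four dependent ones,
# both on a four-stage pipeline.
independent = count_cycles([None, None, None, None], depth=4)
chained = count_cycles([None, 0, 1, 2], depth=4)
```

Under these assumptions the independent stream leaves no empty issue slots, whereas the dependent chain leaves three empty slots behind each of its first three instructions.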
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multi-threading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives up the utilization and hence the aggregate throughput. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A SIMD or vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a SIMD or vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
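The lane-wise behavior of a SIMD execution unit may be sketched in software as follows. This is only an illustrative model, not a description of any particular hardware: the name `simd_op` and the constant `LANES` are hypothetical, and the lanes are evaluated sequentially here although a hardware unit would process them at the same time.

```python
import operator

LANES = 4  # assumed vector width: four words per vector


def simd_op(op, a, b):
    """Apply the same operation to each pair of lanes of two vectors."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each vector must hold exactly LANES words")
    # Each lane handles one data point; all lanes perform the same operation.
    return [op(x, y) for x, y in zip(a, b)]


# One "instruction" (an add) operates on four data elements.
result = simd_op(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
```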
The aforementioned techniques may also be combined, resulting in a multi-threaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a SIMD execution unit to process “vectors” of data points at the same time.
In addition, it is also possible to employ multiple execution units in the same processor to provide additional parallelization. The multiple execution units may be specialized to handle different types of instructions, or may be similarly configured to process the same types of instructions.
Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread in a multi-threaded architecture is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline(s) kept at a minimum. In addition, when multiple execution units are used, the issuance of instructions to such execution units may be handled by the same issue unit, or alternatively by separate issue units.
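As a rough illustration of how a scheduling algorithm can keep bubbles to a minimum, the sketch below models round-robin issue from several threads into one pipeline. Round-robin is only one possible policy and is assumed here for simplicity; the function name `utilization` and the workload (each thread running a chain of mutually dependent instructions, with no result forwarding) are likewise assumptions.

```python
def utilization(num_threads, chain_len, depth):
    """Fraction of issue slots filled under round-robin scheduling.

    Each thread runs `chain_len` instructions, each dependent on the
    previous one in the same thread, so a thread may issue again only
    `depth` cycles after its last issue.
    """
    ready_at = [0] * num_threads        # cycle at which each thread may issue
    remaining = [chain_len] * num_threads
    total = num_threads * chain_len
    issued = 0
    cycle = 0
    next_thread = 0                     # round-robin starting point
    while issued < total:
        for k in range(num_threads):
            t = (next_thread + k) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + depth
                issued += 1
                next_thread = (t + 1) % num_threads
                break                   # at most one issue per cycle
        cycle += 1
    return issued / cycle


single = utilization(num_threads=1, chain_len=4, depth=4)
four = utilization(num_threads=4, chain_len=4, depth=4)
```

In this model a single thread fills only 4 of 13 issue slots, while four threads on a four-stage pipeline fill every slot, because another thread always has a ready instruction while one thread waits on its own result.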
Another technique that may be used to improve the performance of a processor is to employ a microcode unit or sequencer to automatically generate instructions for execution by an execution unit. A microcode unit or sequencer responds to commands, e.g., via dedicated instructions in an instruction set, and in response, outputs a sequence of instructions to be executed by the processor. In much the same way that a software procedure can be used to perform a repeatable sequence of steps in response to a procedure call in a software program, a microcode unit or sequencer can be triggered by a command or instruction to perform a repeatable operation.
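The analogy to a procedure call can be made concrete with a small software sketch. The table contents, the command name `DOT3`, and the instruction mnemonics below are all hypothetical and are not taken from any real instruction set; the sketch only shows a command being replaced by a stored, repeatable instruction sequence.

```python
# Hypothetical microcode table: each command maps to a fixed sequence
# of instructions (here, a three-element dot product).
MICROCODE = {
    "DOT3": [
        "mul t0, a0, b0",
        "mul t1, a1, b1",
        "mul t2, a2, b2",
        "add t0, t0, t1",
        "add r0, t0, t2",
    ],
}


def expand(stream, table=MICROCODE):
    """Replace each command in the stream with its stored sequence.

    Instructions that are not commands pass through unchanged.
    """
    out = []
    for instr in stream:
        out.extend(table.get(instr, [instr]))
    return out


expanded = expand(["load a", "DOT3", "store r0"])
```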
Microcode units or sequencers are particularly useful for performing long latency operations, i.e., operations that take a relatively long time to perform, and in the case of pipelined execution units, often require multiple passes through an execution pipeline. One example of a long latency operation for which the use of a microcode unit or sequencer might find benefit is in image processing, e.g., texture processing performed during rasterization of a graphical image. Rasterization is a process in 3D graphics where three dimensional geometry that has been projected onto a screen is “filled in” with pixels of the appropriate color and intensity. A texture mapping algorithm is typically incorporated into a rasterization process to paint a texture onto geometric objects placed into a scene.
In order to paint a texture onto a placed object in a scene, the pixels in each primitive making up the object are typically transformed from 3D scene or world coordinates (e.g., x, y and z) to 2D coordinates relative to a procedural or bitmapped texture (e.g., u and v). The fundamental elements in a texture are referred to as texels (or texture pixels), and being the fundamental element of a texture, each texel is associated with a single color. Due to differences in orientation and distance of the surfaces of placed geometric primitives relative to the viewer, a pixel in an image buffer will rarely correspond to a single texel in a texture. As a result, texture filtering is typically performed to determine a color to be assigned to a pixel based upon the colors of multiple texels in proximity to the texture mapped position of the pixel.
A number of texture filtering algorithms may be used to determine a color for a pixel, including simple interpolation, bilinear filtering, trilinear filtering, and anisotropic filtering, among others. With many texture filtering algorithms, weights are calculated for a number of adjacent texels to a pixel, the weights are used to scale the colors of the adjacent texels, and a color for the pixel is assigned by summing the scaled colors of the adjacent texels. The color is then either stored at the pixel location in a frame buffer, or used to update a color that is already stored at the pixel location.
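The weight, scale, and sum steps common to these algorithms may be sketched as follows. The function name `weighted_color` and the representation of a color as an (r, g, b) tuple are assumptions made for illustration, and the weights are assumed to already sum to one.

```python
def weighted_color(colors, weights):
    """Scale each texel color by its weight and sum the results.

    colors  : list of (r, g, b) tuples for the adjacent texels
    weights : one weight per texel, assumed to sum to 1.0
    """
    if len(colors) != len(weights):
        raise ValueError("need exactly one weight per texel")
    return tuple(
        sum(w * c[channel] for c, w in zip(colors, weights))
        for channel in range(3)
    )


# Four adjacent texels contributing equally to the pixel color.
texels = [(0, 0, 0), (4, 8, 12), (0, 0, 0), (4, 8, 12)]
pixel = weighted_color(texels, [0.25, 0.25, 0.25, 0.25])
```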
Bilinear filtering, for example, uses the coordinates of a texture sample to perform a weighted average of four adjacent texels, with each texel weighted according to how close the sample coordinates are to the center of that texel. Bilinear filtering often can reduce the blockiness of closer details, but often does little to reduce the noise that is often found in distant details.
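A minimal sketch of bilinear filtering follows. Several choices here are assumptions made for illustration: the texture is a grayscale 2D list indexed as `texture[row][column]`, sample coordinates are given in texel units with texel centers at half-integer positions, and coordinates beyond the edge are clamped. The function name `bilinear` is hypothetical.

```python
import math


def bilinear(texture, x, y):
    """Weighted average of the four texels nearest to (x, y)."""
    height = len(texture)
    width = len(texture[0])

    # Shift so that texel centers fall on integer coordinates.
    fx = x - 0.5
    fy = y - 0.5
    x0 = math.floor(fx)
    y0 = math.floor(fy)
    tx = fx - x0  # horizontal distance from the left texel center
    ty = fy - y0  # vertical distance from the upper texel center

    def texel(ix, iy):
        # Assumed edge behavior: clamp to the nearest valid texel.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    top = (1 - tx) * texel(x0, y0) + tx * texel(x0 + 1, y0)
    bottom = (1 - tx) * texel(x0, y0 + 1) + tx * texel(x0 + 1, y0 + 1)
    return (1 - ty) * top + ty * bottom


tex = [[0, 10],
       [20, 30]]
center = bilinear(tex, 1.0, 1.0)    # equidistant from all four texel centers
on_texel = bilinear(tex, 0.5, 0.5)  # exactly on the center of texel (0, 0)
```

A sample that is equidistant from four texel centers receives the plain average of the four texels, while a sample that falls exactly on a texel center receives that texel's value unchanged.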
Trilinear filtering involves using MIP mapping, which uses a set of prefiltered texture images that are scaled to successively lower resolutions. The algorithm uses texture samples from the high resolution textures for portions of the geometry near to the camera, and low resolution textures for the portions distant to the camera. MIP mapping often reduces nearby pixelation and distant noise; however, detail in the distance is often lost and needlessly blurred. The blurriness is due to the texture samples being taken from a MIP level of the texture that has been pre-scaled to a low resolution in both the x and y dimensions uniformly, such that resolution is lost in the direction perpendicular to the direction that the texture is most compressed.
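The loss of detail described above can be seen in a small sketch that builds a MIP chain and blends between two adjacent levels. The sketch is simplified under stated assumptions: the texture is square, grayscale, and a power of two in size; each lower level is produced by a 2x2 box average; and each level is sampled with nearest-texel lookup for brevity, whereas trilinear filtering would apply bilinear filtering within each level. All function names are hypothetical.

```python
def build_mips(level0):
    """Build successively half-resolution levels by 2x2 averaging."""
    mips = [level0]
    while len(mips[-1]) > 1:
        src = mips[-1]
        n = len(src) // 2
        mips.append([
            [(src[2 * y][2 * x] + src[2 * y][2 * x + 1]
              + src[2 * y + 1][2 * x] + src[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(n)]
            for y in range(n)
        ])
    return mips


def sample_nearest(level, u, v):
    """Nearest-texel lookup with u and v in the range [0, 1]."""
    n = len(level)
    x = min(int(u * n), n - 1)
    y = min(int(v * n), n - 1)
    return level[y][x]


def blend_levels(mips, u, v, lod):
    """Linearly blend the two MIP levels that bracket `lod`."""
    lod = min(max(lod, 0.0), len(mips) - 1)
    lo = int(lod)
    hi = min(lo + 1, len(mips) - 1)
    t = lod - lo
    return ((1 - t) * sample_nearest(mips[lo], u, v)
            + t * sample_nearest(mips[hi], u, v))


# A 4x4 checkerboard: the prefiltered levels collapse to uniform gray.
checker = [[0, 8, 0, 8],
           [8, 0, 8, 0],
           [0, 8, 0, 8],
           [8, 0, 8, 0]]
mips = build_mips(checker)
halfway = blend_levels(mips, 0.1, 0.1, 0.5)
```

Because every level is scaled uniformly in both dimensions, the checkerboard pattern disappears entirely at the first reduced level, which illustrates the blurring of distant detail noted above.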
Anisotropic filtering involves taking multiple samples along a “line of anisotropy” which runs in the direction that the texture is most compressed. Each of these samples may be bilinear or trilinear filtered, and the results are then averaged together. This algorithm allows the compression to occur in only one direction. By doing so, less blurring often occurs in more distant features.
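The sampling pattern may be sketched as follows. The function name `anisotropic`, the even spacing of samples along the line, and the equal weighting of the samples are assumptions made for illustration; the per-sample filter is passed in as a callable so that either a bilinear or a trilinear filter could be supplied.

```python
def anisotropic(sample, u, v, du, dv, n):
    """Average n samples taken along a line of anisotropy.

    sample   : callable (u, v) -> filtered value at that position
    (u, v)   : center of the pixel's footprint in texture space
    (du, dv) : extent of the footprint along the line of anisotropy
    n        : number of samples, spaced evenly along the line
    """
    total = 0.0
    for i in range(n):
        # Offsets are centered on (u, v) and span the footprint.
        t = (i + 0.5) / n - 0.5
        total += sample(u + t * du, v + t * dv)
    return total / n


# Hypothetical texture: a ramp whose value equals the u coordinate.
def ramp(u, v):
    return u


averaged = anisotropic(ramp, 0.5, 0.5, 0.4, 0.0, 4)
```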
While the filtering calculations discussed above are often long latency operations, it has been found that conventional microcode units or sequencers suffer from a number of drawbacks that render such components sub-optimal for use in connection with performing filtering calculations in a processor, in particular within a multithreaded processor that utilizes multiple execution units. Conventional microcode units and sequencers, in particular, are typically upstream of, and thus coupled to the input of, the instruction buffer logic for a processor.
In many designs, the same instruction buffer logic, which may include one or more instruction buffers, buffers the instructions to be executed by all of the execution units in a processor. Instruction fetch logic typically fetches instructions for the programs currently executing on the processor from memory (e.g., from an instruction cache) and stores those instructions in one or more instruction buffers. The instructions are then passed to the execution units for execution. When multiple execution units are served by the same instruction buffer logic, scheduling logic is used to issue instructions to appropriate execution units. In addition, when execution units are multi-threaded, scheduling logic manages the issuance of instructions from multiple threads.
A conventional microcode unit or sequencer, coupled upstream of the instruction buffer logic, suffers from a number of drawbacks that can reduce the performance of a processor that implements such a component. For example, most conventional microcode units or sequencers require several cycles to initialize a sequence, e.g., to calculate the address from which instructions for the sequence should be fetched. In addition, by being upstream of the instruction buffer logic, the decode of an instruction that triggers a microcode unit or sequencer will typically require later instructions already issued to an execution pipeline to be flushed before the desired sequence can start.
In addition, since a conventional microcode unit or sequencer is upstream of the instruction buffer logic that serves all of the execution units, whenever a sequence is being performed, typically all other instructions from the instruction buffer are blocked from executing on all execution units. Thus, when a sequence is being performed, a multi-threaded, multi-execution unit processor functions more or less as a single-threaded, single-execution unit processor, which severely limits the parallelism of the processor.
Therefore, a need exists in the art for a manner of improving the performance of long latency operations such as filtering operations in a multi-execution unit processor.