1. Field of the Invention
The present invention relates to the field of computer systems and, in particular, to a system and micro-architecture for writing select, non-contiguous bytes of packed data in a single instruction.
2. Background Information
Computer technology continues to evolve at an ever-increasing rate. Gone are the days when the computer was merely a business tool primarily used for word-processing and spreadsheet applications. Today, with the evolution of multimedia applications, computer systems have become a common home electronic appliance, much like the television and home stereo system. Indeed, the line between the computer system and other consumer electronic appliances has become blurred, as multimedia applications executing on an appropriately configured computer system will function as a television set, a radio, a video playback device, and the like. Consequently, the market popularity of a computer system is often decided by the amount of memory it contains and the speed at which it can execute such multimedia applications.
Those skilled in the art will appreciate that multimedia and communications applications require the manipulation of large amounts of data represented in a small number of bits to provide the true-to-life renderings of audio and video we have come to expect. For example, to render a 3D graphic, large amounts of eight-bit data must be similarly processed. Prior art processors would have to issue a number of identical instructions, one to move each byte of data, in order to render such a 3D graphic. To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, the Single Instruction, Multiple Data (SIMD) processor architecture has been developed to improve computer system performance by processing several bytes of information in a single instruction.
SIMD architectures take advantage of packing many bytes of data within one register or memory location, employing a data type known in the art as packed data. Packed data generally refers to the representation of multiple numbers by a single value. For example, four eight-bit integer numbers may be represented by a single 32-bit number having four eight-bit segments. Thus, a single instruction from the SIMD instruction set may be used to process four bytes of data, where prior art instruction sets would have required four instructions, i.e., three additional instructions. Accordingly, multiple operations can be performed on separate data elements with one instruction, resulting in significant performance improvements.
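The packed data format described above can be illustrated with a minimal sketch in C. The function names `pack4`, `lane`, and `add4_wrap` are hypothetical and chosen only for illustration, as is the choice to place the first byte in the lowest-order eight-bit segment. The addition routine emulates in ordinary integer arithmetic what a SIMD packed-add instruction would do in hardware; it is not drawn from any particular instruction set.

```c
#include <stdint.h>

/* Pack four eight-bit values into one 32-bit word.
 * Assumption for illustration: b0 occupies the lowest-order segment. */
static uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0
         | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16)
         | ((uint32_t)b3 << 24);
}

/* Extract eight-bit segment i (0..3) from a packed 32-bit word. */
static uint8_t lane(uint32_t packed, unsigned i)
{
    return (uint8_t)(packed >> (8u * i));
}

/* Add all four eight-bit segments of a and b in one pass.
 * Each segment wraps around modulo 256 and no carry crosses
 * from one segment into the next. */
static uint32_t add4_wrap(uint32_t a, uint32_t b)
{
    /* Add the low seven bits of every segment; any carry stays inside it. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of every segment without propagating a carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

A single call such as `add4_wrap(pack4(250, 1, 2, 3), pack4(10, 1, 1, 1))` processes all four segments at once, whereas an unpacked representation would need one addition per byte.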
In theory, with its ability to process multiple bytes of data with one instruction, the SIMD processor architecture has been shown to be capable of performance improvements of up to 4× over non-SIMD processor architectures, while improvements of 1.5× to 2× are more typical. There are a couple of reasons why the theoretical 4× performance improvement has not been reached. One reason is the manner in which prior art SIMD processor architectures process packed data. That is, the 4× performance mark of the SIMD processor architecture can only be achieved when the entire set of data embedded within the packed data is to be similarly processed by the instruction. In instances where select, non-contiguous bytes of the packed data are to be processed, inefficiencies result due to the need for multiple instructions and additional cache management. For example, a prior art move operation (MOVQ SRC1, DEST) typically moves packed data identified by a first operand (SRC1) to a location identified by a second operand (DEST). In such an operation, the entire packed data set identified by SRC1 is moved to the location identified by DEST. Moving only select, non-contiguous bytes of the packed data identified by SRC1 would require multiple instructions.
One example of a prior art approach to moving select, non-contiguous bytes of packed data is the test, branch and write series of instructions. In accordance with this prior art approach, each byte of the packed data is transferred to an integer register, along with a corresponding mask bit. The mask bit is tested and a branch is used to either write or bypass writing the byte to memory. This approach requires many more instructions than a single move, and also suffers a performance penalty whenever the branch is poorly predicted.
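The test, branch and write sequence may be sketched in C as follows. This is an illustrative sketch only: the function name `masked_write_branch`, the use of a 64-bit packed value holding eight bytes, and the convention that bit i of the mask governs byte i are all assumptions made for the example and are not taken from any particular prior art implementation.

```c
#include <stdint.h>

/* Write to dest only those bytes of the packed value whose mask bit is set.
 * Assumptions for illustration: eight bytes per packed value, byte 0 in the
 * lowest-order position, and mask bit i selecting byte i. */
static void masked_write_branch(uint8_t dest[8], uint64_t packed, uint8_t mask)
{
    for (unsigned i = 0; i < 8; i++) {
        /* Transfer one byte of the packed data to an integer variable. */
        uint8_t byte = (uint8_t)(packed >> (8u * i));

        /* Test the corresponding mask bit and branch on the result. */
        if ((mask >> i) & 1u) {
            dest[i] = byte;   /* write the selected byte */
        }
        /* Otherwise the write is bypassed and dest[i] is left unchanged. */
    }
}
```

Each of the eight bytes costs an extract, a test, a branch and possibly a store, and the branch outcome depends on the mask data, which is the source of the misprediction penalty noted above.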
Another example of a prior art approach to moving select, non-contiguous bytes of packed data is the conditional move. In the conditional move, each byte of the packed data is transferred to an integer register, along with a corresponding mask bit. The mask bit is tested and used with a conditional move instruction to write the byte to memory. This approach avoids the performance penalties of the branch misprediction identified above, but still requires a number of instructions to identify and move select, non-contiguous bytes of the packed data.
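A comparable sketch of the conditional move approach is given below, again under stated assumptions. The function name `masked_write_select` and the mask convention (bit i selects byte i of a 64-bit packed value) are hypothetical. The C conditional expression stands in for a conditional move instruction; a compiler may or may not lower it to one, and this sketch selects between the new byte and the existing destination byte before storing, which is only one possible way such a sequence might be arranged.

```c
#include <stdint.h>

/* Branch-free variant: for every byte position, select either the new byte
 * or the byte already at the destination, then store the result.
 * Assumptions for illustration: eight bytes per packed value, byte 0 in the
 * lowest-order position, and mask bit i selecting byte i. */
static void masked_write_select(uint8_t dest[8], uint64_t packed, uint8_t mask)
{
    for (unsigned i = 0; i < 8; i++) {
        /* Transfer one byte of the packed data to an integer variable. */
        uint8_t byte = (uint8_t)(packed >> (8u * i));

        /* Test the mask bit and use it as the condition of a select,
         * standing in for a conditional move instruction. */
        uint8_t selected = ((mask >> i) & 1u) ? byte : dest[i];

        /* The store itself is unconditional. */
        dest[i] = selected;
    }
}
```

No data-dependent branch remains inside the loop body, but the sequence still issues an extract, a test, a select and a store for every byte of the packed data.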
Moreover, in addition to the performance loss incurred by the need for multiple instructions, the cache management associated with these multiple instructions also results in a performance loss in prior art SIMD processor architectures. That is, those skilled in the art will appreciate that a move instruction is a series of write instructions at the micro-architecture level and, as such, requires a corresponding number of writes to the local processor cache before updating the desired register or main memory location. Thus, the prior art move instructions often result in a number of intermediate writes to the local processor cache, wherein much of the data written to the cache will never again be accessed by the processor, resulting in wasted cache resources.
Thus, a need exists for an improved SIMD architecture which utilizes the packed data format in a more effective manner. Those skilled in the art will appreciate that the teachings of the present invention achieve these and other desired results, as will become apparent from the description to follow.