Applications of modern computer systems, such as multimedia and scientific modeling, demand increasing speed and data-handling capability. For example, multimedia systems are generally designed to perform video and audio data compression, decompression, and high-performance manipulation such as 3-dimensional imaging. Graphic image rendering, such as the generation of 3-dimensional computer images, requires massive data manipulation and an extraordinary amount of high-performance arithmetic, including vector-matrix operations such as dot product, vector cross-product and vector transposition.
To perform large vector-matrix operations at high speed in a register-based system, a method of fast, efficient vector register loading is required. In modern data processing systems, a critical speed path lies between the cache and the register file, so the load and store functions must be optimized to provide the greatest speed possible. In the prior art, a load is performed by retrieving a cache block from a cache and loading it into a register file. In most systems, the data is re-aligned, or shifted, from its arbitrary alignment in memory to a proper vector alignment in the register by passing the data through an alignment multiplexer placed in the data path between the memory and the register file. This alignment is required because a vector stored in memory is a sequential string of bytes that may have no natural alignment in memory. Because the data is retrieved from memory on an address boundary, the alignment multiplexer shifts it before it is loaded into the registers so that the beginning of the vector falls at the beginning of the register. Thus, one limitation of prior art high-speed data processing systems is the inclusion of an alignment multiplexer circuit in a critical data path, which slows register loading and reduces the achievable clock frequency.
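The prior-art load path described above can be illustrated with a minimal software sketch. It is a behavioral model only, not a description of any particular circuit. The 16-byte block width, the fetching of two adjacent blocks, and the names `BLOCK`, `fetch_block`, `align_mux` and `load_unaligned_vector` are assumptions introduced for illustration.

```python
BLOCK = 16  # assumed width, in bytes, of both a cache block and a vector register


def fetch_block(memory: bytes, block_index: int) -> bytes:
    """Return the aligned BLOCK-byte block with the given index.

    Blocks that run past the end of memory are zero-padded (an assumption).
    """
    start = block_index * BLOCK
    return memory[start:start + BLOCK].ljust(BLOCK, b"\x00")


def align_mux(low: bytes, high: bytes, shift: int) -> bytes:
    """Model the alignment multiplexer.

    Selects BLOCK bytes starting `shift` bytes into the concatenation of
    two adjacent aligned blocks.
    """
    return (low + high)[shift:shift + BLOCK]


def load_unaligned_vector(memory: bytes, addr: int) -> bytes:
    """Load a BLOCK-byte vector that begins at an arbitrary byte address."""
    block_index, shift = divmod(addr, BLOCK)
    low = fetch_block(memory, block_index)
    high = fetch_block(memory, block_index + 1)
    # Every load passes through the shift stage, even when shift == 0.
    return align_mux(low, high, shift)


memory = bytes(range(64))
vector = load_unaligned_vector(memory, 5)  # bytes 5 through 20 of memory
```

In this model the call to `align_mux` sits between the fetch and the register write on every load, which corresponds to the multiplexer occupying the critical data path.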
In addition, to perform large vector-matrix operations at high speed, a method of fast and efficient data permutation is required. In a register-based computer architecture, permutation of data is commonly done by reading data from a register and rearranging the data into another register. In the prior art, such permutation of data is performed by loading input bytes (i.e., an input data vector) into a first register and loading a control vector into a second register. The control vector indicates how the input data vector is to be rearranged in an output register to implement a given function. Such systems limit the processor to performing unary serially dependent functions (e.g., Y=f1(f2(f3(f4(. . . f(A) . . . ))))), because only a single input operand is available. To perform a serially dependent vector computation, the control register is loaded with a control vector to perform the desired function and the input register is loaded with the previous result operand of the function chain. Therefore, another significant limitation of prior art high-speed data processing systems is that they cannot perform a serially dependent chain of binary (or higher N-ary) functions (e.g., Y=f1(f2(f3(f4(. . . f(A, B) . . . ))))), which severely limits the types of vector operations that the prior art processors can perform.
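The single-operand permutation described above can be sketched as follows. This is an illustrative model under stated assumptions: each control byte is taken to be an index selecting one byte of the input data vector, and the names `permute` and `run_unary_chain` are hypothetical.

```python
from typing import Sequence


def permute(data: bytes, control: bytes) -> bytes:
    """Rearrange `data` so that output byte i is data[control[i]].

    Assumes each control byte is a plain index into the single input vector.
    """
    if any(index >= len(data) for index in control):
        raise ValueError("control vector selects a byte outside the input vector")
    return bytes(data[index] for index in control)


def run_unary_chain(data: bytes, controls: Sequence[bytes]) -> bytes:
    """Evaluate a serially dependent chain of unary permutations.

    The first control vector in `controls` is the innermost function; each
    later step takes the previous result as its only input operand.
    """
    result = data
    for control in controls:
        result = permute(result, control)
    return result


reverse = bytes([3, 2, 1, 0])      # output[i] = input[3 - i]
rotate_left = bytes([1, 2, 3, 0])  # output[i] = input[(i + 1) % 4]
result = run_unary_chain(b"ABCD", [reverse, rotate_left])  # b"CBAD"
```

Because `permute` accepts only one data operand, no step in the chain can select bytes from a second vector B, which is the restriction to unary functions noted above.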
As will now be appreciated, it would be advantageous to provide a data processing system that performs fast and efficient data permutation and register loading. Such a system would provide aligned data vectors within the register file without requiring an alignment multiplexer and would therefore increase processing speed. Further, it would be desirable for such a system to be capable of executing a serially dependent chain of N-ary functions.