This invention relates to information processors, such as microprocessors, and, more particularly, to a method and apparatus which improves the operation of information processors having a vector processing unit by increasing the efficiency with which vectors are loaded into registers and stored to memory.
The electronics industry is in a state of evolution spurred by the seemingly unquenchable desire of the consumer for better, faster, smaller, cheaper and more functional electronic devices. In its attempt to satisfy these demands, the electronics industry must constantly strive to increase the speed at which functions are performed by data processors. Videogame consoles are one primary example of an electronic device that constantly demands greater speed and reduced cost. These consoles must be high in performance and low in cost to satisfy the ever increasing demands placed on them. The instant invention is directed to increasing the efficiency with which certain vectors are loaded into registers and stored to memory, as well as to decreasing the amount of memory required to store certain vectors.
Microprocessors typically have a number of execution units for performing mathematical operations. One example of an execution unit commonly found on microprocessors is a fixed point unit (FXU), also known as an integer unit, designed to execute integer (whole number) data manipulation instructions using general purpose registers (GPRs) which provide the source operands and the destination results for the instructions. Integer load instructions move data from memory to GPRs and store instructions move data from GPRs to memory. An exemplary GPR file may have 32 registers, wherein each register has 32 bits. These registers are used to hold and store integer data needed by the integer unit to execute integer instructions, such as an integer add instruction, which, for example, adds an integer in a first GPR to an integer in a second GPR and then places the result thereof back into the first GPR or into another GPR in the general purpose register file.
Another type of execution unit found on most microprocessors is a floating point unit (FPU), which is used to execute floating point instructions involving non-integers or floating point numbers. Floating point numbers are represented in the form of a mantissa and an exponent, such as 6.02×10^3. A floating point register file containing floating point registers (FPRs) is used in a similar manner as the GPRs are used in connection with the fixed point execution unit, as explained above. In other words, the FPRs provide source operands and destination results for floating point instructions. Floating point load instructions move data from memory to FPRs and store instructions move data from FPRs to memory. An exemplary FPR file may have 32 registers, wherein each register has 64 bits. These registers are used to hold and store floating point data needed by the floating point execution unit (FPU) to execute floating point instructions, such as a floating point add instruction, which, for example, adds a floating point number in a first FPR to a floating point number in a second FPR and then places the result thereof back into the first FPR or into another FPR in the floating point register file.
Microprocessors having floating point execution units typically enable data movement and arithmetic operations on two floating point formats: double precision and single precision. In the example of the floating point register file described above having 64 bits per register, a double precision floating point number is represented using all 64 bits of the FPR, while a single precision number only uses 32 of the 64 available bits in each FPR. Generally, microprocessors having single precision capabilities have single precision instructions that use a double precision format.
For applications that perform low precision vector and matrix arithmetic, a third floating point format is sometimes provided which is known as paired singles. The paired singles capability can improve performance of an application by enabling two single precision floating point values to be moved and processed in parallel, thereby substantially doubling the speed of certain operations performed on single precision values. The term "paired singles" means that the floating point register is logically divided in half so that each register contains two single precision values. In the example 64-bit FPR described above, a pair of single precision floating point numbers comprising 32 bits each can be stored in each 64-bit FPR. Special instructions are then provided in the instruction set of the microprocessor to enable paired single operations which process each 32-bit portion of the 64-bit register in parallel. The paired singles format basically converts the floating point register file to a vector register file, wherein each vector has a dimension of two. As a result, part of the floating point execution unit becomes a vector processing unit (paired singles unit) in order to execute the paired singles instructions.
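As an illustration of the register layout just described, the following Python sketch models a 64-bit FPR holding two single precision values and a paired singles add that operates on both halves. The function names, and the choice to place the first value in the upper 32 bits, are assumptions made for this example only; the text does not specify them.

```python
import struct
from typing import Tuple


def pack_paired_singles(ps0: float, ps1: float) -> int:
    """Return a 64-bit register image holding two single precision values.

    Assumption: ps0 occupies the upper 32 bits and ps1 the lower 32 bits.
    """
    hi = struct.unpack(">I", struct.pack(">f", ps0))[0]
    lo = struct.unpack(">I", struct.pack(">f", ps1))[0]
    return (hi << 32) | lo


def unpack_paired_singles(reg: int) -> Tuple[float, float]:
    """Split a 64-bit register image back into its two single precision values."""
    hi = (reg >> 32) & 0xFFFFFFFF
    lo = reg & 0xFFFFFFFF
    ps0 = struct.unpack(">f", struct.pack(">I", hi))[0]
    ps1 = struct.unpack(">f", struct.pack(">I", lo))[0]
    return ps0, ps1


def paired_singles_add(reg_a: int, reg_b: int) -> int:
    """Model one paired singles add: both 32-bit halves are added element-wise."""
    a0, a1 = unpack_paired_singles(reg_a)
    b0, b1 = unpack_paired_singles(reg_b)
    return pack_paired_singles(a0 + b0, a1 + b1)
```

A single call to the hypothetical `paired_singles_add` stands in for one instruction that would otherwise take two separate single precision adds.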
Some information processors, from microprocessors to supercomputers, have vector processing units specifically designed to process vectors. A vector is basically an array or set of values. In contrast, a scalar includes only one value, such as a single number (integer or non-integer). A vector may have any number of elements ranging from 2 to 256 or more. Supercomputers typically provide large dimension vector processing capabilities. On the other hand, the paired singles unit on the microprocessor described above involves vectors with a dimension of only 2. In either case, in order to store vectors for use by the vector processing unit, vector registers are provided which are similar to those of the GPR and FPR register files as described above, except that the register size typically corresponds to the dimension of the vector on which the vector processing unit operates. For example, if the vector includes 64 values (such as integers or floating point numbers) each of which requires 32 bits, then each vector register will have 2048 bits which are logically divided into 64 32-bit sections. Thus, in this example, each vector register is capable of storing a vector having a dimension of 64. FIG. 2 shows an exemplary vector register file 116 storing four 64 dimension vectors A, B, C and D.
A primary advantage of a vector processing unit with vector registers as compared to a scalar processing unit with scalar registers is demonstrated with the following example: Assume vectors A and B are defined to have a dimension of 64, i.e. A=(A0 . . . A63) and B=(B0 . . . B63). In order to perform a common mathematical operation such as an add operation using the values in vectors A and B, a scalar processor would have to execute 64 scalar addition instructions so that the resulting vector would be R=((A0+B0) . . . (A63+B63)). Similarly, in order to perform a common operation known as Dot_Product, wherein corresponding values in vectors A and B are multiplied together and the elements of the resulting vector are then added together to provide a resultant scalar, 128 scalar instructions would have to be performed (64 multiplications and 64 additions). In contrast, in vector processing a single vector addition instruction and a single vector Dot_Product instruction can achieve the same result. Moreover, each of the corresponding elements in the vectors can be processed in parallel when executing the instruction. Thus, vector processing is very advantageous in many information processing applications.
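The instruction counts in this example can be checked with a small Python model. This is only a sketch: each loop step stands in for one scalar instruction, the function names are invented for the illustration, and the dot product is assumed to accumulate into a register that starts at zero, which is what yields 64 additions.

```python
DIM = 64  # vector dimension used in the example above


def scalar_add(a, b):
    """Element-wise add done one scalar instruction at a time."""
    result, instructions = [], 0
    for i in range(DIM):
        result.append(a[i] + b[i])  # one scalar add
        instructions += 1
    return result, instructions


def scalar_dot_product(a, b):
    """Dot product done with scalar multiplies and scalar adds."""
    accumulator, instructions = 0, 0
    for i in range(DIM):
        product = a[i] * b[i]       # one scalar multiply
        instructions += 1
        accumulator += product      # one scalar add into the accumulator
        instructions += 1
    return accumulator, instructions


def vector_add(a, b):
    """Same result as scalar_add, counted as a single vector instruction."""
    return [x + y for x, y in zip(a, b)], 1
```

With these definitions, `scalar_add` reports 64 instructions and `scalar_dot_product` reports 128, against a count of 1 for the modeled vector add.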
One problem, however, that is encountered in vector processing, is that sometimes the nature of the vector data used by a particular application does not correspond to the typical vector for which the vector registers are designed. Specifically, the data used by a particular application may have fewer data values (i.e. a smaller dimension of actual data) in each vector than the total number of data values that the vector register can hold and for which the vector load and store instructions are designed. For example, a particular application may use vectors having only 30 real data values (i.e. A0 to A29), while the vector processing unit may be designed to operate on vectors having a dimension of 64 (i.e. A0 to A63). In order to properly execute vector load and store instructions, the vector registers must have 64 data values. As a result, even if the actual data for a particular application has only 30 data values, the vector register must still be loaded with 64 data values from memory. Thus, constants, such as zeros, are loaded from memory into the lower order locations in the vector register that do not contain actual data (e.g. A30-A63). Moreover, when storing such a vector to memory, the actual data as well as the appended zeros must be stored to memory in order to comprise a complete vector of 64 data values. In other words, significant inefficiencies occur in vector processing when the actual data does not fill the entire vector, due to the fact that filler data, such as zeros, must be loaded along with the actual data in the vector register in order to completely fill the register. In addition, the filler data, which is not actual or useful data, must be stored to memory with the actual data when the vector register is stored to memory. Loading and storing all of the filler data (zeros in this example) constitutes a significant waste of bus bandwidth.
In addition, this situation results in a significant waste of memory by having to store the filler data in memory as part of the vector.
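To put rough numbers on this waste, the sketch below computes how many bytes a conventional full-vector transfer moves versus how many carry real data. The 32-bit element size is taken from the earlier 64-element example; the function name is hypothetical.

```python
def padding_overhead(real_values, dimension, bits_per_value=32):
    """Return (total_bytes, useful_bytes, filler_bytes) for one full-vector transfer."""
    total_bytes = dimension * bits_per_value // 8
    useful_bytes = real_values * bits_per_value // 8
    return total_bytes, useful_bytes, total_bytes - useful_bytes


total, useful, filler = padding_overhead(real_values=30, dimension=64)
filler_fraction = filler / total
```

For the 30-of-64 case this gives 256 bytes transferred per load or store, of which 120 are real data and 136 (a little over half) are filler.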
As can be seen in FIG. 1a, the typical format for a vector load instruction 100 includes a primary op-code 102, a source address 104, and a destination register indicator 106. The primary op-code identifies the particular type of instruction, which in this instance is a vector load instruction. The op code may, for example, comprise the most significant 6 bits (bits 0-5) of the instruction. The source address 104 provides the particular address of the location in memory where the subject vector to be loaded by the instruction is located. The destination register indicator 106 provides the particular vector register in the vector register file in which the subject vector is to be loaded. It is noted that the vector load instruction format 100 of FIG. 1a is only exemplary and that prior art vector load instructions may have other formats and/or include other parts, such as a secondary op-code, status bits, etc., as one skilled in the art will readily understand. However, as explained above, regardless of the particular format of the instruction, the instruction still requires that a complete vector be loaded from memory to the vector register. Thus, in the above example, all 64 vector register locations must be loaded with data from memory, regardless of how many actual or real data values exist. Thus, for the conventional instruction format shown in FIG. 1a, the memory must contain 64 data values, regardless of the actual number of real data values.
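The op-code placement described here can be sketched as follows. The text only states that the primary op-code may occupy the most significant 6 bits (bits 0-5); the 32-bit instruction width and the helper names below are assumptions made for illustration, and the remaining 26 bits are treated as an opaque field because their layout is not specified.

```python
INSTRUCTION_BITS = 32  # assumption: fixed 32-bit instruction word
OPCODE_BITS = 6        # from the text: bits 0-5, the most significant bits
OPCODE_SHIFT = INSTRUCTION_BITS - OPCODE_BITS
OPCODE_MASK = (1 << OPCODE_BITS) - 1


def make_instruction(opcode, remaining_fields=0):
    """Place a 6-bit primary op-code above 26 bits of other (unspecified) fields."""
    if not 0 <= opcode <= OPCODE_MASK:
        raise ValueError("primary op-code must fit in 6 bits")
    return (opcode << OPCODE_SHIFT) | (remaining_fields & ((1 << OPCODE_SHIFT) - 1))


def primary_opcode(instruction):
    """Extract the primary op-code from the most significant 6 bits."""
    return (instruction >> OPCODE_SHIFT) & OPCODE_MASK
```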
Similarly, as can be seen in FIG. 1b, a typical vector store instruction 108 includes a primary op-code 110, a source register indicator 112, and a destination address 114. The primary op-code identifies the particular type of instruction, which in this instance is a vector store instruction. The op-code may, for example, comprise the most significant 6 bits (bits 0-5) of the instruction. The source register indicator 112 provides the particular vector register in the vector register file which is to be stored to memory by the instruction.
The destination address 114 provides the particular address in memory where the vector is to be stored by the instruction. It is noted that the vector store instruction format 108 of FIG. 1b is only exemplary and that prior art vector store instructions may have other formats and/or include other parts, such as a secondary op-code, status bits, etc., as one skilled in the art will readily understand. However, as explained above, regardless of the particular format of the instruction, the instruction still requires that a complete vector be stored to memory. Thus, in the above example, all 64 vector register locations would be stored to memory, regardless of how many actual or real data values exist in the vector.
As explained above, the conventional load and store instructions do not operate efficiently when the actual data does not correspond to the vector size defined for a particular vector processing unit. Accordingly, a need exists for improving vector load and store instructions for cases in which the actual data values do not fill the entire vector, so that the operations associated therewith can be performed faster and more efficiently and so that less memory can be used.
The instant invention provides a mechanism and a method for enabling vector load and store instructions to execute more efficiently and with less memory usage by eliminating the need to load useless data from memory into vector registers and to store that same useless data in memory. The invention provides an improved instruction format which may be used in connection with any suitable type of data processor, from microprocessors to supercomputers, having a vector processing unit in order to improve the operational efficiency of vector load and store instructions in instances where the entire vector is not needed to store the data for a particular application.
In accordance with the invention, the improved vector load and store instruction formats have an embedded bit or a plurality of embedded bits that identify the end of the useful data in the vector which is the subject of the instruction. In this way, the load/store unit of the data processor can use the information provided by the embedded bit(s) to load only the actual data into the vector register, and to store only the actual data to memory. Thus, the improved instruction format eliminates the need to load filler data, such as zeros, from memory and to store the filler data to memory.
In accordance with a preferred embodiment of the invention, the improved load instruction format includes a primary op code, a source address, at least one position bit which indicates the end of the useful data in the vector, a value field providing a constant that is used by the load/store unit to set the remaining vector register locations to the constant, and a destination register indicator which provides the particular vector register in the vector register file that is to be loaded. Using this load instruction format enables the load/store unit (LSU) to only load the useful data from memory and to set the remaining vector locations to the constant.
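The behavior of such a load can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the actual hardware behavior: memory is modeled as a list of element values, the position field is interpreted as the index of the last useful element, and the function and parameter names are invented for the example.

```python
def vector_load_partial(memory, source_address, end_position, fill_value, dimension=64):
    """Model the improved load: fetch only the useful elements, fill the rest.

    Assumption: end_position is the index of the last useful element, so
    end_position + 1 elements are read from memory starting at source_address.
    """
    if not 0 <= end_position < dimension:
        raise ValueError("end_position must be a valid element index")
    useful_count = end_position + 1
    useful = memory[source_address:source_address + useful_count]
    # The remaining register locations are set to the constant, not read from memory.
    return useful + [fill_value] * (dimension - useful_count)


memory = list(range(100))
register = vector_load_partial(memory, source_address=10, end_position=29, fill_value=0)
```

Here only 30 elements are read from the modeled memory; the other 34 register locations receive the constant without any memory traffic.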
In accordance with a preferred embodiment of the invention, the improved store instruction format includes a primary op code, a source register indicator which provides the particular vector register that is to be stored, at least one position bit that indicates the end of the useful data in the vector register, and a destination address in memory where the vector is to be stored. Using this store instruction format enables the load/store unit (LSU) to only store the useful data in the vector register to memory, thereby eliminating the need to store the constants or filler data present in the vector register.
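A matching sketch for the store side is given below, under the same assumptions as before: memory is a Python list of element values, the position field is taken to be the index of the last useful element, and all names are hypothetical.

```python
def vector_store_partial(memory, destination_address, register, end_position):
    """Model the improved store: write only the useful elements to memory.

    Assumption: end_position is the index of the last useful element.
    Returns the number of elements actually written.
    """
    if not 0 <= end_position < len(register):
        raise ValueError("end_position must be a valid element index")
    useful_count = end_position + 1
    memory[destination_address:destination_address + useful_count] = register[:useful_count]
    return useful_count


store_memory = [-1] * 100
store_register = list(range(30)) + [0] * 34  # 30 real values followed by filler
written = vector_store_partial(store_memory, 5, store_register, end_position=29)
```

Only 30 elements are written; the memory locations that would have held the 34 filler values are left untouched.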
The number of bits needed to indicate the end of the useful data within a particular vector depends on the particular dimension of the vector involved. For example, if the vector has a dimension of 64, then six bits are needed to provide a unique identifier for the particular ending location of the useful data in the vector. In other words, if the dimension of the vector is 2^n, then n bits are needed, in this embodiment, to indicate the ending location of the useful data.
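The relationship between vector dimension and position field width can be expressed directly. The helper below is a sketch with an invented name; it handles only dimensions that are powers of two, since that is the case the text describes.

```python
def position_bits(dimension):
    """Return n such that dimension == 2**n, i.e. the width of the position field."""
    if dimension < 1 or dimension & (dimension - 1):
        raise ValueError("dimension must be a power of two")
    return dimension.bit_length() - 1
```

For the 64-element example this yields 6 bits, and for the paired singles case (dimension 2) it yields a single bit.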
In another embodiment of the improved load and store instructions of the instant invention, the position bit(s) and the value field are essentially combined into one bit which controls whether the entire vector register or just a portion thereof is loaded and stored, respectively. It is noted, however, that the invention is not limited to any particular implementation of the location indicator and the value field. Instead, the invention covers any suitable way in which the location of the end of the useful data within the vector can be represented or embedded in the bit format comprising the instruction, as well as any suitable way in which the load instruction can indicate to the load/store unit that a particular constant should be used in setting the unused elements in the vector register.
In a preferred embodiment, the invention is implemented on a microprocessor, such as the microprocessors in IBM's PowerPC (IBM Trademark) family of microprocessors (hereafter "PowerPC"), wherein the microprocessor has been modified or redesigned to include a vector processing unit, such as a paired singles unit. For more information on the PowerPC microprocessors see PowerPC 740 and PowerPC 750 RISC Microprocessor Family User Manual, IBM 1998 and PowerPC Microprocessor Family: The Programming Environments, Motorola Inc. 1994, both of which are hereby incorporated by reference in their entirety.
In the modified PowerPC example described above, the paired singles operation may be selectively enabled by, for example, providing a hardware implementation specific special purpose register (e.g. HID2) having a bit (e.g. 3rd bit) which controls whether paired single instructions can be executed. Other bits in the special purpose register can be used, for example, to control other enhancement options that may be available on the microprocessor.
The invention also provides specific instruction definitions for paired singles load and store instructions. The invention is also directed to a decoder, such as a microprocessor or a virtual machine (e.g. software implemented hardware emulator), which is capable of decoding any or all of the particular instructions disclosed herein. The invention further relates to a storage medium which stores any or all of the particular instructions disclosed herein.