This invention relates generally to computer systems, and more particularly to parallel processing of functional operations in computer systems.
As is known in the art, computer systems generally include at least one central processing unit (CPU), a main memory, and a system bus coupling the main memory to the CPU. Often the computer system is controlled through an operating system software program, and the computer system will also generally include a compiler program to translate a user program, typically written in a high-level programming language, into a machine-language program. The machine-language program is an executable file of instructions which the CPU is capable of executing. The operating system typically stores the executable file in main memory, where the CPU may access it via the system bus.
The CPU generally includes a processor chip and retrieving logic. The processor directs the retrieving logic to retrieve instructions from executable files in main memory. An instruction cache memory (I-cache) may also be resident on the CPU and may be used by the retrieving logic to temporarily store instructions prior to and after their execution. The processor also directs control logic on the CPU to unload instructions from the I-cache and execute them.
An instruction is a group of bits partitioned into an operational code (i.e., op-code) which defines an operation or multiple operations for the control logic of the processor to execute. The operational code further provides addresses of registers which contain the operands of the operation (i.e., data inputs to be used during execution of the operation) and the address of a register to be used to store the result of the operation. The operations specified by the op-code include functional operations (e.g., add, subtract, multiply, shift, complement) or null operations (i.e., NOP).
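As a minimal sketch of how such a group of bits might be partitioned, the following assumes a hypothetical 32-bit layout with one 8-bit op-code field, two operand register addresses, and one result register address. The field widths, op-code values, and the names `encode` and `decode` are illustrative assumptions and are not taken from the description above.

```python
# Hypothetical 32-bit layout (an assumption for illustration only):
#   bits 31-24: op-code, bits 23-16: operand register A,
#   bits 15-8: operand register B, bits 7-0: result register.
OPCODES = {0x00: "NOP", 0x01: "ADD", 0x02: "SUB", 0x03: "MUL"}
OPCODE_VALUES = {name: value for value, name in OPCODES.items()}


def encode(op, src_a, src_b, dest):
    """Pack an operation and its register addresses into one 32-bit word."""
    return (OPCODE_VALUES[op] << 24) | (src_a << 16) | (src_b << 8) | dest


def decode(word):
    """Split a 32-bit word back into its op-code and register fields."""
    return {
        "op": OPCODES[(word >> 24) & 0xFF],
        "src_a": (word >> 16) & 0xFF,
        "src_b": (word >> 8) & 0xFF,
        "dest": word & 0xFF,
    }


word = encode("ADD", 1, 2, 3)  # ADD R1, R2 -> R3
fields = decode(word)
```

Here `fields["op"]` names the functional operation and the remaining fields name the operand and result registers.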
In the past, CPUs were capable of executing only one operation per processor clock cycle. Therefore, the compiler program translated the user program into an executable file of instructions each of which included only a single operation. Thus, the processor executed the operations serially (i.e., one operation at a time).
In order to increase the performance of the computer system, CPUs were developed which included processors and control logic capable of executing a predetermined number of operations simultaneously (i.e., the predetermined number of operations would be executed during a single processor clock cycle). Thus, the CPU executes the predetermined number of operations in parallel (i.e., parallel processing).
One problem with executing operations in parallel is that values (i.e., operands) used by the hardware to complete operations may not be available when instructions are executed in parallel. For example, if a first operation provides a result and stores the result in a register X and a second operation needs the result of the first operation stored in register X, these two operations may not be issued in parallel. This situation is an example of a so-called data precedence or dependency constraint. That is, here the second operation requires the result of the first operation, and thus, it cannot be issued in parallel with the first operation. Other data precedence or dependency constraints include latencies or delays associated with the results of one operation which are necessary to another operation, and limited resources which require one operation to be scheduled in a subsequent instruction. Thus, all of the operations which make up a user program may not be executed in parallel due to data precedence constraints.
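The register X example can be sketched as a simple check. This is an illustrative model only: the `Op` record and the function names are assumptions. It covers just the case where a later operation reads a register that an earlier operation writes, and it ignores the latency and limited-resource constraints mentioned above.

```python
from collections import namedtuple

# Hypothetical operation record: a name, one result register, operand registers.
Op = namedtuple("Op", ["name", "dest", "srcs"])


def depends_on(later, earlier):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest is not None and earlier.dest in later.srcs


def can_issue_together(first, second):
    """Two operations may share a clock cycle only if `second` does not need `first`'s result."""
    return not depends_on(second, first)


first = Op("ADD", "X", ("R1", "R2"))        # stores its result in register X
second = Op("SUB", "R3", ("X", "R4"))       # needs the value in register X
independent = Op("MUL", "R6", ("R1", "R2"))  # does not touch register X

blocked = not can_issue_together(first, second)
allowed = can_issue_together(first, independent)
```

In this sketch `blocked` and `allowed` are both true: the pair sharing register X must be serialized, while the independent pair may be issued in parallel.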
In a multiple instruction issue computer system, generally either software (i.e., a compiler program) or hardware (i.e., control logic) is used to determine which functional operations may be executed in parallel. An example of a system which uses software is a Very Long Instruction Word (VLIW) parallel processing computer system. The compiler program of a VLIW computer system determines which functional operations may be executed in parallel while translating the user program into an executable file. When providing and storing the executable file in main memory, if the compiler determines that it is necessary for an instruction to include fewer than the predetermined number of functional operations due to data precedence constraints of a particular functional operation, a null operation (i.e., NOP) is substituted for the particular functional operation in the current instruction and the particular functional operation is put in a subsequent instruction. A NOP does not change the status of the control logic and hence does not contribute to the result of the executable file being executed. The control logic in a VLIW machine merely unloads the predetermined number of operations (including functional operations and NOPs) from the I-cache and executes them in a single processor clock cycle without checking for data precedence constraints.
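The NOP substitution performed by a VLIW compiler can be sketched as follows. This is a deliberately simplified, in-order packing pass under stated assumptions: the `Op` record and `pack_vliw` are hypothetical names, only read-after-write dependencies are checked, and a real compiler would also reorder operations and account for latencies and resources.

```python
from collections import namedtuple

Op = namedtuple("Op", ["name", "dest", "srcs"])
NOP = Op("NOP", None, ())


def pack_vliw(ops, width):
    """Pack operations in order into fixed-width instructions, padding with NOPs.

    A new instruction is started when the current one is full or when an
    operation reads a register written earlier in the current instruction.
    """
    instructions = []
    current, written = [], set()
    for op in ops:
        if len(current) == width or any(src in written for src in op.srcs):
            # Fill the unused slots of the current instruction with NOPs.
            instructions.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append(op)
        if op.dest is not None:
            written.add(op.dest)
    if current:
        instructions.append(current + [NOP] * (width - len(current)))
    return instructions


program = [
    Op("ADD", "R1", ("R2", "R3")),
    Op("SUB", "R4", ("R1", "R5")),  # needs the result of ADD
    Op("MUL", "R6", ("R7", "R8")),
]
packed = pack_vliw(program, width=2)
names = [[op.name for op in instruction] for instruction in packed]
```

With a width of two, `names` is `[["ADD", "NOP"], ["SUB", "MUL"]]`: SUB is moved to the second instruction and a NOP takes its place in the first, so three functional operations occupy four slots.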
Due to the addition of NOPs, the size of the resulting executable file in a VLIW computer system is larger than an executable file containing only functional operations. This increase in size wastes main memory space and subsequently I-cache space when the retrieving logic retrieves instructions from main memory.
The addition of NOPs may also require the processor to make more accesses of main memory. The processor causes the control logic to execute instructions stored in the I-cache. If an instruction is not in the I-cache (i.e., an I-cache miss), the processor causes the retrieving logic to access main memory to retrieve the instruction. Generally, a cache line or predetermined amount of data is retrieved from main memory which may include many instructions. The retrieved instructions which may include NOPs are stored in the I-cache. The I-cache of a processor is typically considerably smaller than main memory, and therefore, multiple transactions, each of which will retrieve a cache-line of data, may need to be initiated with main memory to retrieve an entire executable file especially where the executable file has been enlarged by the addition of NOPs.
Thus, the NOPs added to the executable file by the compiler waste I-cache space and may require the processor to make additional main memory accesses to retrieve all the functional operations necessary to accomplish the results desired by the original user program. These additional main memory accesses may degrade the performance of the computer system where other CPUs or devices are resident on the system bus and capable of gaining control of the system bus, but are denied control of the system bus while a CPU conducts the additional main memory accesses.
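The effect of NOP padding on main memory accesses can be illustrated with simple arithmetic. The sketch below assumes a cold I-cache, a fixed operation size, and that every cache line of the executable file is fetched exactly once. The function name and the figures used are hypothetical.

```python
def cache_line_fetches(num_ops, op_bytes, line_bytes):
    """Number of cache-line transactions needed to retrieve `num_ops` operations."""
    total_bytes = num_ops * op_bytes
    return -(-total_bytes // line_bytes)  # ceiling division


# Hypothetical figures: 6 functional operations padded with 4 NOPs,
# 4-byte operations, and 16-byte cache lines.
without_nops = cache_line_fetches(6, op_bytes=4, line_bytes=16)
with_nops = cache_line_fetches(10, op_bytes=4, line_bytes=16)
```

Under these assumptions the padded file needs three cache-line transactions where the unpadded file needs two, although both carry the same six functional operations.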
Another difficulty with VLIW computer systems occurs when new hardware implementations of the CPU are developed such that a different predetermined number of operations may be executed in parallel. Because the control logic in a VLIW computer system relies on the compiler to determine which functional operations may be executed in parallel, user programs must be re-compiled to account for the change in the predetermined number of operations which may be executed in parallel. This also prevents the use of functionally compatible CPUs having different hardware implementations on the same system bus, because they cannot share executable files.
One technique used in VLIW computer systems as an alternative to adding NOPs to the instructions in main memory is for the compiler program to translate a user program into an executable file including only functional operations and sets of control bits. A set of control bits is associated with each group of operations designated by the compiler to be executed in the same instruction (i.e., same processor clock cycle) and provides information to the retrieving logic on the CPU as to where NOPs should be loaded in the I-cache such that each instruction in the I-cache contains the predetermined number of operations. This technique may waste less main memory space.
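One way the retrieving logic might interpret such control bits is sketched below. The encoding is an assumption made for illustration: a bit mask with one bit per instruction slot, where a set bit means the slot receives the next stored functional operation and a clear bit means the slot receives a NOP. The description above does not specify the actual format of the control bits.

```python
NOP = "NOP"


def expand_group(ops, slot_mask, width):
    """Rebuild a full-width instruction from stored functional operations.

    `slot_mask` is a hypothetical control-bit encoding: bit i set means
    slot i takes the next functional operation; bit i clear means NOP.
    """
    remaining = iter(ops)
    instruction = []
    for slot in range(width):
        if slot_mask & (1 << slot):
            instruction.append(next(remaining))
        else:
            instruction.append(NOP)
    return instruction


# Two functional operations stored in main memory, expanded to four I-cache slots.
loaded = expand_group(["ADD", "MUL"], slot_mask=0b0101, width=4)
```

Here `loaded` is `["ADD", "NOP", "MUL", "NOP"]`: main memory holds only the two functional operations and the mask, while the I-cache still receives a full instruction containing NOPs.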
However, the retrieving logic is necessarily made more complicated in that the control bits must be interpreted to properly load the I-cache. Additionally, the addressing of instructions in main memory is made more complex, because each instruction in main memory contains a variable number of operations, and thus, the amount of main memory space used to store each instruction is different and must be addressed differently. It may be possible to use a system bus over which a fixed amount of data (i.e., a cache-line of data) is always retrieved from main memory; however, in this case, that fixed amount of data may contain a variable number of instructions. The retrieving logic may either automatically retrieve a small amount of data which will contain a maximum number of instructions that the I-cache is always able to store, calculate the address range of an amount of data which will contain a number of instructions that the I-cache is currently able to store, or drop that portion of the data which is retrieved but which the I-cache cannot store (i.e., the data is still available in main memory, but not stored in the I-cache). In any of these cases, memory bandwidth may be degraded. This technique also does not solve the waste of I-cache space.
A superscalar parallel processing computer system uses hardware (i.e., control logic) to determine which functional operations may be executed in parallel. A compiler program is used to translate a user program into an executable file; however, the compiler need not determine which functional operations may be executed in parallel. Thus, if the compiler does not determine which functional operations may be executed in parallel, only functional operations necessary for the user program will be stored in main memory and subsequently stored in the I-cache of a CPU. As the control logic unloads functional operations from the I-cache in order to provide an instruction to be executed in the next processor clock cycle, the control logic determines whether there are data precedence constraints such that the functional operation must be executed in a subsequent instruction (i.e., processor clock cycle).
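The run-time check performed by superscalar control logic can be sketched in software as follows. This is an illustrative model, not a description of any particular hardware: the `Op` record and `next_issue_group` are hypothetical names, and the check covers only operations that read or write a register already written in the same cycle.

```python
from collections import namedtuple

Op = namedtuple("Op", ["name", "dest", "srcs"])


def next_issue_group(stream, start, width):
    """Choose the operations to execute this cycle, beginning at stream[start].

    Returns the group and the index of the first operation left for a later cycle.
    """
    group, written = [], set()
    index = start
    while index < len(stream) and len(group) < width:
        op = stream[index]
        # Stop at the first operation that conflicts with a register
        # written by an operation already selected for this cycle.
        if any(src in written for src in op.srcs) or op.dest in written:
            break
        group.append(op)
        written.add(op.dest)
        index += 1
    return group, index


# The stored program holds functional operations only, with no NOPs.
stream = [
    Op("ADD", "R1", ("R2", "R3")),
    Op("SUB", "R4", ("R1", "R5")),  # needs the result of ADD
    Op("MUL", "R6", ("R7", "R8")),
]
cycle1, position = next_issue_group(stream, 0, width=2)
cycle2, position = next_issue_group(stream, position, width=2)
```

In this sketch the first cycle issues ADD alone and the second issues SUB and MUL together. The same stored stream can be passed to the function with a different `width`, since the grouping is decided at run time and not stored in the executable file.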
To improve performance, a compiler program which checks for data precedence constraints may be used with a superscalar computer system. Since the control logic is generally only capable of examining a small window of operations to determine data dependency, a compiler program which examines considerably more functional operations or the entire program can be used to increase superscalar computer system performance. Thus, the compiler program translates the user program into an executable file with functional operations stored so as to increase the number of operations which the control logic will permit to execute in parallel. Due to the data precedence checking conducted by the control logic, the compiler may, but need not, include NOPs; therefore, less main memory and I-cache space is wasted and user programs do not have to be re-compiled to be used on new hardware implementations.
However, the control logic for such a superscalar computer system is necessarily complex and requires a large amount of hardware to implement. Further, the data precedence determination made while unloading functional operations which will make up the next instruction from the I-cache generally consumes a significant amount of time. Thus, where one instruction is executed during each processor clock cycle (i.e., pipe depth equals one), the processor clock cycle time period may need to be increased to allow for data precedence checking of operations being unloaded from the I-cache before the instruction is executed. Alternatively, the number of clock cycles required to execute an instruction may be increased to allow for one or more clock cycles for unloading the operations from the I-cache and checking for data precedence constraints (i.e., pipe depth is greater than one). In either case, the performance of the computer system is reduced.
Another difficulty with superscalar computer systems is limited data lifetime. The control logic will not allow register values to be used for multiple purposes during single processor clock cycles. For example, the control logic will not allow the parallel execution of one functional operation which seeks to store a new value in a register with another functional operation which seeks to use the current value of the same register. The control logic in this situation would incorrectly determine a data precedence constraint.
As an example, suppose a portion of an executable file contains a multiplication operation (i.e., MUL1) and a subsequent subtraction operation (i.e., SUB1), and SUB1 is to operate on the current value stored in register five (i.e., R5) before the result of MUL1 is stored in R5. Upon making this determination, the compiler program of a VLIW computer system will properly provide both SUB1 and MUL1 in a single instruction (i.e., IN1). Thus, when the control logic of the VLIW computer system executes IN1 without checking for data precedence constraints, SUB1 will operate on the current value of R5 and then MUL1 will load R5 with a new value. The control logic of a superscalar computer system limits the data lifetime of register values and would determine incorrectly that SUB1 needs to operate on the value stored in R5 by the MUL1 operation, and therefore, the control logic would not execute the SUB1 operation in IN1, but rather, in a subsequent instruction. Thus, the SUB1 operation would incorrectly operate on the value stored in R5 by MUL1.
One approach to allow the superscalar computer system to correctly execute the user program would be to have the result of the MUL1 operation stored in another register. However, this could increase the number of necessary registers.
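The MUL1 and SUB1 example, together with the alternative of storing the MUL1 result in another register, can be sketched as a small simulation. Only MUL1, SUB1, and R5 come from the example above. The remaining registers, the initial values, and the function names are assumptions chosen for illustration.

```python
import operator


def execute_parallel(regs, ops):
    """VLIW-style: all operations read their operands before any result is stored."""
    results = [(dest, fn(regs[a], regs[b])) for dest, fn, a, b in ops]
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


def execute_serial(regs, ops):
    """Serialized execution: each operation sees the results of earlier ones."""
    new_regs = dict(regs)
    for dest, fn, a, b in ops:
        new_regs[dest] = fn(new_regs[a], new_regs[b])
    return new_regs


# Each operation is (result register, function, operand register, operand register).
MUL1 = ("R5", operator.mul, "R1", "R2")          # overwrites R5
SUB1 = ("R6", operator.sub, "R5", "R3")          # intended to read the old R5
MUL1_RENAMED = ("R7", operator.mul, "R1", "R2")  # result stored in another register

regs = {"R1": 2, "R2": 3, "R3": 1, "R5": 10, "R6": 0, "R7": 0}

vliw = execute_parallel(regs, [MUL1, SUB1])           # SUB1 reads the old R5
serial = execute_serial(regs, [MUL1, SUB1])           # SUB1 reads MUL1's result
renamed = execute_serial(regs, [MUL1_RENAMED, SUB1])  # SUB1 reads the old R5 again
```

With these values, R6 is 9 in the parallel case, 5 in the serialized case, and 9 again once the MUL1 result is redirected to R7. The redirected version restores the intended result at the cost of one additional register.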