The present invention relates to the field of computer processor architecture, and more particularly, to a system and method for dynamically translating instructions in an object-level instruction stream to improve processor performance.
Improved performance for computer processors is a continuing goal of computer designers. In general, improved performance may be defined as executing more instructions in less time. Steady increases in processor clock speeds have increased the rate at which instructions execute. In addition, efforts to increase parallelism, meaning the simultaneous or nearly simultaneous processing of multiple instructions, have increased processor throughput.
Pipelining is a known technique for achieving a form of parallel processing. In pipelining, the processing of each instruction is divided into stages, each executed by a separate hardware element. Such pipeline stages might include an address translation stage, an operand fetch stage, an execution stage, and an operand write-back stage. The stages are overlapped so that multiple instructions can be in progress at the same time, each in a different stage. In each clock cycle, one stage of every instruction in progress is executed, and, absent stalls caused by factors such as dependencies between instructions or branches, an average execution rate of one instruction per clock cycle is possible.
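The overlap described above can be sketched with an idealized timing model, assuming every stage takes exactly one clock cycle and no stalls occur. The stage names come from the text; the function names are hypothetical.

```python
# Idealized pipeline timing model: one cycle per stage, no stalls.
STAGES = ["address_translation", "operand_fetch", "execute", "write_back"]


def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to complete all instructions on a stall-free pipeline."""
    if num_instructions == 0:
        return 0
    # The first instruction takes one cycle per stage; each later
    # instruction completes one cycle after its predecessor.
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next starts."""
    return num_instructions * num_stages


def schedule(num_instructions: int, stages: list) -> list:
    """For each cycle, the index of the instruction in each stage (None if empty)."""
    table = []
    for cycle in range(pipeline_cycles(num_instructions, len(stages))):
        row = []
        for s in range(len(stages)):
            i = cycle - s  # instruction i occupies stage s during cycle i + s
            row.append(i if 0 <= i < num_instructions else None)
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(schedule(3, STAGES)):
        print(cycle, row)
    for n in (1, 10, 1000):
        piped = pipeline_cycles(n, len(STAGES))
        print(f"{n} instructions: {piped} cycles pipelined, "
              f"{unpipelined_cycles(n, len(STAGES))} unpipelined, "
              f"{n / piped:.3f} instructions/cycle")
```

For n instructions and k stages the model gives k + n - 1 cycles instead of n * k, so throughput n / (k + n - 1) approaches one instruction per cycle as n grows.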
Superscalar technology is among the developments in processor architecture that have advanced the parallelism afforded by pipelining. Superscalar processors provide multiple execution units, allowing for multiple pipelines. In pipelining, dependencies between instructions can cause wasted clock cycles, or pipeline “clogs.” Superscalar processors can alleviate such slowdowns by performing out-of-order execution: they look for instructions that have no dependencies on instructions still in progress and process those instructions out of program order.
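The out-of-order selection described above can be sketched as a simplified issue model, assuming single-cycle execution latency, a fixed issue width, and no register renaming, so that write-after-read and write-after-write conflicts, as well as true read-after-write dependencies, force an instruction to wait. The `Instr` record, the function names, and the four-instruction program are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str    # register written
    srcs: tuple  # registers read


def must_wait(later: Instr, earlier: Instr) -> bool:
    """True if `later` cannot pass `earlier` (RAW, WAR or WAW hazard)."""
    raw = earlier.dest in later.srcs
    war = later.dest in earlier.srcs
    waw = later.dest == earlier.dest
    return raw or war or waw


def out_of_order_issue(program: list, width: int) -> list:
    """Issue up to `width` ready instructions per cycle, skipping blocked ones.

    Assumes a result is usable in the cycle after it issues.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    issued = set()  # indices issued in earlier cycles
    cycles = []
    while len(issued) < len(program):
        this_cycle = []
        for i, instr in enumerate(program):
            if len(this_cycle) == width:
                break
            if i in issued:
                continue
            # Ready only if every earlier conflicting instruction issued in a
            # previous cycle; non-conflicting earlier ones may be passed.
            if all(j in issued for j in range(i) if must_wait(instr, program[j])):
                this_cycle.append(i)
        issued.update(this_cycle)
        cycles.append([program[i].name for i in this_cycle])
    return cycles


program = [
    Instr("I1", "r1", ("r2", "r3")),  # r1 = r2 * r3
    Instr("I2", "r4", ("r1", "r5")),  # r4 = r1 + r5  (needs I1)
    Instr("I3", "r6", ("r7", "r8")),  # r6 = r7 + r8  (independent)
    Instr("I4", "r9", ("r6", "r2")),  # r9 = r6 - r2  (needs I3)
]
print(out_of_order_issue(program, width=2))
```

With two execution units the model issues I1 and I3 in the first cycle and I2 and I4 in the second: I3 passes I2, which is still waiting for r1.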
Superscalar processors can also perform speculative processing to reduce the delays associated with resolving branch conditions. In such speculative processing, the processor performs branch prediction: it guesses whether a particular branch will be taken and fetches the corresponding instructions in advance. If the guess proves wrong, the results of the speculatively processed instructions are discarded.
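The text does not specify how the guess is made. One widely used scheme, shown here only as an illustration, keeps a 2-bit saturating counter per branch so that a single atypical outcome does not flip the prediction. The table size, the initial counter value, the branch address, and the names `TwoBitPredictor` and `run` are assumptions for this sketch.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address.

    Counter values 0-1 predict "not taken"; 2-3 predict "taken".
    """

    def __init__(self, table_size: int = 1024):
        self.table_size = table_size
        self.counters = [1] * table_size  # every entry starts weakly not-taken

    def _index(self, branch_addr: int) -> int:
        return branch_addr % self.table_size

    def predict(self, branch_addr: int) -> bool:
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        # Move one step toward the actual outcome, saturating at 0 and 3.
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run(predictor: TwoBitPredictor, trace: list) -> int:
    """Return how many outcomes in `trace` were predicted correctly."""
    correct = 0
    for addr, taken in trace:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)  # train on the resolved outcome
    return correct


# Hypothetical loop-closing branch at address 0x40: taken 9 times, then falls
# through, and the whole loop runs three times.
trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 3
print(run(TwoBitPredictor(), trace), "of", len(trace), "correct")
```

The single not-taken outcome at each loop exit only weakens the counter, so the branch is still predicted taken when the loop is entered again.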
Examples of superscalar processors include processors having a RISC (Reduced Instruction Set Computing) architecture. To boost processing speed, RISC processors typically use uniform instruction lengths, restrict memory access to a small number of load and store instructions, and provide more general-purpose registers than CISC (Complex Instruction Set Computing) processors.
RISC processor designs and techniques such as pipelining represent approaches to improving the performance of computer processors. However, opportunities exist for further improvement.
Typically, in a RISC-architected processor, parallel processing is effected by dispatching instructions to separate execution units for simultaneous execution. Ideally, this increases processor throughput. In practice, however, an instruction in one pipeline often must wait for the result of an instruction in another pipeline. This represents an opportunity for improved performance: multiple instructions that would otherwise be dispatched serially to separate execution units, for example instructions that share operands, can be combined into a single instruction dispatched to a single execution unit, thereby increasing the effective number of instructions dispatched per clock cycle.
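Combining two dependent instructions can be sketched as follows, assuming a hypothetical three-input `add3` operation that a single execution unit can perform. The tuple instruction format and the names `fuse_adds` and `execute` are illustrative assumptions, not a format defined in this specification. The rule fires only when the second add overwrites the first add's destination, a conservative choice that guarantees no later instruction can observe the intermediate value; the reference interpreter checks that the combined program computes the same register values.

```python
def fuse_adds(program: list) -> list:
    """Combine `add d, a, b` followed by `add d, d, c` into `add3 d, a, b, c`."""
    out = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (nxt is not None and cur[0] == "add" and nxt[0] == "add"
                and nxt[1] == cur[1]               # same destination register
                and nxt[2:].count(cur[1]) == 1):   # reads the intermediate once
            other = nxt[3] if nxt[2] == cur[1] else nxt[2]
            out.append(("add3", cur[1], cur[2], cur[3], other))
            i += 2  # two instructions replaced by one dispatch
        else:
            out.append(cur)
            i += 1
    return out


def execute(program: list, regs: dict) -> dict:
    """Reference interpreter used to check that fusion preserves results."""
    regs = dict(regs)
    for op, dest, *srcs in program:
        if op in ("add", "add3"):
            regs[dest] = sum(regs[s] for s in srcs)
        else:
            raise ValueError(f"unknown op {op}")
    return regs


program = [
    ("add", "r3", "r1", "r2"),  # r3 = r1 + r2
    ("add", "r3", "r3", "r4"),  # r3 = r3 + r4  (waits on the previous add)
    ("add", "r5", "r3", "r1"),  # r5 = r3 + r1
]
regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 4, "r5": 0}
fused = fuse_adds(program)
print(fused)
print(execute(program, regs) == execute(fused, regs))
```

The three-instruction sequence becomes two dispatches, and the first of them no longer has to wait on another execution unit.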
Opportunities also exist to make better use of the resources afforded by improved processor hardware. Advances in processor capabilities sometimes outpace the ability of software to exploit them fully. One example is software designed for an 8-bit register architecture when processor architectures with wider, 16-bit registers are available. Programs compiled to run on the older, 8-bit architecture represent “legacy code” that uses 16-bit capabilities inefficiently. Methods employed to take advantage of the 16-bit architecture have included recompiling legacy code to generate instructions that use 16-bit operands. Alternatively, architectures have been implemented that can execute both 8-bit and 16-bit instructions. Such architectures remain inefficient, however, because their 16-bit capabilities are underutilized whenever they run 8-bit legacy code.
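The inefficiency of 8-bit legacy code on 16-bit hardware can be illustrated with a 16-bit addition. The sketch assumes the legacy code performs it the way multi-byte addition is commonly done on 8-bit processors, as an add on the low bytes followed by an add-with-carry on the high bytes, whereas 16-bit hardware needs one add. The function names and instruction counts are illustrative assumptions.

```python
MASK8 = 0xFF
MASK16 = 0xFFFF


def add16_using_8bit_ops(a: int, b: int) -> tuple:
    """16-bit add as legacy 8-bit code performs it: ADD low, then ADC high.

    Returns (16-bit sum, number of instructions issued).
    """
    lo = (a & MASK8) + (b & MASK8)                        # 8-bit ADD on the low bytes
    carry = lo >> 8                                       # carry out of the low byte
    hi = ((a >> 8) & MASK8) + ((b >> 8) & MASK8) + carry  # 8-bit ADC on the high bytes
    return ((hi & MASK8) << 8) | (lo & MASK8), 2


def add16_native(a: int, b: int) -> tuple:
    """The same addition as one 16-bit ADD. Returns (16-bit sum, instructions issued)."""
    return (a + b) & MASK16, 1


a, b = 0x12FF, 0x0001
print(hex(add16_using_8bit_ops(a, b)[0]), hex(add16_native(a, b)[0]))
```

Both paths produce the same 16-bit result, but the legacy path spends two instructions, the second of which depends on the carry from the first.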
A similar situation occurs with VLIW (Very Long Instruction Word) processors: legacy code underutilizes their processing capabilities and, in some cases, cannot be executed on them at all.
Combination instructions are known in the prior art. Such techniques group instructions that do not depend on one another for simultaneous dispatch to available execution units. A dependent instruction must be withheld until the prior instructions have produced the operands it needs.
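The grouping of independent instructions can be sketched as follows, assuming instructions are scanned in program order, that a group closes either when it holds one instruction per execution unit or when the next instruction depends on a member of the group, and that "dependent" covers read-after-write, write-after-read and write-after-write register conflicts. The `Instr` record, the function names, and the example program are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str    # register written
    srcs: tuple  # registers read


def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if `later` conflicts with `earlier` on any register."""
    return (earlier.dest in later.srcs      # read-after-write
            or later.dest in earlier.srcs   # write-after-read
            or later.dest == earlier.dest)  # write-after-write


def group_for_dispatch(program: list, num_units: int) -> list:
    """Split a program, in order, into groups dispatched one group per cycle."""
    if num_units < 1:
        raise ValueError("num_units must be at least 1")
    groups = []
    current = []
    for instr in program:
        blocked = any(depends_on(instr, prev) for prev in current)
        if blocked or len(current) == num_units:
            groups.append(current)  # close the group; instr is withheld
            current = []
        current.append(instr)
    if current:
        groups.append(current)
    return [[i.name for i in g] for g in groups]


program = [
    Instr("I1", "r1", ("r2", "r3")),  # r1 = r2 * r3
    Instr("I2", "r4", ("r1", "r5")),  # r4 = r1 + r5  (needs I1)
    Instr("I3", "r6", ("r7", "r8")),  # r6 = r7 + r8  (independent)
    Instr("I4", "r9", ("r6", "r2")),  # r9 = r6 - r2  (needs I3)
]
print(group_for_dispatch(program, num_units=2))
```

Here I2 is withheld from the first group because it reads r1, and I4 is withheld from the second group because it reads r6, which I3 in that group is still producing, so the four instructions need three dispatch cycles.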
In view of the foregoing, further improvement in techniques for increasing parallelism and in utilization of available processor upgrades is needed.