I. Field of the Invention
This invention relates generally to computer technology, and more particularly, to improving processor performance in a computer system.
II. Background Information
Developers continually try to improve processor performance and reduce program execution time. Both can be improved using hardware and software techniques. Hardware techniques include pipelining, in which the fetch, decode, and execute stages are overlapped so that the processor operates on several instructions simultaneously. Software techniques include having a compiler optimize the program code. Normally, passes in the compiler transform a program written in a high-level language (e.g., the “C” programming language) into progressively lower-level representations, eventually reaching instructions drawn from the processor's instruction set. The instruction set is the collection of different instructions that the processor can execute (e.g., the Intel Architecture 32-bit (“IA-32”) instruction set from Intel Corporation).
An optimizing compiler is a compiler that analyzes its output to produce more efficient (smaller or faster) code. The optimizing compiler may use multiple passes to convert high-level code to low-level code (instructions from the instruction set). One way that the optimizing compiler improves program execution time is by reducing the code footprint (the number of instructions generated from the high-level program code). Reducing the code footprint improves program execution time because the program code has fewer instructions, and thus fewer instructions are fetched from a memory unit in the fetch stage (the memory unit is slower than the processor) and fewer instructions are decoded in the decode stage.
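As an illustration of footprint reduction, the following minimal sketch shows a compiler-style pass that replaces two adjacent instructions with one. The tuple format, the register names, and the fused "madd" opcode are assumptions made for this example and are not taken from any particular instruction set.

```python
# Hypothetical three-address instructions: (opcode, dest, src1, src2).
# "madd" is an assumed fused multiply-add opcode used only for illustration.

def reads(instr, reg):
    """Return True if the instruction uses `reg` as a source operand."""
    return reg in instr[2:]

def combine_mul_add(instructions):
    """Fuse 'mul t, a, b' followed by 'add d, t, c' into 'madd d, a, b, c'."""
    out = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        fusable = (
            nxt is not None
            and cur[0] == "mul" and nxt[0] == "add"
            and nxt[2] == cur[1]   # the add consumes the product
            and nxt[3] != cur[1]   # and does not use it twice
            # Conservative check: the temporary is never read again later.
            and not any(reads(later, cur[1]) for later in instructions[i + 2:])
        )
        if fusable:
            out.append(("madd", nxt[1], cur[2], cur[3], nxt[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

program = [
    ("mul", "t0", "a", "b"),
    ("add", "x", "t0", "c"),
    ("sub", "y", "x", "d"),
]
reduced = combine_mul_add(program)
print(len(program), "->", len(reduced), "instructions")
```

In this sketch the three-instruction program shrinks to two instructions, so one fewer instruction has to be fetched and decoded.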
Reducing the code footprint also improves processor performance because cache memory is better utilized. Almost all modern processors use cache memory. Cache memory is a special memory subsystem in which frequently used data values are duplicated for quick access. Cache memory is useful when main memory accesses are slow compared with processor speed, because cache memory is faster than main memory. Cache memory has to be efficiently utilized in order to obtain a high ratio of “hits” (i.e., the data is found in the cache memory and thus access to the main memory is avoided) to “misses” (i.e., the main memory is accessed in order to obtain the data). Because a cache miss requires additional time to retrieve the data into the cache, processing time is lost waiting for the data to arrive. An instruction cache is cache memory that stores instructions fetched from main memory. Reducing the code footprint allows more of the instructions that make up the program code to be stored in the instruction cache, thus increasing the likelihood of a cache hit and, in turn, processor performance. Other means of instruction storage can also benefit from code footprint reduction. For example, a trace cache stores instructions that have already been executed. Reducing the code footprint allows the trace cache to hold a larger share of the program's executed instructions, which increases the likelihood of cache hits and, in turn, processor performance.
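The effect of code footprint on the instruction cache hit ratio can be sketched with a small simulation. The model below is an assumption chosen for simplicity: a direct-mapped cache in which each line holds exactly one instruction, and a program that is a single loop executed repeatedly. The cache size, footprints, and iteration count are illustrative values only.

```python
def simulate_icache(footprint, cache_lines, iterations):
    """Count hits for a loop of `footprint` instructions run `iterations` times.

    Assumed model: direct-mapped cache, one instruction per cache line,
    instruction addresses 0 .. footprint-1 fetched in order.
    """
    cache = [None] * cache_lines   # address currently held by each line
    hits = 0
    accesses = 0
    for _ in range(iterations):
        for address in range(footprint):
            slot = address % cache_lines
            accesses += 1
            if cache[slot] == address:
                hits += 1              # instruction found in the cache
            else:
                cache[slot] = address  # miss: fetch from main memory
    return hits, accesses

large = simulate_icache(footprint=12, cache_lines=8, iterations=10)
small = simulate_icache(footprint=8, cache_lines=8, iterations=10)
print("12-instruction loop:", large)   # (36, 120)
print(" 8-instruction loop:", small)   # (72, 80)
```

Under these assumptions the 12-instruction loop does not fit in the 8-line cache and hits on 36 of 120 fetches, while the 8-instruction loop fits entirely and hits on 72 of 80 fetches.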
In a pipeline implementation, the bottleneck tends to be feeding an execution unit (the fetch and decode stages feed the execution unit) rather than executing the instructions themselves (which occurs in the execution stage). If two or more instructions are packed into the storage space of a single instruction, then multiple instructions can be fetched and decoded in the time that it takes to fetch and decode a single instruction. The execution unit is thus fed at a faster rate, which improves processor performance.
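The benefit of packing can be sketched by counting how many fetch slots an instruction stream occupies. The encoding here is hypothetical: each instruction is either half-width (size 1, so two adjacent ones can share a slot) or full-width (size 2), and packing is done greedily on adjacent pairs.

```python
def fetch_slots(sizes):
    """Return the number of fetch slots needed for an instruction stream.

    `sizes` lists each instruction's width: 1 = half-width (packable),
    2 = full-width. Two adjacent half-width instructions share one slot.
    This encoding is an assumption made for illustration.
    """
    slots = 0
    i = 0
    while i < len(sizes):
        if sizes[i] == 1 and i + 1 < len(sizes) and sizes[i + 1] == 1:
            i += 2   # two instructions fetched and decoded together
        else:
            i += 1   # one instruction occupies the slot alone
        slots += 1
    return slots

stream = [1, 1, 2, 1, 1, 1]
unpacked = len(stream)        # one slot per instruction without packing
packed = fetch_slots(stream)
print(unpacked, "slots unpacked,", packed, "slots packed")
```

For this stream the six instructions need four fetch slots instead of six, so the fetch and decode stages deliver the same work to the execution unit in fewer steps.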
A clock cycle determines how quickly the processor can execute instructions and is used to synchronize the activities of various components of a computer system. The length of the clock cycle is determined by the time required for the slowest instruction to execute. Typically, the execution unit (in the execution stage) executes one instruction per clock cycle (i.e., performs one operation per clock cycle). However, because the clock cycle is tailored to the slowest instruction, many instructions finish executing long before the clock cycle completes. As a result, one instruction performing two operations, or two instructions each performing only one operation, may be executed in one clock cycle if a specialized execution unit is available that can execute both operations simultaneously. If the specialized execution unit is employed, then upon decoding one or more instructions that can benefit from it, those instructions can be tagged for execution on the specialized execution unit.
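The reasoning above can be made concrete with a small worked example. The operation names, the latencies in picoseconds, and the choice of which pair the specialized execution unit can handle are all invented for illustration and do not describe any real processor.

```python
def clock_period(latencies):
    """The clock cycle must be long enough for the slowest operation."""
    return max(latencies.values())

def cycles_needed(ops, fusable_pairs):
    """Count cycles when adjacent ops in `fusable_pairs` share one cycle."""
    cycles = 0
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and (ops[i], ops[i + 1]) in fusable_pairs:
            i += 2   # both operations run on the specialized unit
        else:
            i += 1   # one operation per cycle on the ordinary unit
        cycles += 1
    return cycles

# Assumed latencies in picoseconds (illustrative values only).
latencies_ps = {"add": 300, "compare": 300, "branch": 400, "multiply": 1000}
period_ps = clock_period(latencies_ps)
idle_ps = {op: period_ps - t for op, t in latencies_ps.items()}

ops = ["compare", "branch", "add", "multiply"]
baseline = cycles_needed(ops, set())
paired = cycles_needed(ops, {("compare", "branch")})
print(period_ps, idle_ps["add"], baseline, paired)
```

With these numbers the cycle is 1000 ps, an add sits idle for 700 ps of it, and pairing the compare with the branch reduces the sequence from four cycles to three.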
For the foregoing reasons, there is a need to combine instructions whenever possible in order to minimize the program size and thus improve processor performance and program execution time. There is also a need for a specialized execution unit that can process two operations in one clock cycle.