1. Field of the Present Invention
The present invention relates to data processing systems, and more specifically, to methods and systems that improve the performance of the data processing system.
2. History of Related Art
The widespread use of computers by businesses and individual users has resulted in an ever-increasing demand for better and faster computer technology. One means of increasing the speed and efficiency of a computer system is cache memory. In general, cache memory is a small, fast memory that holds the most frequently used data, and it is based upon the principle of locality of reference.
Computer systems using cache memory have increased their overall efficiency. However, retrieving information from the cache via a load instruction requires at least one pipeline cycle. Further, depending upon the pipeline configuration, the loaded data may not be available for use by a following instruction until at least one more cycle after the cache access. Thus, regardless of the pipeline configuration used, at least one cycle is required before the value of a load can be accessed and used by subsequent instructions.
Consequently, the execution of both the load and the subsequent dependent instruction(s) during the same cycle is not possible. This restriction can become a performance bottleneck in multiple issue and execute machines, such as superscalar machines, that attempt to execute several independent instructions within the same cycle via multiple functional units.
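The restriction described above can be illustrated with a toy in-order issue model. This is only a sketch under stated assumptions, not a description of any particular machine: the function name `count_cycles`, the tuple encoding of instructions, the register names, and the rule that every non-load result becomes usable one cycle after issue are all hypothetical choices made for illustration.

```python
def count_cycles(program, issue_width=2, load_latency=1):
    """Count issue cycles for an in-order machine (toy model, assumed rules).

    program: list of (op, dest, srcs) tuples, e.g. ("load", "r1", ["r9"]).
    A result written by an instruction issued in cycle c becomes usable in
    cycle c + latency, so a dependent instruction can never issue in the
    same cycle as the load that feeds it.
    """
    ready = {}  # register name -> first cycle in which its value is usable
    cycle = 0
    i = 0
    while i < len(program):
        issued = 0
        while i < len(program) and issued < issue_width:
            op, dest, srcs = program[i]
            # In-order issue: stop at the first instruction whose source
            # operands are not yet available (the load use interlock).
            if any(ready.get(src, 0) > cycle for src in srcs):
                break
            latency = load_latency if op == "load" else 1
            ready[dest] = cycle + latency
            i += 1
            issued += 1
        cycle += 1
    return cycle


# Hypothetical two-instruction programs: the add either uses the load result
# (r1) or is independent of it.
dependent = [("load", "r1", ["r9"]), ("add", "r2", ["r1", "r3"])]
independent = [("load", "r1", ["r9"]), ("add", "r2", ["r3", "r4"])]

print(count_cycles(independent))                # both issue together
print(count_cycles(dependent))                  # add waits one cycle
print(count_cycles(dependent, load_latency=2))  # add waits two cycles
```

In this model the independent pair issues in a single cycle, while the dependent pair is serialized, and the serialization grows when the loaded value needs an additional cycle before it can be used.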
The potential for the bottleneck becomes more apparent upon the realization that a typical program has a distribution of instructions on the order of 20% loads, with about 50% of those loads followed by a subsequent instruction that depends upon the result of the load (load use interlock). For example, a superscalar machine that can issue and execute two instructions every cycle, i.e., an ideal CPI (cycles per instruction) of 0.5 with an infinite cache, would incur a performance bottleneck via the load serialization of (0.5 + 0.2 × 0.5 × 1) ÷ 0.5 = 1.2 times. If the result of the load instruction is not available for use by a subsequent dependent instruction until another cycle, then the bottleneck can be as much as (0.5 + 0.2 × 0.5 × 2) ÷ 0.5 = 1.4 times. In other words, the serial execution of a load instruction with the subsequent dependent instruction can make the above noted superscalar machine execute 20 to 40% slower. The above example assumes that no other independent instruction(s) could have been scheduled in place of the interlocked instruction.
Techniques such as code rescheduling have been developed to reduce such bottlenecks by 20 to 50%. Even with the use of such techniques, however, the bottleneck is still significant. Specifically, if an extra cycle is needed for a load use interlock, then the delay is on the order of (0.5 + 0.2 × 0.2 × 2) ÷ 0.5 = 1.16, or approximately 1.2 times (about 20%); if no extra cycle is needed, the delay is on the order of (0.5 + 0.2 × 0.2 × 1) ÷ 0.5 = 1.08, or approximately 1.1 times.
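The slowdown figures in the two preceding paragraphs all follow one pattern: the interlock penalty is added to the ideal CPI and the sum is divided by the ideal CPI. The sketch below reproduces that arithmetic. The function and parameter names are hypothetical, and reading the second 0.2 in the rescheduled case as the fraction of loads still followed by a dependent instruction is an assumption drawn from the form of the expressions.

```python
def slowdown(ideal_cpi, load_fraction, interlock_fraction, penalty_cycles):
    """Return effective CPI divided by ideal CPI.

    ideal_cpi:          cycles per instruction with no interlock stalls
    load_fraction:      fraction of instructions that are loads
    interlock_fraction: fraction of loads followed by a dependent instruction
                        (assumed interpretation of the text)
    penalty_cycles:     stall cycles charged per load use interlock
    """
    effective_cpi = ideal_cpi + load_fraction * interlock_fraction * penalty_cycles
    return effective_cpi / ideal_cpi


# (label, interlock fraction, penalty cycles) for the four cases in the text.
cases = [
    ("no rescheduling, 1-cycle penalty", 0.5, 1),
    ("no rescheduling, 2-cycle penalty", 0.5, 2),
    ("with rescheduling, 2-cycle penalty", 0.2, 2),
    ("with rescheduling, 1-cycle penalty", 0.2, 1),
]

for label, interlock_fraction, penalty in cases:
    factor = slowdown(0.5, 0.2, interlock_fraction, penalty)
    print(f"{label}: {factor:.2f} times")
```

The last two cases evaluate to 1.16 and 1.08, which the text rounds to 1.2 and 1.1 respectively.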
It would, therefore, be a distinct advantage to have a method and apparatus for reducing the cycle times associated with load use interlock. The present invention provides such an apparatus and method.