1. Technical Field
The present invention relates in general to data processing systems and in particular to a method and system for ordering load instructions. Still more particularly, the present invention relates to a method and system for implementing just-in-time delivery of data requested by load instructions.
2. Description of the Related Art
In conventional symmetric multiprocessor (SMP) data processing systems, all of the processors are generally identical. The processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system may comprise a system memory, a plurality of processing elements that each include a processor and one or more levels of cache memory and a system bus coupling the processing elements to each other and to the system memory.
Processors in conventional SMP data processing systems have a number of execution units. Superscalar processors typically have more than one of each execution unit, for example two floating point units (FPUs), two fixed point units (FXUs) and two load/store units (LSUs). The processors are designed for high frequency, and their corresponding internal caches are typically very small in order to operate with the high frequency processor. In part due to their relatively small size, these internal caches sustain a large number of cache misses during requests for data. Data is thus stored in lower level (L2) caches to maximize processing speed. The processors typically send multiple load requests simultaneously or within close proximity to each other. This is particularly true in superscalar processors with multiple LSUs.
Traditionally, processors execute program instructions in order. With state-of-the-art processors, out-of-order execution of instructions is often employed to maximize the utilization of execution unit resources within the processor, thereby enhancing overall processor efficiency. Further, in these state-of-the-art processors that support out-of-order execution of instructions, instructions may be dispatched out of program order, executed opportunistically within the execution units of the processor, and completed in program order. The performance enhancement resulting from out-of-order execution is maximized when implemented within a superscalar processor having multiple execution units capable of executing multiple instructions concurrently.
Processors today often run numerous cycles ahead of the instruction stream of the program being executed. Also, on these processors, load instructions are issued as early as possible in order to “hide” the cache access latencies and thus allow ensuing dependent load instructions to execute with minimal delay. Additionally, compilers separate load instructions from their data dependency. These techniques lead to requests for data which may not be required immediately. Finally, an L2 cache has a limited number of wired connections for returning data. When data is sent prior to the time it is required, it utilizes valuable wired cache line resources which may be required for more immediate or important data requests.
In the prior art, load instructions may be issued out of order. Oftentimes this results in a load queue occupying valuable cache line resources or register space for many cycles before the requested data is utilized by the program. When a large number of load instructions are present, this loads down the critical cache and queue resources, resulting in less efficient processing.
When the data cache is “bombarded” with load requests, no ordering information is included. The data cache is oblivious as to which load instruction to process and in which order. In traditional processors, ordering information is typically implied based on a “First Come First Serve” prioritization scheme. However, data is often not required by the processor or program at the time, or in the order, it is requested.
Thus, many hardware and software limitations exist in the current method of loading data from a data cache. A more efficient means of loading data from a data cache is needed. A processor should be able to issue its data requests so that the data cache can deliver the data only when it is actually required.
It would therefore be desirable to provide a method and system for improving the efficiency of load instruction processing and subsequent loading of data. It would further be desirable to provide a method and system which allows for just-in-time delivery and/or time-ordered delivery of data during execution of an instruction set, thus allowing data to be loaded from a data cache at the time it is needed within the program execution stream.
It is therefore one object of the present invention to provide an improved data processing system.
It is another object of the present invention to provide an improved method and system for efficiently managing multiple load requests to a data cache.
It is yet another object of the present invention to provide a method and system for implementing just-in-time delivery of data requested by load instructions.
The foregoing objects are achieved as is now described. A system for time-ordered execution of load instructions is disclosed. More specifically, the system enables just-in-time delivery of data requested by a load instruction. The system consists of a processor, an L1 data cache with a corresponding L1 cache controller, and an instruction processor. The instruction processor manipulates an architected Time Dependency Field (TDF) of a load instruction to create a Distance of Dependency (DoD) bit field. The DoD bit field holds a relative dependency value which is utilized to order the load instruction in a Relative Time-Ordered Queue (RTOQ) of the L1 cache controller. The load instruction is sent from the RTOQ to the L1 data cache at a particular time so that the requested data is loaded from the L1 data cache at the time specified by the DoD bit field. In the preferred embodiment, an acknowledgement is sent to the processing unit when the time specified is available in the RTOQ.
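The ordering behavior described above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the disclosed hardware: it assumes the DoD value is a count of cycles relative to the cycle in which the load is issued, it ignores cache access latency and the acknowledgement step, and the names LoadRequest, RelativeTimeOrderedQueue, release_due and simulate are hypothetical, chosen for illustration only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class LoadRequest:
    # Only due_cycle and seq take part in ordering inside the queue.
    due_cycle: int                        # issue cycle + DoD (assumed meaning)
    seq: int                              # issue order, breaks ties
    address: int = field(compare=False)   # address requested by the load


class RelativeTimeOrderedQueue:
    """Hypothetical software model of an RTOQ, ordered by due cycle."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def __len__(self):
        return len(self._heap)

    def enqueue(self, address, dod, issue_cycle):
        # Assumption: DoD is a distance in cycles from the issue cycle.
        request = LoadRequest(issue_cycle + dod, self._seq, address)
        self._seq += 1
        heapq.heappush(self._heap, request)
        return request

    def release_due(self, cycle):
        # Pop every request whose due cycle has been reached.
        released = []
        while self._heap and self._heap[0].due_cycle <= cycle:
            released.append(heapq.heappop(self._heap))
        return released


def simulate(loads):
    """loads: iterable of (issue_cycle, address, dod) tuples.

    Returns a list of (cycle, address) pairs in the order the modeled
    queue would forward the loads to the data cache.
    """
    pending = sorted(loads)
    rtoq = RelativeTimeOrderedQueue()
    deliveries = []
    cycle = 0
    index = 0
    while index < len(pending) or len(rtoq) > 0:
        # Accept every load issued up to and including this cycle.
        while index < len(pending) and pending[index][0] <= cycle:
            issue_cycle, address, dod = pending[index]
            rtoq.enqueue(address, dod, issue_cycle)
            index += 1
        # Forward the loads whose requested time has arrived.
        for request in rtoq.release_due(cycle):
            deliveries.append((cycle, request.address))
        cycle += 1
    return deliveries
```

In this model, a load issued at cycle 0 with a DoD of 5 is forwarded after a load issued at cycle 1 with a DoD of 2, which is the reverse of a First Come First Serve ordering. A priority heap is used here purely for convenience; how the queue is realized in hardware is not implied by this sketch.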
The above as well as additional objects, features, and advantages of an illustrative embodiment will become apparent in the following detailed written description.