Conventionally, microprocessor and memory system applications retrieve data from relatively low bandwidth I/O devices, process the data with a processor, and then store the processed data to the low bandwidth I/O devices. In typical microprocessor memory system applications, a processor and a cache are directly coupled by a bus to a plurality of peripherals such as I/O devices and memory devices. However, the latency between the I/O devices and the processor may cause the processor to stall and sit idle unnecessarily, so that excessive time is required to complete tasks.
In known processing systems, when an operation is performed on one of the peripherals such as a memory device, the time between performing that operation and subsequent operations depends upon the latency period of the memory device. As a result, the processor is stalled for the entire duration of the memory transaction. One solution for improving the processing speed of known processing systems is to perform additional operations during the latency period, as long as the additional operations have no load-use dependency on the pending operation. For example, if data is loaded into a first register by a first operation (1), where the first operation (1) corresponds to:
load r1 ← [r2] (1)
and the contents of the first register are added to another register by a second operation (2), where the second operation (2) corresponds to:
add r3 ← r1 + r4 (2)
then the operation (2) is load-use dependent and must wait for the latency period before being performed. FIGS. 1(a) and 1(b) illustrate a load-use dependent operation in which the operation (2) cannot be initiated until the operation (1) is completed. In FIG. 1(a), operation (2) waits for the short latency period t₁ of a fast memory; in FIG. 1(b), it waits for the long latency period t₂ of a slower memory.
A non-blocking cache and a non-blocking processor are known in which a load operation is performed and additional operations other than loads may subsequently be performed during the latency period, as long as those operations are not dependent upon the initial load operation. FIGS. 2(a) and 2(b) illustrate such operations. In operation (1), the first register is loaded. Next, operations (1.1) and (1.2) are to be executed. As long as operations (1.1) and (1.2) are neither load-dependent nor loads themselves, these operations may be performed during the latency period t₂ as illustrated in FIG. 2(a). However, if operation (1.1) is either load-dependent or a pending load, operations (1.1) and (1.2) must wait until the latency period t₂ ends before being performed.
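The difference between a blocking and a non-blocking load can be made concrete with a small timing model. The following is only a sketch under stated assumptions: it models in-order issue of one operation per cycle with at most one outstanding load, and the `Op` class, the `issue_times` function, the `mul` and `sub` instructions chosen for operations (1.1) and (1.2), and the latency of 10 cycles are all hypothetical and are not taken from any described system. Write-after-write hazards on the pending register are ignored for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    dest: str
    srcs: tuple
    is_load: bool = False

def issue_times(ops, load_latency, non_blocking):
    """Return {op name: issue cycle} for in-order, one-op-per-cycle issue."""
    cycle = 0
    pending_reg = None  # register still waiting for load data, if any
    ready_at = 0        # cycle at which the pending load completes
    times = {}
    for op in ops:
        if pending_reg is not None:
            # A blocking processor always waits. A non-blocking one waits only
            # if the op is itself a load or reads the register being loaded.
            must_wait = (not non_blocking) or op.is_load or pending_reg in op.srcs
            if must_wait:
                cycle = max(cycle, ready_at)
            if cycle >= ready_at:
                pending_reg = None
        times[op.name] = cycle
        if op.is_load:
            pending_reg = op.dest
            ready_at = cycle + load_latency
        cycle += 1
    return times

program = [
    Op("(1)", "r1", ("r2",), is_load=True),  # load r1 <- [r2]
    Op("(1.1)", "r5", ("r6", "r7")),         # hypothetical: mul r5 <- r6 * r7
    Op("(1.2)", "r8", ("r5", "r9")),         # hypothetical: sub r8 <- r5 - r9
    Op("(2)", "r3", ("r1", "r4")),           # add r3 <- r1 + r4 (load-use dependent)
]

blocking = issue_times(program, load_latency=10, non_blocking=False)
overlapped = issue_times(program, load_latency=10, non_blocking=True)
print(blocking)    # {'(1)': 0, '(1.1)': 10, '(1.2)': 11, '(2)': 12}
print(overlapped)  # {'(1)': 0, '(1.1)': 1, '(1.2)': 2, '(2)': 10}
```

In this model the independent operations (1.1) and (1.2) issue during the latency period when the load is non-blocking, while the load-use dependent operation (2) waits for the load data in both cases.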
Also known is a Stall-On-Use (Hit Under Miss) operation for achieving cache miss optimizations as described in "A 200 MFLOP Precision Architecture Processor" at Hot Chips IV, 1993, William Jaffe et al. In this Hit Under Miss operation, when one miss is outstanding, only certain other types of instructions may be executed before the system stalls. For example, during the handling of a load miss, execution proceeds until the target register is needed as an operand for another instruction or until another load miss occurs. However, this system cannot handle two outstanding misses at the same time. For a store miss, execution proceeds until a load or sub-word store occurs to the missing line.
This Hit Under Miss feature can improve the runtime performance of general-purpose computing applications. Examples of programs that benefit from the Hit Under Miss feature are SPEC benchmarks, SPICE circuit simulators and gcc C compilers. However, the Hit Under Miss feature does not sufficiently meet the high I/O bandwidth requirements for digital signal processing applications such as digital video, audio and RF processing.
Known microprocessor and memory system applications use real-time processes, which are programs having deadlines by which data processing must be completed. For example, an audio waveform device driver process must supply audio samples at regular intervals to the output buffers of the audio device. When the driver software is late in delivering data, the generated audio may be interrupted by objectionable noises caused by an output buffer underflowing.
Analyzing whether a real-time process can meet its deadlines under all conditions requires that the worst-case performance of the real-time processing program be predictable. However, the sensitivity of the real-time processing program to its input data or its environment makes it impractical in many cases to exhaustively check the behavior of the process under all conditions. Therefore, the programmer must rely on some combination of analysis and empirical tests to verify that the real-time process will complete in the requisite time. The goals of real-time processing tend to be incompatible with computing platforms that have memory or peripheral systems in which the latency of the transactions is unpredictable, because either an analysis of whether the real-time deadlines can be met is not possible or worst-case assumptions about memory performance are required. For example, performance estimates can be made by assuming that every memory transaction takes the maximum possible time. However, such an assumption may be so pessimistic that no useful estimate of the upper bound on the execution time of a real-time task can be made. Furthermore, even if the estimates are only slightly pessimistic, overly conservative decisions will be made about the hardware performance requirements, resulting in a system that is more expensive than necessary.
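The worst-case assumption described above amounts to simple arithmetic, sketched below. The example is illustrative only: the function name `estimated_cycles`, the cycle counts, the latencies and the deadline are hypothetical values chosen for the example and do not describe any particular system.

```python
def estimated_cycles(compute_cycles, memory_transactions, latency):
    """Execution-time estimate if every memory transaction takes `latency` cycles."""
    return compute_cycles + memory_transactions * latency

# Hypothetical task: 1000 cycles of computation and 200 memory transactions.
DEADLINE = 5000  # assumed deadline, in cycles

# Bound obtained by charging every transaction the assumed maximum latency.
worst_case = estimated_cycles(1000, 200, latency=50)
# Estimate obtained with an assumed common-case latency.
typical = estimated_cycles(1000, 200, latency=5)

print(worst_case, worst_case <= DEADLINE)  # 11000 False
print(typical, typical <= DEADLINE)        # 2000 True
```

With these assumed numbers the task would normally finish well within the deadline, yet the worst-case bound cannot show that it does, so the designer would be pushed toward faster and more expensive hardware.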
Also, it is especially difficult to reliably predict real-time processing performance on known multiprocessors because the memory and peripherals are not typically multi-ported. Therefore, simultaneous accesses by two or more processors to the same memory device must be serialized. Even if a device is capable of handling multiple transactions in parallel, the bus shared by all of the processors may still serialize the transactions to some degree.
If memory requests are handled in a FIFO manner by a known multiprocessor, a memory transaction which arrives slightly later than another memory transaction may take much longer to complete, since the later arriving memory request must wait until the earlier memory request is serviced. Due to this sensitivity, very small changes in the memory access patterns of a program can cause large changes in its performance. This situation grows worse as more processors share the same memory. For example, if ten processors attempt to access the same remote memory location simultaneously, the spread in memory latency among the processors might be 10:1 because as many as nine of these memory transactions may be buffered for later handling. In general, it is not possible to predict which of these processors will suffer the higher latencies and which of these processors will receive fast replies to their memory accesses. Very small changes to a program or its input data may cause the program to exhibit slight operation differences which perturb the timing of the memory transactions.
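The 10:1 spread in the example above follows directly from FIFO queueing, as the sketch below illustrates. It assumes a single memory port that services one request at a time with a fixed service time; the function name `fifo_latencies` and the service time of one time unit are hypothetical choices made for the illustration.

```python
def fifo_latencies(arrival_times, service_time):
    """Latency seen by each request when one port serves requests in FIFO order."""
    free_at = 0  # time at which the memory port next becomes free
    latencies = []
    for arrival in sorted(arrival_times):
        start = max(arrival, free_at)     # wait for all earlier requests
        free_at = start + service_time
        latencies.append(free_at - arrival)
    return latencies

# Ten processors issue a request to the same memory at the same instant.
latencies = fifo_latencies([0] * 10, service_time=1)
print(latencies)                        # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(max(latencies) / min(latencies))  # 10.0
```

The first request serviced sees only its own service time, while the last one also waits for the nine requests buffered ahead of it. Which processor ends up last depends only on the arrival order.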
Furthermore, types of memory which exhibit locality effects may exacerbate the above-described situation. For example, accesses to DRAMs are approximately two times faster if executed in page mode. To use page mode, a recent access must have been made to an address in the same memory segment (page). One of the most common access patterns is sequential access to consecutive locations in memory. Such access patterns tend to achieve high page locality, and thus high throughput and low latency. Known programs which attempt to take advantage of the benefits of page mode may be thwarted when a second program executing on another processor is allowed to interpose memory transactions on a different memory page. For instance, if ten processors, each with its own sequential memory access pattern, attempt to access the same DRAM bank simultaneously and each of the accesses is to a different memory page, the spread in memory latencies between the fastest and slowest responses might be more than 25:1.
The present invention is directed to allowing a high rate of transfer to memory and I/O devices for tasks which have real-time requirements. The present invention is also directed to allowing the system to buffer I/O requests from several processors within a multiprocessor at once with a non-blocking load buffer. Furthermore, the present invention is directed to extending the basic non-blocking load buffer to service a data processing system running real-time processes of varying deadlines by using scheduling of memory and peripheral accesses which is not strictly FIFO scheduling.