1. Field of Use
The present invention relates to data processing systems and more particularly to management of the memory subsystem of such systems through processor enhancements.
2. Related Art
A data processing system comprises, in general, a processor, a memory subsystem and an I/O (input/output) subsystem. The instructions describing the required behavior of the data processing system, together with various working values, are stored in the memory subsystem, and the system operates by the continual action of the processor in fetching instructions in an orderly manner from the memory subsystem and performing the operations specified by them, reading data values from memory and writing them back in accordance with the instructions' requirements.
The overall performance of such a system depends upon a number of aspects. Among the most important are the rate at which the processor can execute instructions and the work done by each instruction. All other things being equal, the more work defined by each instruction and the more frequent the execution of the instructions, the more work that is done per unit time.
The rate at which the processor can execute instructions is the lesser of two values: the rate at which the processor could execute the instructions were the memory subsystem to take no time to deliver the values (whether instructions or data) requested by the processor, and the rate at which the memory subsystem can actually deliver the values. It has long been the case that it is straightforward to build a processor which can execute useful instructions much faster than an economical memory subsystem can manage to deliver them. To mitigate the effects of this throttling of performance by the memory subsystem, computers have long provided private local memories, called caches, for the processor in which copies of recently-used data values and instructions are held and which offer performance matched to the needs of the processors. In a system employing caches, it is expected that the majority of requests for instructions or data that the processor makes of the memory subsystem will be supplied by the cache(s), with only some reasonably small percentage of requests actually having to go to the `real` memory subsystem. In such a circumstance, the processor will be able to execute instructions at close to its maximum capability a large proportion of the time.
When an instruction access `misses` in the cache (i.e., must go to the real memory subsystem for the instruction), a conventional processor has no choice but to wait, doing no work, until the memory subsystem delivers the instruction. The time to do this can be sufficient for the processor to have executed several hundred instructions. When a request to read data `misses` in the cache, the processor is often able to continue work at least for a while, if the program was organized in such a way as to request or prefetch the data some time before it was needed. Except in the case of very regular programs (i.e., programs which have well-defined and well-understood patterns of access to memory, such as those which do most of their work operating on matrices and other regular structures), it is unusual for it to be possible to arrange for there to be several hundred instructions between the request and the use of the data. When a request to write data misses in the cache, there can often be little impact on performance, since the processor may copy the value into a buffer for writing into memory at its own pace and so other work may continue in parallel.
To show the effects of cache misses on performance, suppose that the cache on a hit can supply data immediately, but that the memory takes a time equivalent to that needed for the execution of 100 instructions which hit in the caches. Then, if all data accesses hit, but just 1% of the instructions miss in the cache, the processor will execute programs at about half its capability (i.e., on average, each 100 instructions executed will have the first 99 instructions executed at peak rate, while the hundredth instruction will take another 100 instruction times).
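The arithmetic of this example can be reproduced with a minimal sketch. The function name `relative_throughput` and the cost model (one time unit per instruction that hits, plus a fixed stall added for each miss) are assumptions made for illustration only; they are not part of any described apparatus.

```python
def relative_throughput(miss_rate, miss_penalty):
    """Return the fraction of the peak instruction rate actually achieved.

    Assumed model (illustrative only): every instruction takes one time
    unit when it hits in the cache, and each miss adds `miss_penalty`
    further time units during which the processor does no work.
    """
    average_time_per_instruction = 1.0 + miss_rate * miss_penalty
    return 1.0 / average_time_per_instruction


if __name__ == "__main__":
    # 1% of instructions miss; memory costs 100 instruction times.
    print(relative_throughput(0.01, 100))
```

With a miss rate of 1% and a penalty of 100 instruction times, the sketch gives one half of peak capability, in line with the example above. Under the same assumed model, miss rates of roughly 4% to 9% would reduce throughput to about one fifth to one tenth of peak.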
There are some workloads, or programs, which have an unfortunate pattern of memory utilization such that their `hit rate` (i.e., the percentage of memory accesses which can be satisfied from an economically-sized cache) is rather low. Because of the effects described above, this can sharply reduce the effective performance of the computer system, reducing it below its peak capability by a factor of 5 to 10.
An example of such a workload is that offered by commercial data-processing workloads typified by database/transactional processing. A further example of such a workload is that offered by current and future multimedia applications, in which large amounts of data are manipulated. A final example is that offered by traditional numerically-intensive computational workloads, which also have very large amounts of data to be manipulated.
To improve the performance of the system under such circumstances, it would be necessary to have other work for the system to perform, and to have the system detect that a cache miss had occurred and cause it to perform the other work when this happens. A system organized as an SMP (Symmetrical Multi-Processor) is constructed to take advantage of such a situation. An SMP will generally have more pieces of work outstanding or in progress than there are available processors, and as each piece of work (or `process`) encounters some blocking event (such as needing data from a disk) the associated operating system arranges to suspend the execution of the blocked process in favor of another process which is not blocked. In this manner, the system can make use of its computational resources fairly efficiently in the presence of extremely long-latency operations like disk I/O.
A generalization of this scheme allows the operating system to be notified when other blocking events occur, such as a load instruction missing in the cache. Provided that a processor can cease execution of the current process and reactivate an available process in less time than it takes the memory system to provide the requested data, this can also increase resource utilization. In computer systems which embody this approach, it is usual to provide some fast mechanism for swapping `out` the current `state` of the processor (the values of all its registers, etc.) and swapping `in` the suspended set. This may be done by providing the processor with multiple copies of the necessary resources, and selecting between them; or by providing the processor with access to some very fast private memory in which it can keep copies of the needed resources and swap between sets by performing high speed dump and restore operations. A processor with such a capability is often referred to as `multi-threaded`. Even with a multi-threaded processor, context switching can be quite expensive since there may be many registers to save and restore.
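The condition stated above, that swapping must take less time than the memory access it hides, can be illustrated with a simple sketch. All names here (`utilization_when_stalling`, `utilization_when_switching`, `switching_pays_off`) and the model itself are hypothetical: the sketch assumes a fixed number of instructions executed between misses and assumes that another ready process is always available to run.

```python
def utilization_when_stalling(run_length, memory_latency):
    """Fraction of time spent on useful work if the processor simply
    waits for memory after every `run_length` instruction times."""
    return run_length / (run_length + memory_latency)


def utilization_when_switching(run_length, switch_cost):
    """Fraction of time spent on useful work if, on every miss, the
    processor pays `switch_cost` to resume another ready process.

    Assumption: a ready process always exists, so the memory latency
    itself is fully hidden and only the switch cost is lost.
    """
    return run_length / (run_length + switch_cost)


def switching_pays_off(switch_cost, memory_latency):
    """Swapping processes helps only when it is cheaper than waiting."""
    return switch_cost < memory_latency


if __name__ == "__main__":
    # Illustrative numbers only: 100 instruction times between misses,
    # a 100-unit memory latency, and a 20-unit state swap.
    print(utilization_when_stalling(100, 100))
    print(utilization_when_switching(100, 20))
    print(switching_pays_off(20, 100))
```

In this simplified model the benefit shrinks as the cost of saving and restoring state approaches the memory latency, and disappears once the two are equal.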
While these schemes can provide performance advantages, the process-swapping is always an unexpected event to the processor and is invisible to the software. Therefore, it is most desirable to avoid the overhead costs of process swapping.
Accordingly, it is the primary object of the present invention to provide an efficient method and mechanism for enhancing the overall performance of a processor through the ability to do useful work in parallel with long-latency main memory accesses.
It is a further object of the present invention to provide an efficient method and mechanism which can be easily incorporated or added to current microprocessor architectures.