The present invention relates to a computer system incorporating main memory, one or more cache subsystems, and one or more processors.
A substantial part of the work carried out by a typical computer system consists of copying data from one part of memory to another. This can be very time-consuming, as the CPU must issue a memory fetch from the source area, wait for the data to arrive, and then store the data to the destination memory area.
To alleviate some of the delays, cache subsystems have been added to many computers, so that when the CPU issues a memory fetch, the data may be found and retrieved from the faster, more local, cache memory rather than waiting for it to be fetched from main memory. Similarly, the cache subsystem can quickly accept data from the CPU and then write it back to main memory at a slower rate, so that the CPU need not wait for the completion of the transfer.
However, cache memory is relatively expensive, so in a typical system it can only hold a small portion of the main memory data, and cache misses can come to dominate processing, especially when copying large amounts of data.
Some CPUs attempt to reduce the effective delay, and the number of cycles wasted while waiting for data, by separating the fetch and use (store) phases in time. The separation may be explicit (as in most RISC architectures), with separate “load” and “store” instructions between which the compiler or programmer can insert a suitable number of independent instructions to fill the time until the loaded data may be referenced. It may instead be implicit, with the CPU running code from a different thread or dispatching instructions out of order, so that instructions that do not depend on the loaded data may proceed even though instructions that do depend on it are blocked. Some CPUs combine these approaches.
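By way of illustration only, explicit load-use separation in a copy loop might be sketched in C as follows. The function name copy_pipelined is hypothetical, and an optimizing compiler or out-of-order CPU may reorder these accesses in any case, so the sketch shows the intent rather than a guaranteed schedule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (hypothetical name): copy n 64-bit words, issuing
 * each load one iteration before the store that consumes it, so the
 * store of word i-1 can overlap the latency of loading word i. */
void copy_pipelined(uint64_t *dst, const uint64_t *src, size_t n)
{
    if (n == 0) {
        return;
    }
    assert(dst != NULL && src != NULL);

    uint64_t in_flight = src[0];      /* first load, issued before the loop */
    for (size_t i = 1; i < n; i++) {
        uint64_t next = src[i];       /* start the load for the next store */
        dst[i - 1] = in_flight;       /* store data loaded one iteration ago */
        in_flight = next;
    }
    dst[n - 1] = in_flight;           /* drain the last loaded word */
}
```

Each additional word kept in flight occupies another register, which is one way in which this technique is limited by CPU resources.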
A CPU may also provide a “prefetch” operation which tells the cache subsystem to start loading the referenced data (if not already cached) in anticipation of it being used in the near future, again reducing the latency when the data is referenced.
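As a minimal sketch of how such a prefetch operation might be used from C, the following loop relies on the __builtin_prefetch built-in provided by the GCC and Clang compilers. The function name copy_with_prefetch and the prefetch distance PREFETCH_AHEAD are assumptions chosen for illustration, not values taken from any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance, in 64-bit words (8 words = 64 bytes). */
#define PREFETCH_AHEAD 8

/* Illustrative sketch: copy n words while hinting that source data a
 * fixed distance ahead will be read soon. The hint is advisory; the
 * copy is correct whether or not the hardware acts on it. */
void copy_with_prefetch(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Second argument 0 = prefetch for read; third argument 0 =
             * low temporal locality (data is used once, then dropped). */
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);
        }
        dst[i] = src[i];
    }
}
```

For simplicity the sketch issues a hint on every iteration; a tuned version would typically issue one hint per cache line. A suitable distance depends on memory latency and cache organization, and for small blocks the copy may finish before any prefetched data arrives.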
Operations that move more than a single unit of data per instruction may also be provided (e.g. an UltraSPARC™ has “block load” and “block store” instructions that respectively load 64 bytes of data into, or store 64 bytes of data from, 8 consecutive floating-point registers).
However, none of these approaches is truly satisfactory. Prefetching is often difficult to use to advantage, especially for relatively small blocks and with direct-mapped caches. Load-use separation and block loads and stores are limited by the availability of critical CPU resources (e.g. registers or execution units) and by the scheduling capabilities of the programmer, compiler, and CPU.
A different approach is to provide a separate dedicated DMA engine that can perform a copy operation independently, while the CPU continues to execute other instructions. This can provide a benefit in systems where large blocks of data are frequently copied, but it too has significant disadvantages. For example, it requires a fair amount of extra circuitry, either in a separate chip or as an additional functional unit within one of the existing chips, and therefore increases system cost. For each transfer, the DMA engine typically has to be programmed explicitly with source, destination, and size, and so the start-up overhead can be quite high. Also, a DMA engine can typically only use physical addressing, so in a system where the CPU normally works with virtual addresses, there can be extra overhead as the source and destination addresses for each copy must be translated, and a single (logical) block copy may have to be broken up into several physical block copies if the virtual addresses do not map to contiguous physical memory. Perhaps most inconvenient of all, a DMA engine requires some form of synchronization at the completion of each copy, so that the CPU does not try to use data in the destination block before the DMA engine has written it. Whether polling or interrupts are used, this synchronization represents a significant overhead, and it is a possible source of programming errors if the synchronization rules are violated. These overheads in cost and time typically mean that it is only worth using a DMA engine in systems where a large proportion of the workload consists solely of copying whole, aligned, and contiguous pages of data.
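The address-translation overhead described above can be sketched in C as follows. Everything in this sketch is an assumption made for illustration: the 4096-byte page size, the flat page_table array indexed by virtual page number, the dma_chunk structure, and the function name split_for_dma. It does not describe any particular DMA engine or operating system interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size in bytes */

/* Hypothetical descriptor for one physically contiguous transfer. */
typedef struct {
    uint64_t phys_addr;
    size_t   len;
} dma_chunk;

/* Illustrative sketch: split the virtual range [vaddr, vaddr + len) into
 * physically contiguous chunks. page_table[vpn] is assumed to hold the
 * physical page number for virtual page vpn. Returns the chunk count. */
size_t split_for_dma(const uint64_t *page_table, uint64_t vaddr, size_t len,
                     dma_chunk *out, size_t max_chunks)
{
    size_t count = 0;

    while (len > 0) {
        uint64_t offset  = vaddr % PAGE_SIZE;
        size_t   in_page = (size_t)(PAGE_SIZE - offset);
        if (in_page > len) {
            in_page = len;
        }
        uint64_t paddr = page_table[vaddr / PAGE_SIZE] * PAGE_SIZE + offset;

        if (count > 0 &&
            out[count - 1].phys_addr + out[count - 1].len == paddr) {
            out[count - 1].len += in_page;    /* physically adjacent: merge */
        } else {
            assert(count < max_chunks);       /* caller must size the array */
            out[count].phys_addr = paddr;
            out[count].len = in_page;
            count++;
        }
        vaddr += in_page;
        len   -= in_page;
    }
    return count;
}
```

In this sketch a range spanning three virtual pages, of which only the first two are physically adjacent, yields two chunks. A real copy would need the same treatment for both source and destination, and each resulting chunk would then incur its own programming and completion-synchronization overhead.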