In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip. Increasing clock speed through further size reduction and other improvements continues to be a goal. In addition to increasing clock speeds, it is possible to increase the throughput of an individual CPU by increasing the average number of operations executed per clock cycle.
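The relationship between clock speed, operations per clock cycle, and task time can be illustrated with a minimal sketch. The function name `execution_time` and the figures used (one billion operations, a 1 GHz clock) are hypothetical values chosen only for illustration.

```python
def execution_time(operations: int, ops_per_cycle: float, clock_hz: float) -> float:
    """Return the seconds needed to perform `operations` operations.

    Throughput is modeled crudely as ops_per_cycle * clock_hz
    (operations per second), so time = operations / throughput.
    """
    return operations / (ops_per_cycle * clock_hz)


# Hypothetical workload: one billion operations.
OPS = 1_000_000_000

baseline = execution_time(OPS, ops_per_cycle=1.0, clock_hz=1e9)
doubled_clock = execution_time(OPS, ops_per_cycle=1.0, clock_hz=2e9)
doubled_ops_per_cycle = execution_time(OPS, ops_per_cycle=2.0, clock_hz=1e9)
```

Under this simple model, doubling either the clock speed or the average number of operations per cycle halves the time for the task.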
A typical computer system can store a vast amount of data, and the processor may be called upon to use any part of this data. The devices typically used for storing mass data (e.g., rotating magnetic hard disk drive storage units) have a relatively long latency for accessing the data stored thereon. If a processor were to access data directly from such a mass storage device every time it performed an operation, it would spend nearly all of its time waiting for the storage device to return the data, and its throughput would be very low indeed. As a result, computer systems store data in a hierarchy of memory or storage devices, each succeeding level having faster access, but storing less data. At the lowest level is the mass storage unit or units, which store all the data on relatively slow devices. Moving up the hierarchy is a main memory, which is generally semiconductor memory. Main memory has a much smaller data capacity than the storage units, but a much faster access. Higher still are caches, which may be at a single level, or multiple levels (level 1 being the highest), of the hierarchy. Caches are also semiconductor memory, but are faster than main memory, and again have a smaller data capacity. One may even consider externally stored data, such as data accessible by a network connection, to be a further level of the hierarchy below the computer system's own mass storage units, since the volume of data potentially available from network connections (e.g., the Internet) is larger still, but access time is slower.
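The cost of finding data at different levels of such a hierarchy can be sketched as follows. The level names follow the description above, but the latency values are hypothetical, in arbitrary units, and are not measurements of any real system. The sketch also assumes that each level is probed in turn from the highest downward.

```python
# Hypothetical hierarchy, ordered from highest (fastest, smallest) to
# lowest (slowest, largest). Latencies are illustrative units only.
LEVELS = [
    ("L1 cache", 1),
    ("L2 cache", 10),
    ("main memory", 100),
    ("mass storage", 100_000),
]


def access_cost(resident_level: str) -> int:
    """Return the total latency when data is first found at `resident_level`.

    Each higher level is probed (and misses) before the next lower level
    is tried, so the latencies of all probed levels accumulate.
    """
    total = 0
    for name, latency in LEVELS:
        total += latency
        if name == resident_level:
            return total
    raise ValueError(f"unknown level: {resident_level}")
```

With these assumed figures, data resident in the L1 cache costs 1 unit, while data that must come from mass storage costs 100,111 units, which illustrates why keeping needed data high in the hierarchy matters.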
Data is moved from mass storage, to main memory, to cache, for use by the processor. Ideally, data is moved into the cache level closest to the processor before it is needed by the processor, so that when it is needed, the processor does not have to wait for a relatively long latency data access to complete. However, since the capacity of any of the cache levels is only a small fraction of the capacity of main memory, which is itself only a small fraction of the capacity of the mass storage unit(s), it is not possible to simply load all the data into the cache. Some technique must exist for selecting data to be stored in cache, so that when the processor needs a particular data item, it will probably be there.
A cache is typically divided into units of data called lines, a line being the smallest unit of data that can be independently loaded into the cache or removed from the cache. In simple cache designs, data is loaded into a cache line on demand, i.e., upon the occurrence of a cache miss. That is, when the processor needs some piece of data which is not in the cache (a cache miss), the required data is obtained from a lower level of cache, or from memory, and loaded into a cache line. Since the cache is of fixed size, this necessarily means that an existing line of cache data must be selected for removal. Various techniques and devices exist for selecting an existing cache line for removal.
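Loading on demand can be sketched as follows. The text above does not specify how a line is selected for removal; this sketch assumes a least-recently-used (LRU) policy purely for illustration. The class name `DemandCache` and the 64-byte line size are likewise hypothetical.

```python
from collections import OrderedDict


class DemandCache:
    """Cache that loads a line only when a miss occurs (assumed LRU removal)."""

    def __init__(self, num_lines: int, line_size: int = 64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # line number -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Access `address`; return True on a hit, False on a miss."""
        line = address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # remove the least recently used line
        self.lines[line] = True  # load the missing line on demand
        return False


cache = DemandCache(num_lines=2)
results = [cache.access(a) for a in (0, 8, 64, 128, 0)]
```

In this usage, addresses 0 and 8 fall in the same line, so the second access hits. Loading the line for address 128 removes the line holding address 0, so the final access misses again.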
Loading on demand is conceptually simple, but results in a high number of cache misses, and resultant idle time while the processor waits for the necessary data. Accordingly, many sophisticated processor designs employ some form of pre-fetching of cache data. Pre-fetching simply means that a predictive technique exists whereby data which is considered likely to be needed soon is loaded into one or more of the cache levels, before the processor actually requires the data. If the predictive technique is accurate and timely, data will be in the L1 cache before an L1 cache miss occurs, or in some other level of cache from which it can be accessed much faster than from main memory.
Several known pre-fetching techniques exist, which can be used alone or in combination. One technique is sequential pre-fetching, i.e., the next sequential line of address space is pre-fetched, on the theory that this is the most likely data needed next. A confirmation bit may be used with sequential pre-fetching, whereby the bit is set on if data in the cache line was actually accessed when it was last pre-fetched into cache and otherwise set off. If the confirmation bit for the next sequential line of data is set on, the line will be pre-fetched to cache.
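Sequential pre-fetching with a confirmation bit might be modeled as in the sketch below. Several details are assumptions not stated above: lines are identified by integer line numbers, a line with no recorded bit is treated as set on, and the bit is set off when a pre-fetched line is removed from the cache without having been accessed. The names `SequentialPrefetcher`, `on_access`, and `on_evict` are hypothetical.

```python
class SequentialPrefetcher:
    """Pre-fetch line N+1 on an access to line N, gated by a confirmation bit."""

    def __init__(self):
        self.confirmation = {}  # line -> bit; unknown lines default to on (assumption)
        self.pending = set()    # lines pre-fetched but not yet accessed

    def on_access(self, line: int):
        """Record an access to `line`; return the line to pre-fetch, or None."""
        if line in self.pending:
            # The earlier pre-fetch of this line proved useful: set the bit on.
            self.pending.discard(line)
            self.confirmation[line] = True
        next_line = line + 1
        if self.confirmation.get(next_line, True):
            self.pending.add(next_line)
            return next_line
        return None

    def on_evict(self, line: int) -> None:
        """Record that `line` was removed from the cache."""
        if line in self.pending:
            # Pre-fetched but never accessed: set the bit off.
            self.pending.discard(line)
            self.confirmation[line] = False


prefetcher = SequentialPrefetcher()
first = prefetcher.on_access(10)   # pre-fetches line 11
prefetcher.on_evict(11)            # line 11 removed without being accessed
second = prefetcher.on_access(10)  # bit for line 11 is now off
```

For simplicity the sketch does not check whether the next line is already in the cache; a real design would presumably skip the pre-fetch in that case.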
Another technique involves the use of a branch target buffer. The branch target buffer is a set of branch target addresses, each address associated with a respective cache line. The branch target buffer records the address referenced by the processor immediately after referencing the associated cache line. When a cache line is referenced, its associated branch target address is a good candidate for pre-fetching. The branch target buffer is more general than sequential pre-fetching, since the branch target may be any address, but it requires substantially more overhead to implement.
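The branch target buffer described above can be sketched as a mapping from each cache line to the line referenced immediately after it. The class and method names are hypothetical, lines are identified by integer line numbers, and consecutive references to the same line are ignored as an assumption of this sketch.

```python
class BranchTargetBuffer:
    """Remember, for each line, the line referenced immediately after it."""

    def __init__(self):
        self.target = {}      # line -> line referenced right after it
        self.previous = None  # most recently referenced line

    def on_access(self, line: int):
        """Record a reference to `line`; return its pre-fetch candidate, or None."""
        if self.previous is not None and self.previous != line:
            # Update the entry for the line referenced just before this one.
            self.target[self.previous] = line
        self.previous = line
        return self.target.get(line)


btb = BranchTargetBuffer()
candidates = [btb.on_access(line) for line in (5, 40, 7, 5)]
```

On the second reference to line 5, the buffer returns line 40, which followed line 5 the first time. Unlike sequential pre-fetching, the candidate need not be adjacent in the address space, at the cost of storing one target per line.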
Yet another pre-fetching technique, known as “technical stream pre-fetching”, requires the use of a special technical stream address buffer. Each buffer entry is a list of addresses previously accessed by the processor in sequence. If a first address on the list is referenced by the processor, then the remaining addresses are good candidates for pre-fetching. Like the branch target buffer, this technique involves some overhead to implement.
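A stream address buffer of this kind might be sketched as follows. The details are assumptions made for illustration: each entry is keyed by its first address, entries have a fixed length, and entries are recorded from a sliding window of recent accesses. The names `StreamAddressBuffer` and `stream_length` are hypothetical.

```python
class StreamAddressBuffer:
    """Record short address sequences; replay them as pre-fetch candidates."""

    def __init__(self, stream_length: int = 4):
        self.stream_length = stream_length
        self.streams = {}  # first address -> addresses that followed it
        self.window = []   # most recent accesses, at most stream_length long

    def on_access(self, address: int) -> list:
        """Record an access; return the addresses to pre-fetch (possibly empty)."""
        # If this address begins a recorded list, the rest are candidates.
        candidates = list(self.streams.get(address, []))
        self.window.append(address)
        if len(self.window) == self.stream_length:
            head, *rest = self.window
            self.streams[head] = rest  # record or refresh the entry for `head`
            self.window.pop(0)         # slide the window forward by one access
        return candidates


buffer = StreamAddressBuffer(stream_length=3)
for addr in (100, 300, 200):
    buffer.on_access(addr)
replayed = buffer.on_access(100)
```

After the sequence 100, 300, 200 has been observed once, a later reference to address 100 yields 300 and 200 as pre-fetch candidates, even though the addresses are not sequential.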
Although conventional pre-fetching techniques such as these have some predictive value, they are still very limited. It has been observed that in many environments, the processor spends the bulk of its time idling on cache misses. Substantial performance improvement would be possible with more accurate and comprehensive pre-fetching techniques, which would significantly reduce the frequency of cache misses.