In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip, and increased clock speed through further size reduction and other improvements continues to be a goal. In addition to increasing clock speeds, it is possible to increase the throughput of an individual CPU by increasing the average number of operations executed per clock cycle.
A typical computer system can store a vast amount of data, and the processor may be called upon to use any part of this data. The devices typically used for storing mass data (e.g., rotating magnetic hard disk drive storage units) require relatively long latency time to access data stored thereon. If a processor were to access data directly from such a mass storage device every time it performed an operation, it would spend nearly all of its time waiting for the storage device to return the data, and its throughput would be very low indeed. As a result, computer systems store data in a hierarchy of memory or storage devices, each succeeding level having faster access, but storing less data. At the lowest level is the mass storage unit or units, which store all the data on relatively slow devices. Moving up the hierarchy is a main memory, which is generally semiconductor memory. Main memory has a much smaller data capacity than the storage units, but a much faster access. Higher still are caches, which may be at a single level, or multiple levels (level 1 being the highest), of the hierarchy. Caches are also semiconductor memory, but are faster than main memory, and again have a smaller data capacity. One may even consider externally stored data, such as data accessible by a network connection, to be even a further level of the hierarchy below the computer system's own mass storage units, since the volume of data potentially available from network connections (e.g., the Internet) is even larger still, but access time is slower.
When the processor generates a memory reference address, it looks for the required data first in cache (which may require searches at multiple cache levels). If the data is not there (referred to as a “cache miss”), the processor obtains the data from memory, or if necessary, from storage. Memory access requires a relatively large number of processor cycles, during which the processor is generally idle. Ideally, the cache level closest to the processor stores the data which is currently needed by the processor, so that when the processor generates a memory reference, it does not have to wait for a relatively long latency data access to complete. However, since the capacity of any of the cache levels is only a small fraction of the capacity of main memory, which is itself only a small fraction of the capacity of the mass storage unit(s), it is not possible to simply load all the data into the cache. Some technique must exist for selecting data to be stored in cache, so that when the processor needs a particular data item, it will probably be there.
A cache is typically divided into units of data called lines, a line being the smallest unit of data that can be independently loaded into the cache or removed from the cache. In order to support any of various selective caching techniques, caches are typically addressed using associative sets of cache lines. An associative set is a set of cache lines, all of which share a common cache index number. The cache index number is typically derived from high-order bits of a referenced address, although it may include other bits as well. The cache being much smaller than main memory, an associative set holds only a small portion of the main memory addresses which correspond to the cache index number. Since each associative set typically contains multiple cache lines, the contents of the associative set can be selectively chosen from main memory according to any of various techniques.
Typically, data is loaded into a high level cache upon the occurrence of a cache miss. Conventional techniques for selecting data to be stored in the cache also include various pre-fetching techniques, which attempt to predict that data in a particular cache line will be needed in advance of an actual memory reference to that cache line, and accordingly load the data to the cache in anticipation of a future need. Since the cache has limited capacity, loading data upon a cache miss or by pre-fetching necessarily implies that some data currently in the cache will be removed, or cast out, of the cache. Again, various conventional techniques exist for determining which data will be cast out in such an event.
Although conventional techniques for selecting the cache contents have achieved limited success, it has been observed that in many environments, the processor spends the bulk of its time idling on cache misses. The typical approaches to this problem have been to increase the size and/or associativity of the cache, both of which involve significant additional hardware. There exists a need for improved techniques for the design and operation of caches.