In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
In addition to increasing clock speeds, it is possible to improve system throughput by using multiple copies of certain components, and in particular, by using multiple CPUs. Without delving too deeply into the architectural issues introduced by using multiple CPUs, it can still be said that there are benefits to increasing the throughput of the individual CPU, whether or not a system uses multiple CPUs. For a given clock speed, it is possible to increase the throughput of an individual CPU by increasing the average number of operations executed per clock cycle.
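The relationship between clock speed, operations per cycle, and throughput described above can be illustrated with a minimal sketch. The function name and the sample figures (1 GHz, one or two operations per cycle) are assumptions chosen only for illustration; they do not describe any particular processor.

```python
def throughput_ops_per_second(clock_hz: float, ops_per_cycle: float) -> float:
    """Crude throughput estimate: cycles per second times average operations per cycle."""
    return clock_hz * ops_per_cycle


# Hypothetical baseline: 1 GHz clock, one operation per cycle on average.
baseline = throughput_ops_per_second(1e9, 1.0)

# Doubling the clock speed doubles the crude throughput figure.
faster_clock = throughput_ops_per_second(2e9, 1.0)

# Doubling the average operations per cycle at the original clock speed
# gives the same figure without touching the clock.
more_parallel = throughput_ops_per_second(1e9, 2.0)

print(baseline, faster_clock, more_parallel)
```

Under this crude measure, raising the average number of operations per cycle is interchangeable with raising the clock speed, which is why parallelism within a single CPU is attractive once clock speed gains become harder to obtain.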
Various advances in parallelism have enabled computer system designers to increase the average number of operations executed per clock cycle within an individual CPU. For example, certain wide instruction computers (sometimes known as “wide issue superscalar” or “very long instruction word” computers) enable each instruction to specify multiple operations to be performed in parallel, and accordingly contain the parallel hardware necessary for executing such wide instructions. However, generating program code that takes full advantage of this capability is somewhat difficult. Another approach to parallelism, which can sometimes be combined with other techniques, is support for multiple threads of execution (i.e., multiple streams of encoded instructions) within a single computer processor. Multi-threaded processors generally have parallel banks of registers which permit the processor to maintain the state of multiple threads of execution at one time.
Recently, another approach to parallelism has gained favor, in which the CPU can dispatch multiple instructions for execution in a given clock cycle. Using this approach, the hardware analyzes the instruction stream to determine whether dependencies exist (i.e., whether one instruction must wait for a previous instruction to be performed first), and selects non-conflicting instructions for parallel execution where possible. This approach may be used with a multi-threaded processor or a single-threaded processor.
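The dependency check described above can be sketched as follows. This is a simplified illustration, not a description of any actual dispatch hardware: the `Instr` record, the register names, the in-order selection policy, and the assumption that a result is available to the next cycle are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def conflicts(earlier: Instr, later: Instr) -> bool:
    """True if `later` cannot be dispatched in the same cycle as `earlier`."""
    read_after_write = earlier.dest in later.srcs
    write_after_write = earlier.dest == later.dest
    write_after_read = later.dest in earlier.srcs
    return read_after_write or write_after_write or write_after_read


def select_group(window, width):
    """Take instructions in order until the dispatch width is reached or a conflict is found."""
    group = []
    for instr in window:
        if len(group) == width or any(conflicts(g, instr) for g in group):
            break
        group.append(instr)
    return group


def schedule(program, width):
    """Return the instruction names dispatched in each cycle."""
    cycles, remaining = [], list(program)
    while remaining:
        group = select_group(remaining, width)
        cycles.append([i.name for i in group])
        remaining = remaining[len(group):]
    return cycles


program = [
    Instr("i1", "r1", ("r2", "r3")),
    Instr("i2", "r4", ("r5", "r6")),
    Instr("i3", "r7", ("r1", "r4")),  # must wait for i1 and i2
    Instr("i4", "r8", ("r2", "r5")),
]

print(schedule(program, width=4))  # [['i1', 'i2'], ['i3', 'i4']]
print(schedule(program, width=1))  # one instruction per cycle
```

In this example the four instructions complete dispatch in two cycles instead of four, because only `i3` depends on earlier results.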
Any form of parallelism within a processor generally requires additional hardware, in some cases duplicating entire sets of logic circuits, registers or other components. But it is also true that there should be some commonality; otherwise, nothing is gained over a mere duplication of the processor itself and all associated supporting circuits.
Most computer systems store data in a hierarchy of memory or storage devices, each succeeding level having faster access, but storing less data. At the lowest level is the mass storage unit or units, which store all the data on relatively slow devices. Moving up the hierarchy is a main memory, which is generally semiconductor memory. Main memory has a much smaller data capacity than the storage units, but much faster access. Higher still are caches, which may occupy a single level or multiple levels of the hierarchy (level 1 being the highest). Caches are also semiconductor memory, but are faster than main memory, and again have a smaller data capacity.
When the processor needs data, it looks first in the cache and, if the cache has multiple levels, in the highest-level cache first. Retrieving data from lower-level caches, main memory, or disk storage requires progressively more time, so much time that a processor can spend the bulk of its time merely waiting for data to be retrieved. One way in which the average number of operations per clock cycle can be increased is to increase the proportion of accesses for which the needed data is in the cache (a “cache hit”), and preferably at the highest level, rather than in some entity lower in the memory hierarchy. Various techniques exist for selecting data to be held in the cache, but all other things being equal, the probability of a cache hit can be increased by increasing the size of the cache.
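The effect of cache size on the hit rate can be illustrated with a small simulation. This is a sketch under stated assumptions: the cache is fully associative with least-recently-used replacement, the capacity is at least one entry, and the access trace (a repeated sweep over eight addresses) is hypothetical. Real caches use a variety of organizations and replacement policies.

```python
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Fraction of accesses found in a fully associative LRU cache of `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits / len(addresses)


# Hypothetical trace: sweep over eight addresses, repeated ten times.
trace = list(range(8)) * 10

print(hit_rate(trace, capacity=4))  # working set does not fit
print(hit_rate(trace, capacity=8))  # working set fits after the first sweep
```

With this trace, the four-entry cache never hits because each address is evicted before it is reused, while the eight-entry cache misses only on the first sweep.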
Where a computer processor employs any of various parallelism techniques, it is possible that multiple accesses, such as multiple reads, to the same cache will take place simultaneously. Simple semiconductor memory designs support only a single read access to a bank of memory devices at one time. The cache can therefore become a bottleneck to performance in a processor employing parallelism.
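The bottleneck can be made concrete with a simple model in which each bank of memory services one read per cycle, so simultaneous reads directed to the same bank must be serialized. The function name, the `num_banks` parameter, and the mapping of addresses to banks by remainder are assumptions made for this sketch only.

```python
from collections import Counter


def cycles_to_service(read_addresses, num_banks):
    """Cycles needed when each bank can service only one read per cycle.

    Addresses are mapped to banks by remainder (an assumption of this sketch).
    Reads to different banks proceed in parallel; reads to the same bank queue up.
    """
    if not read_addresses:
        return 0
    reads_per_bank = Counter(addr % num_banks for addr in read_addresses)
    return max(reads_per_bank.values())


# Four simultaneous reads against a single bank take four cycles.
print(cycles_to_service([0, 1, 2, 3], num_banks=1))

# The same reads spread across four banks complete in one cycle.
print(cycles_to_service([0, 1, 2, 3], num_banks=4))

# Four banks do not help when every read targets the same bank.
print(cycles_to_service([0, 4, 8, 12], num_banks=4))
```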
It is possible to permit multiple simultaneous cache accesses by simply providing separate caches for each access path. For example, this approach is used in some multi-threaded processor designs, in which each thread has its own cache, which can be accessed independently of the others. But a consequence of separate caches is that each individual cache is necessarily only a fraction of the whole, which reduces the probability of a cache hit.
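The cost of dividing a cache can be illustrated by comparing one shared cache against two private caches of half the size. The sketch below assumes least-recently-used replacement and two hypothetical threads with unequal working sets (six lines and two lines); the figures are illustrative only.

```python
from collections import OrderedDict


def lru_hits(accesses, capacity):
    """Number of hits for an access sequence in an LRU cache of `capacity` entries."""
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[key] = True
    return hits


# Hypothetical threads: A cycles over six lines, B over two lines.
thread_a = [("a", i) for i in range(6)] * 10   # 60 accesses
thread_b = [("b", i) for i in range(2)] * 30   # 60 accesses

# One shared eight-entry cache sees the two streams interleaved.
interleaved = [access for pair in zip(thread_a, thread_b) for access in pair]
shared_hits = lru_hits(interleaved, capacity=8)

# Two private four-entry caches, one per thread, with the same total capacity.
private_hits = lru_hits(thread_a, capacity=4) + lru_hits(thread_b, capacity=4)

print(shared_hits, private_hits)
```

Here the shared cache holds both working sets at once, whereas the private cache for thread A is too small for its six lines and the spare capacity in thread B's cache cannot be lent to it.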
It is further possible to design multi-port caches, in which the individual memory cells and supporting hardware permit multiple cells to be read from or written to simultaneously. However, these designs introduce significant additional circuit complexity and, as a result of the additional logic circuits required, increase the time required to access the cache.
As the demand for ever faster processors grows, it is likely that processor parallelism will increase. Further increases in the number of ports in a conventional multi-port cache design will only exacerbate the existing problems of circuit complexity and access time. It is therefore desirable to find alternative techniques for providing multiple parallel accesses to cached data, which reduce or eliminate the drawbacks associated with conventional techniques.