In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises one or more central processing units (CPUs) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing so much faster. Therefore, continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip. In addition to increasing clock speeds, many design improvements to processors have made it possible to increase the throughput of an individual CPU by increasing the average number of operations executed per clock cycle within each processor.
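As a rough illustration of the relationship described above, throughput can be approximated as the clock frequency multiplied by the average number of operations executed per clock cycle. The function name and the sample figures in the following sketch are hypothetical and chosen only for illustration; real systems are limited by many other factors.

```python
def throughput_ops_per_sec(clock_hz: float, ops_per_cycle: float) -> float:
    """Crude throughput estimate: operations completed per second."""
    return clock_hz * ops_per_cycle


# Hypothetical figures, for illustration only.
base = throughput_ops_per_sec(1e9, 1.0)          # 1 GHz, one operation per cycle
faster_clock = throughput_ops_per_sec(2e9, 1.0)  # double the clock speed
wider_core = throughput_ops_per_sec(1e9, 2.0)    # double the operations per cycle

# Either change doubles the estimate, so a fixed task would take half the time.
print(faster_clock / base, wider_core / base)
```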
Independently of all the improvements to the individual processor, it is further possible to increase system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical, and integrated circuit technology has even progressed to the point where it is possible to construct multiple processors on a single integrated circuit chip. However, one does not simply double a system's throughput by going from one processor to two. The introduction of multiple processors to a system creates numerous architectural problems. Each processor puts additional demands on the other components of the system such as storage, I/O, memory, and particularly, the communications buses that connect various components. As more processors are introduced, these architectural issues become increasingly complex, scalability becomes more difficult, and there is greater likelihood that processors will spend significant time waiting for some resource being used by another processor.
All of these issues and more are known by system designers, and have been addressed in one form or another. While perfect solutions are not available, improvements in this field continue to be made.
One architectural approach that has gained some favor in recent years is the design of computer systems having discrete nodes of processors and associated memory, also known as distributed shared memory computer systems or non-uniform memory access (NUMA) computer systems. In a conventional symmetrical multi-processor system, main memory is designed as a single large data storage entity, which is equally accessible to all CPUs in the system. As the number of CPUs increases, there are greater bottlenecks in the buses and accessing mechanisms to such main memory. A NUMA system addresses this problem by dividing main memory into discrete subsets, each of which is physically associated with a respective CPU, or more typically, a respective group of CPUs. A subset of memory and associated CPUs and other hardware is sometimes called a “node”. A node typically has an internal memory bus providing direct access from a CPU to a local memory within the node. Indirect mechanisms, which are slower, exist to access memory across node boundaries. Thus, while any CPU can still access any arbitrary memory location, a CPU can access addresses in its own node faster than it can access addresses outside its node (hence, the term “non-uniform memory access”). By limiting the number of devices on the internal memory bus of a node, bus arbitration mechanisms and bus traffic can be held to manageable levels even in a system having a large number of CPUs, since most of these CPUs will be in different nodes. From a hardware standpoint, this means that a NUMA system architecture has the potential advantage of increased scalability.
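The non-uniform access property can be sketched with a simple model. The sketch below assumes, purely for illustration, that physical memory is split into equal contiguous ranges, one per node, and that local and remote accesses each have a single fixed latency. The class name, the latency constants, and the partitioning scheme are all hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass

# Hypothetical latencies in nanoseconds; real values depend on the hardware.
LOCAL_NS = 100
REMOTE_NS = 300


@dataclass(frozen=True)
class NumaSystem:
    num_nodes: int
    bytes_per_node: int  # size of the memory subset owned by each node

    def home_node(self, address: int) -> int:
        """Return the node whose local memory contains the address."""
        node = address // self.bytes_per_node
        if not 0 <= node < self.num_nodes:
            raise ValueError("address outside physical memory")
        return node

    def access_latency_ns(self, cpu_node: int, address: int) -> int:
        """Any CPU can reach any address, but local accesses are faster."""
        return LOCAL_NS if self.home_node(address) == cpu_node else REMOTE_NS


system = NumaSystem(num_nodes=4, bytes_per_node=1 << 30)
print(system.access_latency_ns(cpu_node=0, address=0x1000))              # 100
print(system.access_latency_ns(cpu_node=0, address=(1 << 30) + 0x1000))  # 300
```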
A typical computer system can store a vast amount of data, and a CPU may be called upon to use any part of this data. The devices typically used for storing mass data (e.g., rotating magnetic hard disk drive storage units) have a relatively long latency for accessing the data stored thereon. If a processor were to access data directly from such a mass storage device every time it performed an operation, it would spend nearly all of its time waiting for the storage device to return the data, and its throughput would be very low indeed. As a result, computer systems store data in a hierarchy of memory or storage devices, each succeeding level having faster access, but storing less data. At the lowest level is the mass storage unit or units, which store all the data on relatively slow devices. Moving up the hierarchy is a main memory, which is generally semiconductor memory. Main memory has a much smaller data capacity than the storage units, but much faster access. Higher still are caches, which may be at a single level, or multiple levels (level 1 being the highest), of the hierarchy. Caches are also semiconductor memory, but are faster than main memory, and again have a smaller data capacity. Relatively small units of data from memory, called “cache lines”, are stored in cache when needed and deleted when not needed, according to any of various algorithms. In a multi-processor system, cache memory is typically associated with particular processors or groups of processors. For example, a level 1 cache is usually physically constructed on the same integrated circuit chip as the processor, and is used only by a single processor. A lower level cache might be used by a single processor, or shared by a subset of the processors on the system.
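The movement of cache lines in and out of a cache can be sketched as follows. The text above leaves the replacement algorithm open; this sketch assumes least-recently-used (LRU) replacement as one common choice. The class name, the 128-byte line size, and the use of a byte array to stand in for main memory are hypothetical, and only reads are modeled.

```python
from collections import OrderedDict

LINE_SIZE = 128  # hypothetical cache line size in bytes


class LineCache:
    """Holds a bounded number of cache lines, evicting the least recently used."""

    def __init__(self, capacity_lines: int, backing: bytearray):
        self.capacity = capacity_lines
        self.backing = backing      # stands in for main memory
        self.lines = OrderedDict()  # line number -> bytes of that line
        self.hits = 0
        self.misses = 0

    def read(self, address: int) -> int:
        """Return the byte at the address, loading its cache line on a miss."""
        line_no, offset = divmod(address, LINE_SIZE)
        if line_no in self.lines:
            self.hits += 1
            self.lines.move_to_end(line_no)     # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            start = line_no * LINE_SIZE
            self.lines[line_no] = bytes(self.backing[start:start + LINE_SIZE])
        return self.lines[line_no][offset]
```

With a two-line cache, reading addresses 0, 5, 128, 256 and then 0 again yields one hit (address 5 falls in the line already loaded for address 0) and four misses, because the line holding address 0 is evicted when the third distinct line is loaded.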
Where a computer system contains multiple processors, whether of a NUMA architecture or some other form of multi-processor design, an issue of cache coherency arises. The issue arises because multiple copies of the same data may exist simultaneously in different caches, associated with different processors or groups of processors. If multiple processors were to alter different copies of the same data stored in different caches, there would be a possibility of data corruption. Accordingly, multi-processor systems employ cache coherency techniques to prevent this from happening. Conventional cache coherency techniques involve the association of a respective coherency state with each cache line in a cache. For example, data may be in a “shared” state, meaning copies of the data may exist elsewhere, or in an “exclusive” state, meaning no other copies are permitted. If data in a “shared” state is altered, then all other copies of the same data in other caches are changed to an “invalid” state, indicating that the copy is no longer reliable, and cannot be saved to main memory or storage. Additional states may be defined.
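The three states named above can be sketched as a small state-tracking model. This is a simplified illustration and not a complete coherency protocol: it assumes that a writer holds the line in the “exclusive” state after altering it, and it omits the additional states that practical protocols define. The class and method names are hypothetical.

```python
from enum import Enum
from typing import List


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class CoherentCaches:
    """Tracks the coherency state of each cache line in each cache."""

    def __init__(self, num_caches: int):
        # One dict per cache: line number -> State. A missing key means INVALID.
        self.caches = [dict() for _ in range(num_caches)]

    def state(self, cache_id: int, line: int) -> State:
        return self.caches[cache_id].get(line, State.INVALID)

    def _valid_elsewhere(self, cache_id: int, line: int) -> List[int]:
        return [i for i in range(len(self.caches))
                if i != cache_id and self.state(i, line) is not State.INVALID]

    def read(self, cache_id: int, line: int) -> None:
        """Load a line for reading; copies held elsewhere become shared."""
        if self.state(cache_id, line) is not State.INVALID:
            return  # already holds a valid copy, state is unchanged
        others = self._valid_elsewhere(cache_id, line)
        for i in others:
            self.caches[i][line] = State.SHARED
        self.caches[cache_id][line] = State.SHARED if others else State.EXCLUSIVE

    def write(self, cache_id: int, line: int) -> List[int]:
        """Alter a line; returns the caches whose copies were invalidated."""
        others = self._valid_elsewhere(cache_id, line)
        for i in others:
            self.caches[i][line] = State.INVALID
        self.caches[cache_id][line] = State.EXCLUSIVE  # assumption, see lead-in
        return others
```

For example, if cache 0 reads a line and cache 1 then reads the same line, both copies are in the shared state; a subsequent write by cache 1 changes the copy in cache 0 to invalid.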
When cached data in a “shared” state is altered, some technique must exist for invalidating other copies of the same data in other caches. In some designs, an invalidation message is simply broadcast to all other caches, allowing appropriate hardware at the receiving end to determine whether any action is required. This simple approach may be appropriate for certain architectures, but it will be observed that in many cases no other copies of the data will exist. Broadcasting therefore causes a large number of unnecessary invalidation messages to be sent. For many computer architectures, and particularly NUMA architectures, it is undesirable to clog the available hardware communications channels with a large number of invalidation messages.
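The cost of the broadcast approach can be illustrated by counting messages. In the following sketch each cache is modeled simply as a set of line numbers; the function name and the sample contents are hypothetical.

```python
from typing import List, Set, Tuple


def broadcast_invalidate(caches: List[Set[int]], writer: int, line: int) -> Tuple[int, int]:
    """Invalidate a line in every cache except the writer's.

    Returns (messages sent, messages that actually found a copy).
    """
    sent = useful = 0
    for cache_id, lines in enumerate(caches):
        if cache_id == writer:
            continue
        sent += 1              # every other cache receives a message
        if line in lines:      # the receiving end decides whether to act
            lines.discard(line)
            useful += 1
    return sent, useful


# Eight caches; only cache 1 shares line 2 with the writer (cache 0).
caches = [{1, 2}, {2}, set(), {3}, set(), set(), set(), set()]
print(broadcast_invalidate(caches, writer=0, line=2))  # (7, 1)
print(broadcast_invalidate(caches, writer=0, line=1))  # (7, 0)
```

In both calls seven messages are sent, although at most one of them finds a copy to invalidate.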
In order to reduce the number of invalidation messages, the system may maintain one or more directories of cache line state information. Particularly, in a NUMA system, each node may contain one or more directories storing cache information for local caches as well as remote caches. I.e., a remote directory lists those cache lines which are stored in caches of other nodes, and state information for those cache lines. In order to avoid duplicating information and have a single point of reference, the remote directory in each node lists only those cache lines which are contained in main memory associated with the node. Conventionally, such directories are arranged as set-associative indexes. The associativity of such a directory must be sufficiently large to accommodate the combined capacity of the caches. I.e., the index must be sufficiently large that there will be available space for an index entry any time data from the node is stored in a cache.
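A set-associative remote directory of the kind described above might be sketched as follows. The sketch assumes the index is selected by the low-order part of the line number and the remainder serves as a tag, which is one conventional arrangement; the class name, method names, and sizes are hypothetical. Rather than choosing a victim entry, the sketch raises an error when a set is full, to make visible why the associativity must be large enough.

```python
class RemoteDirectory:
    """Set-associative index of local-memory lines held in remote caches."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        # Each set maps tag -> set of remote node ids recorded as holders.
        self.sets = [dict() for _ in range(num_sets)]

    def _locate(self, line: int):
        """Split a line number into its set (by index) and its tag."""
        return self.sets[line % self.num_sets], line // self.num_sets

    def record(self, line: int, node: int) -> None:
        """Note that a remote node has stored the line in its cache."""
        entries, tag = self._locate(line)
        if tag not in entries and len(entries) >= self.ways:
            raise RuntimeError("set full: associativity too small")
        entries.setdefault(tag, set()).add(node)

    def invalidation_targets(self, line: int) -> set:
        """Remove the entry and return the nodes that need an invalidation."""
        entries, tag = self._locate(line)
        return entries.pop(tag, set())
```

With such a directory, an invalidation for a line is sent only to the nodes returned by `invalidation_targets`, and no message at all is sent for a line that has no entry.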
Cache lines are constantly being moved in and out of caches. For many system architectures, it is impractical to keep track of all cache activity in such a directory. In particular, in a NUMA architecture, it is generally impractical for a given node to keep track of all the cache activity taking place in other nodes. If a cache line from a node's memory is stored in a remote cache, the node will not receive any notification when the cache line is removed from the remote cache. As a result, information in the remote cache directory of a NUMA node is overly inclusive. I.e., in typical operation, the remote cache directory contains a large number of entries for cache lines which have already been deleted from or invalidated in remote caches. The consequence of these extraneous entries is that unnecessary invalidation messages are often sent. Unnecessary invalidation messages will not corrupt system data, but they will reduce performance. It would be desirable to have more accurate information in the directory to reduce or eliminate these unnecessary invalidation messages, but transmitting messages to track all the cache activity, and particularly transmitting inter-nodal messages in a NUMA system, would generate more communications traffic than it would eliminate. A need therefore exists for improved techniques to reduce unnecessary bus traffic, and in particular to reduce unnecessary cache line invalidation messages.
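The effect of stale directory entries can be illustrated with a small simulation. In this sketch a home node records each remote fetch, remote evictions happen silently, and an invalidation is sent to every recorded holder. All names are hypothetical, and remote caches are modeled simply as sets of line numbers.

```python
class HomeNode:
    """Owns a range of main memory and tracks remote copies of its lines."""

    def __init__(self):
        self.directory = {}  # line -> set of remote node ids recorded as holders

    def remote_fetch(self, line, node_id, remote_caches):
        """A remote node caches the line; the directory records that fact."""
        remote_caches[node_id].add(line)
        self.directory.setdefault(line, set()).add(node_id)

    def invalidate(self, line, remote_caches):
        """Send invalidations to all recorded holders.

        Returns (messages sent, messages that were unnecessary).
        """
        sent = unnecessary = 0
        for node_id in self.directory.pop(line, set()):
            sent += 1
            if line in remote_caches[node_id]:
                remote_caches[node_id].discard(line)
            else:
                unnecessary += 1  # directory entry was stale
        return sent, unnecessary


remote_caches = {1: set(), 2: set()}
home = HomeNode()
home.remote_fetch(40, 1, remote_caches)
home.remote_fetch(40, 2, remote_caches)
remote_caches[2].discard(40)  # silent eviction: the home node is not notified
print(home.invalidate(40, remote_caches))  # (2, 1)
```

Two messages are sent, but the one addressed to node 2 is unnecessary because that node had already removed the line from its cache.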