In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises one or more central processing units (CPUs) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing so much faster. Therefore, continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of all of the various components simultaneously. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer systems contained processors which were constructed from many discrete components. These systems were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip.
Simply improving the speed of a single component, however, will not necessarily result in a corresponding increase in system throughput. The faster component may spend most of its time idle, waiting for some slower component.
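The limited benefit of speeding up a single component can be illustrated with the relationship commonly known as Amdahl's law, which is not named in the text above and is used here only as an illustrative model. The following is a minimal sketch; the function name `overall_speedup` and its parameters are hypothetical, and the model assumes that a task's time divides cleanly into a portion that uses the faster component and a portion that does not.

```python
def overall_speedup(fraction_accelerated: float, component_speedup: float) -> float:
    """Estimate whole-task speedup when only part of the task gets faster.

    fraction_accelerated: share of the original task time spent in the
        component being improved (0.0 to 1.0). Hypothetical parameter.
    component_speedup: factor by which that component alone is sped up.
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    # Time not spent in the improved component is unchanged;
    # time spent in it shrinks by the component speedup factor.
    new_time = (1.0 - fraction_accelerated) + fraction_accelerated / component_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # A processor twice as fast, on a task that spends half its time waiting on memory.
    print(overall_speedup(0.5, 2.0))
    # The same processor when every part of the task benefits equally.
    print(overall_speedup(1.0, 2.0))
```

Under this model, doubling the speed of a component that accounts for only half of the task time yields well under a twofold gain, which is consistent with the observation that the faster component simply waits on the slower one.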
A computer's CPU operates on data stored in the computer's addressable main memory. The memory stores both the instructions which execute in the processor, and the data which is manipulated by those instructions. In operation, the processor is constantly accessing instructions and other data in memory, without which it is unable to perform useful work. In recent years, improvements to processor speed have generally outpaced improvements to the speed of accessing data in memory. The time required to access this data is therefore a significant factor affecting system throughput.
Memory is typically embodied in a set of integrated circuit modules. The time required to access memory is not only a function of the operational speed of the memory modules themselves, but of the speed of the path between the processor and memory. As computers have grown more complex, this path has consumed a larger share of the access time. Early computers had but a single processor and a relatively small memory, making the path between processor and memory relatively direct. Large modern systems typically contain multiple processors, multiple levels of cache, complex addressing mechanisms, and very large main memories to support the data requirements of the system. In these systems, it is simply not possible for direct paths to exist from every processor to every memory module. Complex bus structures support the movement of data among various system components. Often, data must traverse several structures between the processor and the actual memory module. As the number of processors and size of memory grows, this problem becomes more acute.
One architectural approach that has gained some favor in recent years is the design of computer systems having discrete nodes of processors and associated memory, also known as distributed shared memory computer systems or non-uniform memory access (NUMA) computer systems. In a conventional symmetrical multi-processor (SMP) system, main memory is designed as a single large data storage entity, which is equally accessible to all CPUs in the system. As the number of CPUs increases, there are greater bottlenecks in the buses and accessing mechanisms to such main memory. A NUMA system addresses this problem by dividing main memory into discrete subsets, each of which is physically associated with a respective CPU, or more typically, a respective group of CPUs. A subset of memory and associated CPUs and other hardware is sometimes called a “node”. A node typically has an internal memory bus providing relatively direct access from a CPU to a local memory within the node. Indirect mechanisms, which are slower, exist to access memory across node boundaries. Thus, while any CPU can still access any arbitrary memory location, a CPU can access addresses in its own node faster than it can access addresses outside its node (hence, the term “non-uniform memory access”). By limiting the number of devices on the internal memory bus of a node, bus arbitration mechanisms and bus traffic can be held to manageable levels even in a system having a large number of CPUs, since most of these CPUs will be in different nodes.
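The effect of non-uniform access on average memory latency can be sketched as a weighted average of local and remote access times. The following is a minimal sketch under stated assumptions: the function name `average_access_time`, the latency figures, and the notion of a single fixed "local fraction" are all hypothetical and do not describe any particular NUMA system.

```python
def average_access_time(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Weighted average access time for a CPU in a NUMA node.

    local_fraction: share of accesses that fall within the CPU's own node.
    local_ns: assumed latency of an access within the node (nanoseconds).
    remote_ns: assumed latency of an access across node boundaries.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies: 100 ns within the node, 300 ns across nodes.
    for fraction in (1.0, 0.9, 0.5):
        print(fraction, average_access_time(fraction, 100.0, 300.0))
```

The sketch suggests why a NUMA system performs best when most of a CPU's accesses fall within its own node: the average approaches the local latency as the local fraction approaches one.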
Another design requirement of modern computer systems is flexibility of configuration, i.e., the ability to re-configure the system by adding or re-assigning hardware to handle changing work requirements. A modern multi-processor system architecture typically supports a variable number of processors and memory modules. A system which is configured with a minimum number of such modules can be expanded by adding processors, memory and associated hardware, up to some architecturally defined limit. Simply adding processors and memory to a system sharing a single bus will increase bus contention to the point where the bus is a major bottleneck. Because a NUMA system isolates most of its bus traffic in discrete nodes, it is generally considered more expandable (has increased “scalability” for a large number of processors) than a conventional SMP system.
Due to the need to support hardware configuration upgrades, many large system architectures, whether of a NUMA, SMP or other type, support a heterogeneous mixture of memory modules, i.e., modules of different sizes, bus interface widths, and other parameters.
Unfortunately, flexibility comes at a price. The use of different types of memory modules necessarily increases the complexity of the structures which must interface with the memory. For example, each memory integrated circuit chip has a certain number of rows and columns of memory cells, the number being variable for different types of memory chips. These chips are generally mounted on cards, which may again have differing numbers of modules arranged differently. Depending on types of modules used and their arrangement, the card may internally be divided into banks of different size and configuration, making it possible to access multiple addresses from different banks concurrently. The cards will output data of a certain width through an external interface, the width potentially varying with different memory module types and/or bus configurations.
Conventionally, contiguous bit positions of a real address in memory are allocated to rows, columns, internal banks, modules, and so forth, of memory. This works well if all modules have the same number of rows, columns, etc. But where a heterogeneous set of modules is used, address bits of real memory have different significance depending on the memory module type. Somewhere, there must be logic within the system which receives a data address in memory and determines just how to retrieve the data, given the multiple configurations possible. As the number of possible configurations increases, this logic increases in complexity, potentially causing further delay in accessing memory.
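The dependence of address bit significance on module type can be illustrated by decoding the same address against two different module geometries. The following is a minimal sketch only: the `ModuleGeometry` type, the `decode` function, the field widths, and the ordering of fields (column in the lowest bits, then row, then bank) are hypothetical choices made for illustration and are not taken from any particular memory architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleGeometry:
    """Hypothetical description of how many address bits each field uses."""
    column_bits: int
    row_bits: int
    bank_bits: int


def decode(address: int, geometry: ModuleGeometry) -> dict:
    """Split an address into contiguous bit fields for the given geometry.

    Assumed layout, lowest bits first: column, row, bank.
    """
    if address < 0:
        raise ValueError("address must be non-negative")
    column = address & ((1 << geometry.column_bits) - 1)
    address >>= geometry.column_bits
    row = address & ((1 << geometry.row_bits) - 1)
    address >>= geometry.row_bits
    bank = address & ((1 << geometry.bank_bits) - 1)
    return {"bank": bank, "row": row, "column": column}


if __name__ == "__main__":
    # Two hypothetical module types with different row and column counts.
    small = ModuleGeometry(column_bits=10, row_bits=12, bank_bits=2)
    large = ModuleGeometry(column_bits=11, row_bits=13, bank_bits=2)
    address = 0x123456
    print(decode(address, small))
    print(decode(address, large))
```

The same address resolves to a different row and column under each geometry, so the decoding logic must first know which module type it is addressing. Each additional supported geometry adds another case to that logic.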
A need exists for improved interface techniques for transferring data between processors and memory in a computer system. In particular, a need exists for an improved architectural interface to memory, which supports a heterogeneous collection of memory modules.