Not applicable.
1. Field of the Invention
The present invention generally relates to a computer system that includes a plurality of microprocessors. More particularly, the invention relates to a multiple processor computer system with distributed memory sub-systems accessible by the processors in the system. Still more particularly, the present invention relates to an improved system and method that supports multiple address interleaving techniques that can be active simultaneously to reduce latency and increase memory bandwidth.
2. Background of the Invention
One of the basic issues in any computer system is determining the most efficient technique to address the various memory devices that are present in the system. The memory in a computer system stores data and instructions for subsequent retrieval and use by the processor and other components in the computer system. To facilitate the storage, retrieval and subsequent use of the data and instructions, the processor and other computer system components must be able to identify the address of the stored data. Typically, the computer system implements a defined protocol for assigning addresses to stored data. Whenever data is written to or read from memory, the component requesting the transaction transmits an address signal or command to the memory identifying where the data should be written, or conversely, from where the data should be read. The memory typically has an associated memory controller that includes an address decoder that decodes the bits in the address signal to determine the location within memory being accessed. In a conventional memory system, this includes identifying the page of memory, and within the page, the row and column of the data being written or read. The particular coding in the address signal or command typically identifies the starting address of a particular memory device, while other bits identify the offset within the memory device where the particular access is targeted.
When data is written into memory, typically continuous memory addresses are used to identify contiguous memory locations. Thus, for example, the address 8001 will be followed by address 8002 (both of which would be written in binary format) to identify adjacent memory locations within a page of memory. More recently, it has become commonplace to include banks of memory within a computer system, so that a conventional personal computer system may include a single processor with multiple memory banks accessible via different memory ports. Some or all of the memory banks may be populated with some form of dynamic random access memory ("DRAM"). In systems with multiple memory banks, it has become common to implement some form of interleaving to more efficiently distribute the data within the memory banks. Thus, for example, each continuous address of memory may be distributed among different memory banks, instead of within a single memory bank. The advantage of such an interleaving scheme is that it may increase memory bandwidth, because it permits the higher speed processor to conduct overlapping memory transactions to the slower speed memory banks via the different memory ports.
To implement an interleaving scheme in a single processor system, certain bits in the address command are selected to identify the memory bank being accessed. Thus, for example, if eight memory banks are available in the system, three of the address bits might be used to identify a specific memory bank. If these three address bits are the low order bits in the address command, then consecutive memory addresses are distributed across the memory banks automatically by the system hardware. In such a system, the address 8000 might correspond to an address location in memory bank 1, while address 8001 might correspond to an address location in memory bank 2. Thus, by using the low order address bits to define the memory bank, the system will interleave data among memory banks as the operating system increments through the address space.
If, conversely, the three address bits identifying the memory bank are high order bits (above the bits identifying the virtual page size), then address interleaving typically is performed as part of the software translation from the virtual address to the physical address. Thus, in this type of system, the interleaving is determined by software page placement policy choices typically programmed into the operating system.
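The two bank-selection choices described above can be illustrated with a minimal sketch. It assumes eight banks (three bank bits), as in the example above, and a hypothetical 13-bit page offset (8 KB pages) chosen only for illustration. The function names are likewise hypothetical, and banks are numbered from 0 here rather than from 1.

```python
BANK_BITS = 3                      # eight memory banks, as in the example above
BANK_MASK = (1 << BANK_BITS) - 1
PAGE_BITS = 13                     # assumption: 8 KB pages, for illustration only


def bank_low_order(addr: int) -> int:
    """Bank chosen by the lowest address bits: consecutive addresses rotate banks."""
    return addr & BANK_MASK


def bank_high_order(addr: int) -> int:
    """Bank chosen by bits just above the page offset: a whole page maps to one bank."""
    return (addr >> PAGE_BITS) & BANK_MASK


if __name__ == "__main__":
    for addr in range(8000, 8004):
        print(addr, bank_low_order(addr), bank_high_order(addr))
```

With low order selection, hardware spreads adjacent addresses across the banks. With high order selection, every address in a page selects the same bank, so any distribution among banks depends on where software places each page.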
In a distributed memory, multi-processor computer system, the memory is distributed throughout the computer system, and is not located in one finite location. In particular, one technique for implementing such a system is to associate memory with each processor in the computer system. Each of the processors within the system may be capable of accessing the memory associated with any other processor by properly transmitting a command coupled with the desired memory address to the appropriate memory location. Identifying an address within any particular memory location requires selecting the processor associated with the memory.
Because memory is distributed throughout the computer system, and multiple processors exist that may each simultaneously seek to access the same memory device or even the same memory data, special steps must be implemented to ensure coherency of the data, while still maximizing the speed of memory accesses to minimize system latency. In an attempt to reduce latency (or "waiting") caused by coincident accesses to the same memory location, memory may be distributed within a particular processor sub-system by including multiple memory ports supporting separate memory banks. This adds yet another level of detail that must be identified in the address coding scheme. Thus, in addition to the processor identification, the address command must also identify the memory bank and the memory offset for that particular memory bank.
The conventional technique for addressing memory in a distributed memory computer system is to have the operating system assign continuous address references to contiguous locations on the same processor. Thus, typically the high order bits in the address define the processor, and the lower order bits define the offset in the memory associated with that processor. Thus, as the operating system increments through the address space, the processor being accessed does not change, as the lower order address bits are incremented. Thus, incrementing address space means that the data transactions occur locally on a given processor. Such a situation may be advantageous if the local processor is the source of the data transactions because it reduces the latency of the memory transactions by avoiding the necessity of transmitting commands to another processor to obtain the requested data. In other instances, however, this addressing scheme may be unfavorable. If, for example, multiple processors are referencing the same contiguous piece of memory associated with a different processor, a bottleneck may occur as each requesting processor tries to simultaneously communicate with the processor that controls the targeted memory.
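The conventional layout just described, with the processor in the high order bits and the offset in the low order bits, can be sketched as follows. The 34-bit offset width is an assumption chosen to match 16 GB of memory per processor, and the function names are hypothetical.

```python
OFFSET_BITS = 34                        # assumption: 16 GB of memory per processor
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def home_processor(addr: int) -> int:
    """High order bits select the processor whose memory holds the address."""
    return addr >> OFFSET_BITS


def local_offset(addr: int) -> int:
    """Low order bits give the offset within that processor's memory."""
    return addr & OFFSET_MASK


if __name__ == "__main__":
    # Consecutive addresses stay on one processor until the offset field wraps.
    for addr in (0x8001, 0x8002, OFFSET_MASK, OFFSET_MASK + 1):
        print(hex(addr), home_processor(addr), hex(local_offset(addr)))
```

Under this layout an incrementing address stream keeps targeting the same processor, which is the locality benefit and also the source of the bottleneck noted above.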
Because the processor identification occurs in the high order bits of the address signal, typically the interleaving of data among processors is performed through software. Thus, in high order interleaving systems that are used with multiple processing systems, the task of distributing addresses is made at a page granularity level by the system software when it determines the virtual-to-physical page translation. Such software implementations, however, require involvement of the processor, and thus may act as a drag on system performance. In addition, simultaneous software interleaving can be very expensive since it requires many operations to convert addresses to a canonical form necessary for the hardware. Software interleaving also can be difficult to implement, and may require additional clock cycles for each memory transaction performed. It would be advantageous to develop a hardware address scheme that permits simultaneous interleaving without the attendant problems caused by software interleaving.
The problems noted above are solved in large part by the system and techniques of the present invention, which permit multiple different address interleavings to be active simultaneously. In particular, unstriped addresses are used to interleave across processors using high order address bits. This allows instructions to be copied locally to all processors in the system, ensuring that all instructions are transmitted with low latency. Striped addresses, conversely, interleave across four processor sets at the low order, and the rest of the processors at high order. This makes a group of four processors the striped local set, with data references distributed to all memory ports of the four processor set. The striping of addresses within a four processor set reduces bottlenecks that may occur when other processors request data associated with memory of a different processor. The simultaneous use of striped and unstriped addresses can improve system performance, without the attendant deficiencies of software implemented systems.
The interleave scheme implemented in the preferred embodiment of the present invention uses an address bit to distinguish between two different types of address interleaving: striped and unstriped. Preferably each processor includes two memory ports, with an entire cache block assigned to a single memory port. In both striped and unstriped interleavings, the lowest order address bits (0-5) indicate the cache alignment, and address bit 6 indicates the port within a processor. The unstriped interleave identifies the cache block within a port in address bits 7-33, and the lower two processor bits in address bits 34 and 35. The striped interleave has the lower two processor bits in address bits 7 and 8, and the cache block in address bits 37-43 (for a system with up to 256 processors, each of which can have 16 GB of memory distributed across 2 ports).
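The effect of the two interleavings on the lower two processor bits can be illustrated with a minimal sketch. It decodes only the alignment field (bits 0-5), the port bit (bit 6), the stripe bit (bit 36), and the lower two processor bits. The polarity of the stripe bit (1 meaning striped) and all function and key names are assumptions made for illustration.

```python
def field(addr: int, lo: int, hi: int) -> int:
    """Return bits lo..hi (inclusive) of addr as an integer."""
    return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(addr: int) -> dict:
    """Decode the fields shared by both interleavings plus the low processor bits."""
    striped = field(addr, 36, 36) == 1          # assumption: 1 selects striped
    return {
        "alignment": field(addr, 0, 5),         # byte position within a cache block
        "port": field(addr, 6, 6),              # memory port within the processor
        "striped": striped,
        # Lower two processor bits: bits 7-8 when striped, bits 34-35 when unstriped.
        "cpu_low2": field(addr, 7, 8) if striped else field(addr, 34, 35),
    }


if __name__ == "__main__":
    stripe = 1 << 36
    for k in range(4):
        a = k * 128                             # step past the alignment and port bits
        print(k, decode(a)["cpu_low2"], decode(stripe | a)["cpu_low2"])
```

In this sketch, stepping a striped address by 128 bytes moves to the next processor in the four processor set, while the same step on an unstriped address stays on the same processor.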
In accordance with the preferred embodiment, the present invention is implemented in hardware. In response to a memory access that results in a cache miss, the hardware converts the address into a single canonical form which has the port, offset, and processor fields in fixed positions. These address bits are then transferred along with bit 36, which comprises the stripe bit, to the port. The port returns the cache block in response. If necessary, the port may re-convert the address into its original form using the stripe bit if it needs to extract the block from another processor's cache. After conversion to the canonical form, the hardware manages the interleaving uniformly for each case by forwarding the reference to the appropriate memory port.
According to the preferred embodiment, the striped interleave is used for data that is more likely to be accessed by other processors, while the unstriped interleave is used for data that is likely to be accessed only by the local processor.