1. Field of the Invention
This invention relates generally to memory units within a large-scale symmetrical multiprocessor system, and, more specifically, to a high-performance memory having integrated directory and data subsystems that allow for the interleaving of memory requests to a single memory unit.
2. Description of the Prior Art
Data processing systems are becoming increasingly complex. Some systems, such as Symmetric Multi-Processor (SMP) computer systems, couple two or more processors to shared memory. This allows multiple processors to operate simultaneously on the same task, and also allows multiple tasks to be performed at the same time to increase system throughput.
Although multi-processor systems with a shared main memory may allow for increased throughput, substantial design challenges must be overcome before the increased parallel processing capabilities may be leveraged. For example, the various processors in the system must be able to access memory in a timely fashion. Otherwise, the memory becomes a bottleneck, and the processors may spend large amounts of time idle while waiting for memory requests to be processed. This problem becomes greater as the number of processors sharing the same memory increases.
One common method of solving this problem involves providing one or more high-speed cache memories that are more closely-coupled to the processors than the main memory. For example, a cache memory could be coupled to each processor. Information from main memory that is required by a processor during a given task may be temporarily stored within its respective cache so that many requests to memory will be off-loaded. This reduces requests to main memory to a number that is manageable, and allows memory latency to be reduced to acceptable levels.
When multiple cache memories are coupled to a single main memory for the purpose of temporarily storing data signals, some system must be utilized to ensure that all processors are working from the same (most recent) copy of the data. For example, if a copy of a data item is stored, and subsequently modified, in a cache memory, another processor requesting access to the same data item must be prevented from using the older copy of the data item stored either in main memory or the requesting processor's cache. This is referred to as maintaining cache coherency. Maintaining cache coherency becomes more difficult as more caches are added to the system since more copies of a single data item may have to be tracked.
Many methods exist to maintain cache coherency. Some earlier systems achieved coherency by implementing memory locks. That is, if an updated copy of data existed within a local cache, other processors were prohibited from obtaining a copy of the data from main memory until the updated copy was returned to main memory, thereby releasing the lock. For complex systems, the additional hardware and/or operating time required for setting and releasing the locks within main memory becomes too large a burden on throughput to be acceptable. Furthermore, reliance on such locks directly prohibits certain types of applications such as parallel processing.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 4,843,542 issued to Dashiell et al., and in U.S. Pat. No. 4,755,930 issued to Wilson, Jr., et al. These patents discuss a system wherein each processor has a local cache coupled to a shared memory through a common memory bus. Each processor is responsible for monitoring, or "snooping", the common bus to maintain currency of its own cache data. These snooping protocols increase processor overhead, and are unworkable in hierarchical memory configurations that do not have a common bus structure. A similar snooping protocol is shown in U.S. Pat. No. 5,025,365 to Mathur et al., which teaches local caches that monitor a system bus for the occurrence of memory accesses which would invalidate a local copy of data. The Mathur snooping protocol removes some of the overhead associated with snooping by invalidating data within the local caches at times when data accesses are not occurring; however, the Mathur system is still unworkable in memory systems without a common bus structure.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 5,423,016 to Tsuchiya, assigned to the assignee of this invention. The method described in this patent involves providing a memory structure utilizing a "duplicate tag" with each cache memory. The duplicate tags record which data items are stored within the associated cache. When a data item is modified by a processor, an invalidation request is routed to all of the other duplicate tags in the system. The duplicate tags are searched for the address of the referenced data item. If found, the data item is marked as invalid in the other caches. Such an approach is impractical for distributed systems having many caches interconnected in a hierarchical fashion because the time required to route the invalidation requests poses an undue overhead.
For distributed systems having hierarchical memory structures, a directory-based coherency system has been found to have advantages. Directory-based coherency systems utilize a centralized directory to record the location and the status of data as it exists throughout the system. For example, the directory records which caches have a copy of the data, and further records if any of the caches have an updated copy of the data. When a processor makes a request to main memory for a unit of data, the central directory is consulted to determine where the most recent copy of that unit of data resides so that it may be returned to the requesting processor and the older copy may be marked invalid. The central directory is then updated to reflect the new status for that unit of memory. A novel system and method for performing a directory-based coherency protocol in a Symmetrical Multi-Processor (SMP) system is described in the co-pending application entitled "A Directory-Based Cache Coherency System", filed Nov. 5, 1997, Ser. No. 08/965,004, which is incorporated herein by reference in its entirety.
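By way of illustration only, the directory lookup and update described above may be sketched as follows. This is a minimal model and not the protocol of the referenced co-pending application: the names `DirectoryEntry`, `Directory`, `request`, and `MAIN_MEMORY` are hypothetical, and the rule that a previous owner always gives up its copy is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

MAIN_MEMORY = "memory"  # hypothetical marker: the most recent copy is in main memory


@dataclass
class DirectoryEntry:
    """Coherency state for one unit of memory (assumed layout)."""
    sharers: Set[int] = field(default_factory=set)  # caches holding a copy
    owner: Optional[int] = None                     # cache holding an updated copy


class Directory:
    """Centralized directory keyed by memory address."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def request(self, address: int, cache_id: int, exclusive: bool = False):
        """Return (source, invalidated) for a request from cache_id.

        source is MAIN_MEMORY or the id of the cache holding the updated copy;
        invalidated is the set of caches whose copies must be marked invalid.
        """
        # Read phase: fetch the directory state for this address.
        entry = self.entries.setdefault(address, DirectoryEntry())
        if entry.owner == cache_id:
            return cache_id, set()  # requester already holds the updated copy

        source = MAIN_MEMORY if entry.owner is None else entry.owner

        # Modify phase: compute the new state.
        invalidated: Set[int] = set()
        if entry.owner is not None:
            invalidated.add(entry.owner)  # assumption: old owner gives up its copy
            entry.owner = None
        if exclusive:
            invalidated |= entry.sharers - {cache_id}
            entry.sharers = set()
            entry.owner = cache_id
        else:
            entry.sharers.add(cache_id)

        # Write phase: the entry was updated in place, i.e. stored back.
        return source, invalidated
```

In this sketch every request reads, modifies, and stores a directory entry, which is the read-modify-write pattern whose latency is at issue in the discussion that follows.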
Implementing high-speed memory systems that are capable of supporting a directory-based coherency protocol is problematic for several reasons. In general, accessing the central directory involves a read-modify-write operation. That is, generally, directory information is read from the directory, modified to reflect the fact that new status associated with the data item is being delivered to the requesting processor, and is written back to the directory. This read-modify-write operation cannot be completed as fast as the (single) associated data access to memory. Thus, another data access may not be initiated until the associated read-modify-write operation is complete and memory throughput is therefore diminished.
Prior art systems attempted to make this longer directory latency transparent to the overall system operation by implementing the central directory using faster hardware technology. For example, the memory array used to implement the central directory was implemented using faster Static Random Access Memory (SRAM) devices, whereas the memory array used to implement the data storage was designed using slower, but more dense, Dynamic Random Access Memory (DRAM) devices. This creates practical problems. Because SRAM devices are not as dense as DRAMs, a disproportionately large amount of circuit board area is consumed to implement the directory storage. Moreover, SRAMs and DRAMs have different power and other electrical considerations, adding to the complexity associated with designing, placing, and routing an operational printed circuit card. Additionally, two types of RAM devices must be stocked and then handled during the board-build process, making fabrication of the printed circuit card a more difficult and expensive process. Implementing both the directory and data memory arrays using the same logic is practically much more desirable, but would result in a decrease in overall system throughput.
Another problem associated with memory systems capable of supporting directory-based coherency protocols is that such systems tend to under-utilize shared bus resources. For example, during the read phase of a read-modify-write operation to the directory array, an address is driven onto the address bus so that the directory state information may be read by the control logic. After the directory state information is read, and while it is being modified by the control logic, the address, data, and control buses are idle, and bandwidth is essentially wasted. This intermittent pattern of bus usage can result in address and data buses that are idle as much as fifty percent of the time.
It is the primary object of the invention to provide an improved high-speed memory system that supports a directory-based coherency protocol;
It is a further object of the invention to provide an improved high-speed memory system that includes a directory storage facility and an associated data storage facility, wherein the directory storage facility is capable of processing memory requests at a similar rate as that of the data storage facility;
It is still a further object of the invention to provide an improved high-speed memory system that includes a directory storage facility and an associated data storage facility, wherein the directory storage facility utilizes the same hardware technology as an associated data storage facility;
It is yet another object of the invention to provide an improved memory system including a directory storage facility and an associated data storage facility, wherein the memory system is coupled to high-speed data and address buses, and wherein operations to the memory system are interleaved so that the bus idle time is minimized;
It is yet a further object of the invention to provide an improved high-speed memory system which includes a directory storage facility and an associated data storage facility, wherein both the directory storage facility and the data storage facility include multiple banks of memory which may be accessed simultaneously during interleaved operations;
It is another object of the invention to provide an improved high-speed memory system having multiple sub-systems, wherein each sub-system includes a directory storage facility and an associated data storage facility, and wherein operations may be performed substantially simultaneously to multiple ones of the sub-systems during interleaved operations; and
It is still another object of the invention to provide an improved high-speed memory system having multiple sub-systems, wherein each sub-system includes a directory storage facility and an associated data storage facility, and wherein data is stored to, or retrieved from, each of the data storage facilities during multi-transfer operations wherein a single memory operation is completed during multiple transfers over a single interface.
The objectives of the present invention are achieved in a high-speed memory system for use in supporting a directory-based cache coherency protocol. The memory system includes at least one data sub-system for storing data, and a corresponding directory sub-system for storing the corresponding cache coherency information. The memory system may be coupled to multiple processors for accepting read and write memory requests from ones of the multiple processors.
When a processor submits a request for memory access to the memory system, two operations are initiated, one to a data sub-system, and the second to a corresponding directory sub-system. The data sub-system performs a block-mode memory read or write operation across the data sub-system data bus. In the preferred embodiment, each block-mode operation transfers a predetermined number of bytes across the data bus during a number of successive transfers. While the data sub-system is performing the block-mode data transfer, the directory sub-system executes a read-modify-write operation whereby directory information is read from the directory sub-system, modified by a memory controller, and written back to the directory sub-system. Because the data sub-system transfers blocks of data across the data bus during multiple transfer operations, the time required to perform the read-modify-write operation can approximate the time required to complete the data operation.
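The timing relationship described above may be illustrated with a simple cycle-count model. The sketch below is offered under stated assumptions only: the function names and every cycle count are hypothetical and are not taken from this Specification. It shows why a directory read-modify-write that is slower than a single data access can still be hidden behind a multi-transfer block operation.

```python
def data_block_cycles(transfers: int, cycles_per_transfer: int,
                      access_latency: int) -> int:
    """Cycles to complete one block-mode data operation (assumed model)."""
    return access_latency + transfers * cycles_per_transfer


def directory_rmw_cycles(read: int, modify: int, write: int) -> int:
    """Cycles to read, modify, and write back one directory entry."""
    return read + modify + write


def directory_is_hidden(data_cycles: int, rmw_cycles: int) -> bool:
    """True when the directory operation finishes no later than the data operation."""
    return rmw_cycles <= data_cycles


# Hypothetical numbers chosen only to illustrate the comparison.
rmw = directory_rmw_cycles(read=4, modify=3, write=4)
single = data_block_cycles(transfers=1, cycles_per_transfer=1, access_latency=4)
block = data_block_cycles(transfers=8, cycles_per_transfer=1, access_latency=4)

print("single transfer hides directory update:", directory_is_hidden(single, rmw))
print("block transfer hides directory update:", directory_is_hidden(block, rmw))
```

With these assumed values the directory update outlasts a single-transfer data access but completes within the eight-transfer block operation, so the same memory technology can serve both sub-systems in this model.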
To further ensure that directory operations do not significantly limit system throughput, an interleaved memory scheme is utilized whereby a multiple number of read or write operations may be occurring to the data sub-system simultaneously. The associated read-modify-write operations to the directory sub-system are also interleaved. The time required to complete the multiple interleaved operations within both the data and directory sub-systems is approximately equivalent. Therefore, directory operations are made essentially transparent to the overall system throughput without using faster memory devices to implement the directory sub-system. This allows the memory system to be constructed using memory devices which are more dense, so that the overall memory system is more compact. Moreover, the overall memory design is less complex, and is less expensive to design, construct, and test.
Another aspect of the current invention involves an improved management of bus resources. The data sub-system and directory sub-system are designed to share address, data, and control buses. This saves route channels used to route the nets within the printed circuit board. This is especially important in large memory systems requiring numerous control and address signals, such as the one described in this Specification. Moreover, because of the interleaving of memory requests, the shared address bus is not idle a large percentage of the time, as in prior art systems. During the times when the address bus would normally be idle, for example while directory state information for a first memory operation is being modified, another request address is driven onto the address bus to initiate a second memory operation. Then as the second memory operation is being performed, the address associated with the first request is re-driven onto the address bus so that the modified directory state information may be stored in the directory sub-system. Additionally, because data is transferred in blocks, and because memory operations are interleaved so that a first operation is using the data bus while a second operation is initiated within the storage devices, the data bus is also used in a more efficient manner. In sum, the current design allows for dramatically increased system throughput without an increase in the number of interconnecting nets needed to interface with each of the memory sub-systems.
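The address bus usage described above may be sketched as a slot-by-slot schedule. The following is an illustrative model only, assuming that each address drive and each modify phase occupies exactly one bus slot; the function names and the labels "R", "W", and "idle" are hypothetical, and the resulting idle fractions apply to this toy model rather than to any measured system.

```python
from typing import List


def serial_schedule(n_requests: int) -> List[str]:
    """Address bus slots when each read-modify-write runs to completion alone."""
    bus: List[str] = []
    for i in range(n_requests):
        bus.append(f"R{i}")   # drive address to read the directory state
        bus.append("idle")    # bus unused while the state is being modified
        bus.append(f"W{i}")   # re-drive address to store the modified state
    return bus


def interleaved_schedule(n_requests: int) -> List[str]:
    """Address bus slots when requests are interleaved in pairs.

    The second request's address is driven during the first request's
    modify phase, and the first request's address is re-driven during the
    second request's modify phase.
    """
    bus: List[str] = []
    for i in range(0, n_requests, 2):
        if i + 1 < n_requests:
            bus.extend([f"R{i}", f"R{i + 1}", f"W{i}", f"W{i + 1}"])
        else:
            bus.extend([f"R{i}", "idle", f"W{i}"])  # unpaired last request
    return bus


def idle_fraction(bus: List[str]) -> float:
    """Fraction of bus slots in which no address is driven."""
    return bus.count("idle") / len(bus) if bus else 0.0


print(serial_schedule(2), idle_fraction(serial_schedule(2)))
print(interleaved_schedule(2), idle_fraction(interleaved_schedule(2)))
```

In this model two requests need six slots when handled serially and four when interleaved, with no added address lines, which mirrors the point that throughput improves without additional interconnecting nets.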
Finally, the memory system of the current invention is a modular design that is readily expandable. In the preferred embodiment, the data and directory sub-systems are each located within separate Dual In-line Memory Modules (DIMMs) that are received by two sockets on a daughter board that constitutes a Main Storage Unit (MSU) Expansion. Each MSU Expansion is a Field Replaceable Unit (FRU) which may be easily replaced should memory errors be detected. In the preferred embodiment, each DIMM may include between 64 MegaBytes (MBytes) and 256 MBytes of storage, so that each MSU Expansion may be populated with between 128 MBytes and 512 MBytes. Furthermore, the memory system may be incrementally expanded to include additional MSU Expansions as the memory requirements of the host system grow.
Still other objects and advantages of the present invention will become readily apparent to those skilled in the art from the following detailed description of the preferred embodiment and the drawings, wherein only the preferred embodiment of the invention is shown, simply by way of illustration of the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various respects, all without departing from the invention. Accordingly, the drawings and description are to be regarded to the extent of applicable law as illustrative in nature and not as restrictive.