1. Field of the Invention
This invention relates generally to an improved system and method for maintaining cache coherency within a hierarchical memory system shared between multiple processors; and more particularly, relates to a system that provides by-pass interfaces to provide for the direct exchange of data between cache memories existing at the lower hierarchical levels in the hierarchical memory system in a manner that maintains memory coherency.
2. Description of the Prior Art
Data processing systems are becoming increasingly complex. Some systems, such as Symmetric Multi-Processor (SMP) computer systems, couple two or more Instruction Processors (IPs) and multiple Input/Output (I/O) Modules to shared memory. This allows the multiple IPs to operate simultaneously on the same task, and also allows multiple tasks to be performed at the same time to increase system throughput.
As the number of units coupled to a shared memory increases, more demands are placed on the memory and memory latency increases. To address this problem, high-speed cache memory systems are often coupled to one or more of the IPs for storing data signals that are copied from main memory. These cache memories are generally capable of processing requests faster than the main memory while also serving to reduce the number of requests that the main memory must handle. This increases system throughput.
While the use of cache memories increases system throughput, it causes other design challenges. When multiple cache memories are coupled to a single main memory for the purpose of temporarily storing data signals, some system must be utilized to ensure that all IPs and I/O Modules are working from the same (most recent) copy of the data. For example, if a copy of a data item is stored, and subsequently modified, in a cache memory, another IP requesting access to the same data item must be prevented from using the older copy of the data item stored either in main memory or the requesting IP’s cache. This is referred to as maintaining cache coherency. Maintaining cache coherency becomes more difficult as more caches are added to the system since more copies of a single data item may have to be tracked.
Many methods exist to maintain cache coherency. Some earlier systems achieve coherency by implementing memory locks. That is, if an updated copy of data exists within a local cache, other processors are prohibited from obtaining a copy of the data from main memory until the updated copy is returned to main memory, thereby releasing the lock. For complex systems, the additional hardware and/or operating time required for setting and releasing the locks within main memory cannot be justified. Furthermore, reliance on such locks directly prohibits certain types of applications such as parallel processing.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 4,843,542 issued to Dashiell et al., and in U.S. Pat. No. 4,755,930 issued to Wilson, Jr. et al. These patents discuss a system wherein each processor has a local cache coupled to a shared memory through a common memory bus. Each processor is responsible for monitoring, or “snooping”, the common bus to maintain currency of its own cache data. These snooping protocols increase processor overhead, and are unworkable in hierarchical memory configurations that do not have a common bus structure. A similar snooping protocol is shown in U.S. Pat. No. 5,025,365 to Mathur et al., which teaches a snooping protocol that seeks to minimize snooping overhead by invalidating data within the local caches at times when other types of cache operations are not occurring. However, the Mathur system cannot be implemented in memory systems that do not have a common bus structure.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 5,423,016 to Tsuchiya, which is assigned to the assignee of the current invention. The method described in this patent involves providing a memory structure called a “duplicate tag” that is associated with each cache memory. Each duplicate tag records which data items are stored within the associated cache. When a data item is modified by a processor, an invalidation request is routed to all of the other duplicate tags in the system. The duplicate tags are searched for the address of the referenced data item. If found, the data item is marked as invalid in the other caches. Such an approach is impractical for distributed systems having many caches interconnected in a hierarchical fashion because the time required to route the invalidation requests poses an undue overhead.
For distributed systems having hierarchical memory structures, a directory-based coherency system becomes more practical. Directory-based coherency systems utilize a centralized directory to record the location and the status of data as it exists throughout the system. For example, the directory records which caches have a copy of the data, and further records whether any of the resident copies have been updated. When a cache makes a request to main memory for a data item, the central directory is consulted to determine where the most recent copy of that data item resides. Based on this information, the most recent copy of the data is retrieved so that it may be provided to the requesting cache. The central directory is then updated to reflect the new status for that unit of memory. A novel directory-based cache coherency system for use with multiple Instruction Processors coupled to a hierarchical cache structure is described in the copending application entitled “Directory-Based Cache Coherency System Supporting Multiple Instruction Processor and Input/Output Caches” referenced above and which is incorporated herein by reference in its entirety.
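By way of illustration only, the directory lookup and update described above might be sketched as follows. This is a minimal sketch and not the implementation of the referenced application; the class names, the `MAIN_MEMORY` marker, and the assumption that a prior exclusive owner gives up its copy when a shared copy is recorded are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectoryEntry:
    # Hypothetical per-address state; field names are illustrative only.
    sharers: set = field(default_factory=set)  # caches holding read-only copies
    owner: Optional[str] = None                # cache holding an exclusive copy


class Directory:
    """Centralized record of where copies of each data item reside."""

    def __init__(self):
        self.entries = {}

    def _entry(self, address):
        return self.entries.setdefault(address, DirectoryEntry())

    def locate_latest(self, address):
        # An exclusive owner may have updated the data, so it holds the
        # most recent copy; otherwise main memory is treated as current.
        entry = self._entry(address)
        return entry.owner if entry.owner is not None else "MAIN_MEMORY"

    def record_shared(self, address, cache_id):
        # Assumption: any previous exclusive owner no longer owns the data.
        entry = self._entry(address)
        entry.owner = None
        entry.sharers.add(cache_id)

    def record_exclusive(self, address, cache_id):
        # Only one exclusive copy may exist, so prior sharers are dropped.
        entry = self._entry(address)
        entry.sharers.clear()
        entry.owner = cache_id
```

In this sketch, a request first calls `locate_latest` to find the most recent copy, and the directory is then updated with `record_shared` or `record_exclusive` once the data has been provided to the requester.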
The use of the afore-mentioned directory-based cache coherency system provides an efficient mechanism for sharing data between multiple processors that are coupled to a distributed, hierarchical memory structure. Using such a system, the memory structure may be incrementally expanded to include any multiple levels of cache memory while still maintaining the coherency of the shared data. As the number of levels of hierarchy in the memory system is increased, however, some efficiency is lost when data requested by one cache memory in the system must be retrieved from another cache.
As an example of performance degradation associated with memory requests in a hierarchical cache memory system, consider a system having a main memory coupled to three hierarchical levels of cache memory. In the exemplary system, multiple third-level caches are coupled to the main memory, multiple second-level caches are coupled to each third-level cache, and at least one first-level cache is coupled to each second-level cache. This exemplary system includes a non-inclusive caching scheme. This means that not all data stored in a first-level cache is necessarily stored in the interconnected second-level cache, and not all data stored in a second-level cache is necessarily stored in the interconnected third-level cache.
Within the above-described system, one or more processors are respectively coupled to make memory requests to an associated first-level cache. Requests for data items not resident in the first-level cache are forwarded to the intercoupled second-level, and in some cases, the third-level caches. If neither of the intercoupled second or third level caches stores the requested data, the request is forwarded to main memory.
Assume that in the current example, a processor makes a request to the intercoupled first-level cache for a read-only copy of specified data. Assume further that the requested data is not stored in this first-level cache. However, another first-level cache within the system stores a read-only copy of the data. Since the copy of the data is read-only, the request can be completed without involving the other first-level cache. That is, the request may be processed by one of the inter-connected second or third-level caches, or if neither of these caches has a copy of the data, by the main memory.
In addition to requests for read-only copies of data, requests may be made to obtain “exclusive” copies of data that can be updated by the requesting processor. In these situations, the cache line data will be provided to the requesting cache, and any previously cached copies of the data will be marked as invalid. That is, in this instance, copies of the data may not be shared among multiple caches. This is necessary so that there is only one “most-current” copy of the data existing in the system and no processor is working from outdated data. Returning to the current example, assume the request to the first-level cache is for an exclusive copy of data. This request must be passed via the cache hierarchy to the main memory. The main memory forwards this request back down the hierarchical memory structure to the first-level cache that stores the requested data. If this first-level cache stores a shared copy of the cache line, or alternatively stores an exclusive copy that has not been modified, then this first-level cache must invalidate the stored copy of the data, indicating that this copy may no longer be used. If this first-level cache stores an exclusive copy of the data, and has further modified the data, the modified data is passed back to the main memory to be stored in the main memory and to be forwarded on to the requesting first-level cache. In this manner, the requesting cache is provided with an exclusive copy of the most recent data.
The steps outlined above with respect to the exclusive data request are similar to those that must be executed if a read-only copy of the data is requested when a copy of the requested data resides exclusively in another cache. The previous exclusive owner must forward a copy of the updated data to main memory to be returned to the requester.
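For illustration, the handling of an exclusive data request by the cache that currently stores the data might be sketched as follows. This is a simplified sketch under stated assumptions: the function name, the dictionary used to represent a cache line, and the modelling of main memory as a dictionary keyed by address are hypothetical, and the intermediate cache levels that the request traverses are omitted.

```python
def handle_ownership_request(target_line, main_memory, address):
    """Process a main memory request directed to the cache holding the data.

    target_line is a hypothetical record with keys 'state' ('shared' or
    'exclusive'), 'modified' (bool), and 'data'. Returns the data that main
    memory forwards to the requesting cache.
    """
    if target_line["state"] == "exclusive" and target_line["modified"]:
        # Modified data is passed back to main memory to be stored there.
        main_memory[address] = target_line["data"]
    # In every case the previously cached copy may no longer be used.
    target_line["state"] = "invalid"
    target_line["modified"] = False
    # Main memory now holds the most recent copy and forwards it on.
    return main_memory[address]
```

Note that in this sketch the data always travels through main memory before reaching the requester, which is the long return path discussed in this example.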
As may be seen from the current example, in a hierarchical memory system having multiple levels of cache that are not all interconnected by a common bus structure, obtaining an exclusive copy of data that can be utilized by a processor for update purposes may be time-consuming. As the number of these so-called “ownership” requests for obtaining an exclusively “owned” data copy increases within the system, throughput may decrease. This is especially true as additional levels of hierarchy are included in the memory structure.
One mechanism for increasing throughput involves providing a high-speed data return path within the main memory. When data is returned from a previous owner, the high-speed interface forwards the data directly to the requester without the need to perform any type of main memory access. A high-speed interface of this type can be used to route both modified and unmodified data between the various units in the system. Such a system is described in U.S. Pat. No. 6,167,489 to Bauman et al. entitled “System and Method for By-Passing Supervisory Memory Intervention for Data Transfers Between Devices Having Local Memories”, issued Dec. 26, 2000, and which is referenced above. While this type of interface decreases the time required to complete the data return operation, data must nevertheless be provided to the main memory in all cases before the data can be forwarded to the requesting processor. This unnecessarily increases traffic on interfaces between main memory and other cache memories. Additionally, some latency is still imposed by the length of the data return path, which extends from the lowest levels of memory hierarchy, to main memory, and back to the lowest memory levels. What is needed, therefore, is a system that minimizes the time required to return data to a requesting processor coupled to the hierarchical memory system by shortening the data return path and by reducing request traffic on the main memory interfaces.
3. Objects
The primary object of the invention is to provide an improved shared memory system for a multiprocessor data processing system;
Another object is to provide a hierarchical memory including a main memory coupled to multiple cache memories and further including at least one data return path to provide data between respectively coupled cache memories without intervention of main memory;
A yet further object is to provide data routing logic at multiple levels in a hierarchical memory system for routing data between memories residing within predetermined levels in the memory system and without intervention of a main memory controller;
A still further object is to reduce data traffic on the main memory interfaces of a hierarchical memory system that includes multiple levels of cache memory;
A yet further object is to provide a by-pass data path system for a modular, expandable memory;
Another object is to provide an improved method of transferring shared, read-only copies of data signals from one cache memory to another in a hierarchical memory system in which the cache memories are intercoupled via a directory-based main memory;
Another object is to provide an improved method of transferring exclusive read/write data copies from one cache memory to another in a hierarchical memory system in which the cache memories are intercoupled via a directory-based main memory;
A still further object is to provide an improved system for maintaining cache coherency within a main memory coupled to multiple cache memories; and
A further object is to provide a hierarchical, directory-based shared memory system having improved response times.
The objectives of the present invention are achieved in a hierarchical, multi-level, memory system that provides by-pass paths between storage devices located at predetermined levels within the memory hierarchy. The hierarchical memory system of the preferred embodiment includes a main memory coupled to multiple first storage devices, wherein ones of the first storage devices are third-level cache memories, and other ones of the storage devices are Input/Output (I/O) Buffers. These first storage devices each store addressable portions of data signals retrieved from the main memory. A directory-based coherency scheme is employed to ensure that the memory system stores a single, most recent copy of all data signals. According to this scheme, a directory associated with the main memory records the location of the latest copy of any of the data signals stored in the memory system. When a request issued by one of the storage devices is received by the main memory, the directory is consulted to determine which storage device stores the most recent copy of the requested data signals. In some instances, the main memory issues a request to retrieve this latest copy of the data signals from another target storage device in the system so the data can be forwarded by the main memory to the original requester.
To facilitate a more efficient transfer of data between the various storage devices in the memory system, the system includes at least one by-pass interface coupling associated ones of the first storage devices. Data retrieved from a target one of the first storage devices in response to a main memory request can be routed to a different requesting one of the first storage devices via the by-pass system without requiring the use of the main memory data interfaces. The by-pass system includes a control mechanism that performs the routing function based on the identity of the original requester. That is, the request from main memory to the target one of the first storage devices includes the identity of the storage device that issued the original request. If data is returned from the target storage device, a by-pass operation is enabled if the identified requester is one of the storage devices associated with the by-pass interface. According to one embodiment of the invention, the requested data signals are also provided by the target storage device to the main memory only if these data signals comprise an updated copy of the data stored in main memory. This reduces traffic on the main memory interfaces while allowing the main memory to retain an updated data copy. The by-pass system also provides an indication of the occurrence of any by-pass transfer operations to the main memory so that the directory can be updated to reflect the new location of any addressable portion of the data signals.
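The routing decision described above might be sketched as follows, for illustration only. The function name, the destination labels, and the representation of the storage devices associated with a by-pass interface as a set of identifiers are hypothetical; the sketch follows the embodiment in which data is provided to main memory only when it has been updated.

```python
def route_returned_data(requester_id, bypass_group, data_modified):
    """Decide where data returned by a target storage device is sent.

    requester_id:  identity of the original requester, carried in the
                   request from main memory to the target device.
    bypass_group:  identifiers of the storage devices associated with
                   this by-pass interface (hypothetical representation).
    data_modified: True if the returned data is an updated copy.
    """
    destinations = set()
    if requester_id in bypass_group:
        # By-pass operation: data goes directly to the requester.
        destinations.add(requester_id)
        # Main memory is told of the transfer so the directory is updated.
        destinations.add("DIRECTORY_INDICATION")
        if data_modified:
            # Updated data is also provided so main memory stays current.
            destinations.add("MAIN_MEMORY_DATA")
    else:
        # No by-pass path to this requester: return data via main memory.
        destinations.add("MAIN_MEMORY_DATA")
    return destinations
```

In this sketch, an unmodified copy returned to an associated requester generates no data traffic on the main memory interface, only the indication used to update the directory.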
According to another aspect of the hierarchical memory system, ones of the first storage devices are each coupled to respective second storage devices. In the preferred embodiment, these second storage devices are each second-level cache memories. Each of the second storage devices stores data signals retrieved from the coupled first storage device. Requests to retrieve data signals may be provided by a second storage device to a respectively coupled first storage device to be forwarded for processing to main memory. In a manner similar to that discussed above, the main memory may be required to retrieve the latest copy of the requested data signals from a different one of the storage devices in the system, including possibly one of the second storage devices, before the request can be completed.
To make the return of data signals between the second storage devices more efficient, the by-pass system includes at least one interface that allows for the transfer of data directly between predetermined first and second ones of the second storage devices. These by-pass operations are performed in a manner that is similar to that discussed above. The by-pass interfaces thereby significantly reduce the length of the return path during the transfer of data from a target to a requesting storage device. According to one embodiment, data signals are only returned to the main memory when these signals have been modified to reduce traffic on the interfaces in the manner discussed above. In all instances, an indication is provided to the main memory of any by-pass operations so that the directory status may be updated.
The system of the preferred embodiment includes multiple by-pass interfaces coupling respectively associated ones of the second storage devices, and other multiple by-pass interfaces interfacing respectively associated ones of the first storage devices. According to one aspect of the system, the circuits to identify the respectively associated ones of the storage devices are programmable.
According to yet another aspect of the system, the main memory is modular. The by-pass system is adapted to receive data requests from each of the main memory modules for use in generating by-pass responses. The by-pass system is further adapted to route updated data returned from a target storage device to an addressed one of the main memory modules.
The by-pass system includes logic that is capable of transferring various predetermined types of data copies between the storage devices of the hierarchical memory depending on the type of the original data request. In some instances, the by-pass system transfers shared, read-only data copies, whereas in other instances, an exclusive read-write copy is provided. In all situations, an indication of the type and location of each portion of the transferred data signals is provided to the directory so that data coherency is maintained.
One embodiment of the by-pass system allows by-pass responses to be generated before it is known whether a data by-pass operation may be completed. In this embodiment, a by-pass response is generated by the by-pass system upon receipt of a request from the main memory to retrieve specified data from a predetermined target storage device, and before it is known whether the requested data is available within that target storage device. If the data signals can be made available by the target storage device, the pre-generated response is available to be appended to the data signals for immediate routing to the requesting unit. In some instances, the target storage device may not store the requested data signals. This occurs when the target storage device writes the requested data back to main memory after the main memory issued the request to retrieve this data but before the request is received by the target device. In these instances, the pre-generated by-pass response is discarded, and the original request must be processed by main memory instead of by the by-pass system.
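The pre-generation of by-pass responses described above might be sketched as follows. This is an illustrative sketch only; the class name, the method names, and the use of `None` to indicate that the target storage device no longer stores the requested data are assumptions and are not taken from the preferred embodiment.

```python
class BypassResponseLogic:
    """Holds by-pass responses generated before data availability is known."""

    def __init__(self):
        self.pending = {}  # address -> pre-generated response

    def on_main_memory_request(self, address, requester_id):
        # Generate the response as soon as the request to the target is
        # seen, before it is known whether the target holds the data.
        self.pending[address] = {"to": requester_id, "address": address}

    def on_target_reply(self, address, data):
        response = self.pending.pop(address)
        if data is None:
            # The target already wrote the data back to main memory, so the
            # pre-generated response is discarded and main memory must
            # process the original request.
            return None
        # Append the data to the response for immediate routing.
        response["data"] = data
        return response
```

A `None` result in this sketch corresponds to the case in which the by-pass operation cannot be completed and the request falls back to main memory.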
Still other objects and advantages of the present invention will become readily apparent to those skilled in the art from the following detailed description of the preferred embodiment and the drawings, wherein only the preferred embodiment of the invention is shown, simply by way of illustration of the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various respects, all without departing from the invention. Accordingly, the drawings and description are to be regarded to the extent of applicable law as illustrative in nature and not as restrictive.