1. Field of the Invention
This invention relates generally to an improved system and method for maintaining cache coherency in a data processing system in which multiple processors are coupled to a directory-based, hierarchical shared memory; and more particularly, relates to a system that allows one or more of the processors to each have multiple ownership requests simultaneously pending to the shared memory, wherein each of the ownership requests is a request to gain exclusive access to a requested, addressable portion of the memory.
2. Description of the Prior Art
Data processing systems are becoming increasingly complex. Some systems, such as Symmetric Multi-Processor (SMP) computer systems, couple two or more Instruction Processors (IPs) and multiple Input/Output (I/O) Modules to shared memory. This allows the multiple IPs to operate simultaneously on the same task, and also allows multiple tasks to be performed at the same time to increase system throughput.
As the number of units coupled to a shared memory increases, more demands are placed on the memory and memory latency increases. To address this problem, high speed cache memory systems are often coupled to one or more of the IPs for storing data signals that are copied from main memory. These cache memories are generally capable of processing requests faster than the main memory while also serving to reduce the number of requests that the main memory must handle. This increases system throughput.
While the use of cache memories increases system throughput, it causes other design challenges. When multiple cache memories are coupled to a single main memory for the purpose of temporarily storing data signals, some system must be utilized to ensure that all IPs and I/O Modules are working from the same (most recent) copy of the data. For example, if a copy of a data item is stored, and subsequently modified, in a cache memory, another IP requesting access to the same data item must be prevented from using the older copy of the data item stored either in main memory or the requesting IP's cache. This is referred to as maintaining cache coherency. Maintaining cache coherency becomes more difficult as more caches are added to the system since more copies of a single data item may have to be tracked.
Many methods exist to maintain cache coherency. Some earlier systems achieve coherency by implementing memory locks. That is, if an updated copy of data exists within a local cache, other processors are prohibited from obtaining a copy of the data from main memory until the updated copy is returned to main memory, thereby releasing the lock. For complex systems, the additional hardware and/or operating time required for setting and releasing the locks within main memory cannot be justified. Furthermore, reliance on such locks directly prohibits certain types of applications such as parallel processing.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 4,843,542 issued to Dashiell et al., and in U.S. Pat. No. 4,755,930 issued to Wilson, Jr. et al. These patents discuss a system wherein each processor has a local cache coupled to a shared memory through a common memory bus. Each processor is responsible for monitoring, or "snooping", the common bus to maintain currency of its own cache data. These snooping protocols increase processor overhead, and are unworkable in hierarchical memory configurations that do not have a common bus structure. A similar snooping protocol is shown in U.S. Pat. No. 5,025,365 to Mathur et al., which teaches local caches that monitor a system bus for the occurrence of memory accesses which would invalidate a local copy of data. The Mathur snooping protocol removes some of the overhead associated with snooping by invalidating data within the local caches at times when data accesses are not occurring; however, the Mathur system is still unworkable in memory systems without a common bus structure.
Another method of maintaining cache coherency is shown in U.S. Pat. No. 5,423,016 to Tsuchiya. The method described in this patent involves providing a memory structure called a "duplicate tag" with each cache memory. The duplicate tags record which data items are stored within the associated cache. When a data item is modified by a processor, an invalidation request is routed to all of the other duplicate tags in the system. The duplicate tags are searched for the address of the referenced data item. If found, the data item is marked as invalid in the other caches. Such an approach is impractical for distributed systems having many caches interconnected in a hierarchical fashion because the time required to route the invalidation requests poses an undue overhead.
For distributed systems having hierarchical memory structures, a directory-based coherency system becomes more practical. Directory-based coherency systems utilize a centralized directory to record the location and the status of data as it exists throughout the system. For example, the directory records which caches have a copy of the data, and further records if any of the caches have an updated copy of the data. When a cache makes a request to main memory for a data item, the central directory is consulted to determine where the most recent copy of that data item resides. Based on this information, the most recent copy of the data is retrieved so that it may be provided to the requesting cache. The central directory is then updated to reflect the new status for that unit of memory. A novel directory-based cache coherency system for use with multiple Instruction Processors coupled to a hierarchical cache structure is described in the co-pending application entitled "Directory-Based Cache Coherency System Supporting Multiple Instruction Processor and Input/Output Caches" referenced above and which is incorporated herein by reference in its entirety.
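The directory lookup described above may be illustrated with a minimal sketch. The sketch is not taken from the referenced application; the names `DirectoryEntry`, `Directory`, `record_read`, and `MAIN_MEMORY` are hypothetical, and the policy of leaving the previous holder of an updated copy with a shared copy after a read is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

MAIN_MEMORY = "main_memory"  # hypothetical label for the home location


@dataclass
class DirectoryEntry:
    """Status recorded for one addressable unit of memory (assumed layout)."""
    sharers: Set[str] = field(default_factory=set)  # caches holding a copy
    owner: Optional[str] = None  # cache holding an updated copy, if any


class Directory:
    """Centralized record of where each unit of memory currently resides."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def locate_latest(self, address: int) -> str:
        """Return the location of the most recent copy of the unit."""
        entry = self.entries.setdefault(address, DirectoryEntry())
        return entry.owner if entry.owner is not None else MAIN_MEMORY

    def record_read(self, address: int, cache: str) -> str:
        """Serve a read-only request and update the directory status."""
        source = self.locate_latest(address)
        entry = self.entries[address]
        if entry.owner is not None:
            # Assumption: the updated copy is returned to main memory and
            # its previous holder is left with a shared copy.
            entry.sharers.add(entry.owner)
            entry.owner = None
        entry.sharers.add(cache)
        return source
```

In this sketch a read of a unit that no cache has updated is served from main memory, whereas a read of a unit with a recorded owner is served from that owner, after which the entry reverts to a shared status.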
The use of the afore-mentioned directory-based cache coherency system provides an efficient mechanism for sharing data between multiple processors that are coupled to a distributed, hierarchical memory structure. Using such a system, the memory structure may be incrementally expanded to include any multiple levels of cache memory while still maintaining the coherency of the shared data. As the number of levels of hierarchy in the memory system is increased, however, some efficiency is lost when data requested by one cache memory in the system must be retrieved from another cache.
As an example of performance degradation associated with memory requests in a hierarchical cache memory system, consider a system having a main memory coupled to three hierarchical levels of cache memory. In the exemplary system, multiple third-level caches are coupled to the main memory, multiple second-level caches are coupled to each third-level cache, and at least one first-level cache is coupled to each second-level cache. This exemplary system includes a non-inclusive caching scheme. This means that data stored in a first-level cache is not necessarily also stored in the inter-connected second-level cache, and data stored in a second-level cache is not necessarily also stored in the inter-connected third-level cache.
Within the above-described system, one or more processors are respectively coupled to make memory requests to an associated first-level cache. Requests for data items not resident in the first-level cache are forwarded on to the inter-coupled second-level, and in some cases, the third-level caches. If neither of the intercoupled second or third level caches stores the requested data, the request is forwarded to main memory.
Within the current exemplary system, assume a processor makes a request for data to the intercoupled first-level cache. The requested data is not stored in this first-level cache, but instead is stored in a different first-level cache within the system. If this request involves obtaining access to a read-only copy of the data, and the first-level cache that stores the data is storing a read-only copy, the request can be completed without involving the first-level cache that currently stores a copy of the data. That is, the request may be processed by one of the inter-connected second or third-level caches, or by the main memory, depending on which one or more of the memory structures has a copy of the data.
In addition to read requests, other types of requests may be made to obtain "exclusive" copies of data that can be updated by the requesting processor. In these situations, any previously cached copies of the data must be marked as invalid before the request can be granted to the requesting cache. That is, in these situations, copies of the data may not be shared among multiple caches. This is necessary so that there is only one "most-current" copy of the data existing in the system and no processor is working from outdated data. Returning to the current example, assume the request from the first-level cache is for an exclusive copy of data. This request must be passed via the cache hierarchy to the main memory. The main memory forwards this request back down the hierarchical memory structure to the first-level cache that stores the requested data. This first-level cache must invalidate its stored copy of the data, indicating that this copy may no longer be used. If necessary, modified data is passed back to the main memory to be stored in the main memory and to be forwarded on to the requesting first-level cache. In this manner, the requesting cache is provided with an exclusive copy of the most current data.
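The sequence of actions taken by main memory for an exclusive request may be sketched as follows. This is an illustrative model only; the function name `grant_ownership`, the message labels, and the use of two dictionaries to hold directory state are hypothetical and are not drawn from the described system.

```python
from typing import Dict, List, Optional, Set, Tuple


def grant_ownership(
    sharers: Dict[int, Set[str]],
    owner: Dict[int, Optional[str]],
    address: int,
    requester: str,
) -> List[Tuple[str, str]]:
    """Grant `requester` an exclusive copy of `address`.

    Returns the (action, cache) messages that main memory would send,
    in order. All names and message labels are hypothetical.
    """
    messages: List[Tuple[str, str]] = []
    previous_owner = owner.get(address)
    if previous_owner is not None and previous_owner != requester:
        # Modified data must be returned before it can be forwarded on.
        messages.append(("return_and_invalidate", previous_owner))
    for cache in sorted(sharers.get(address, set()) - {requester}):
        # Every other cached copy is marked invalid before the grant.
        messages.append(("invalidate", cache))
    # After the grant, only the requester holds a usable copy.
    sharers[address] = set()
    owner[address] = requester
    messages.append(("grant_exclusive", requester))
    return messages
```

Each message in the returned list corresponds to a traversal of the cache hierarchy, which is why the latency of an exclusive request grows with the number of hierarchy levels.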
As may be seen from the current example, in a hierarchical memory system having multiple levels of cache that are not all interconnected by a common bus structure, obtaining an exclusive copy of data that can be utilized by a processor for update purposes may be time-consuming. As the number of these so-called "ownership" requests for obtaining an exclusively "owned" data copy increases, throughput may decrease. This is especially true if additional levels of hierarchy are included in the memory structure. What is needed, therefore, is a system that minimizes the impact on processing throughput that is associated with making ownership requests within a hierarchical, directory-based memory system.
The primary object of the invention is to provide an improved shared memory system for a multiprocessor data processing system;
A further object is to provide a hierarchical, directory-based shared memory system having improved response times;
A yet further object is to provide a memory system allowing multiple ownership requests to be pending to main memory from a single processor at once;
Yet another object is to provide a memory system that allows multiple ownership requests to be pending from all processors in the system simultaneously;
A still further object is to provide a memory system that allows an instruction processor to continue processing instructions while multiple ownership requests are pending to main memory;
Another object is to provide a memory system that allows multiple memory write requests that were issued by the same instruction processor to be processed simultaneously by the memory while additional write requests are queued for processing by the instruction processor;
A yet further object is to provide a memory system allowing a subsequently-issued memory read request to by-pass all pending write requests that were issued by the same processor, and to thereby allow the read request to complete without being delayed by ownership requests to main memory; and
Yet another object is to provide a memory system that ensures that multiple simultaneously-pending memory write requests from the same processor are processed in the time-order in which the requests were issued so that data coherency is maintained.
The objectives of the present invention are achieved in a memory system that allows a processor to have multiple ownership requests pending to memory simultaneously. The data processing system of the preferred embodiment includes multiple processors, each coupled to a respective cache memory. These cache memories are further coupled to a main memory through one or more additional intermediate levels of cache memory. As is known in the art, copies of main memory data may reside in one or more of the cache memories within the hierarchical memory system. The main memory includes a directory to record the location and status of the most recent copy of each addressable portion of memory.
A processor makes memory requests to its respectively-coupled cache memory. In the case of write requests, the respectively coupled cache memory must verify that ownership has already been obtained for the requested addressable portion of memory. If ownership has not been obtained, the cache memory must make an ownership request via the intermediate levels of cache memory. This request will be forwarded to main memory, if necessary, which, in turn, may be required to complete the request by invalidating a copy of the data located in another cache memory. Request processing may also require that an updated data copy be obtained from the other cache memory and forwarded to the requesting cache.
The current invention allows multiple requests for ownership to be pending from a processor's respectively-coupled cache memory simultaneously. In the preferred embodiment, first request logic associated with the respectively-coupled cache memory receives a first write request from the processor. The first write request will be staged to second write request logic if another write request is not already being processed by the respectively-coupled cache. After the first request is staged, another write request may be provided to the first request logic for processing.
After being staged to the second write request logic, a determination is made as to whether ownership is available for the addressable memory portion requested by the first write request. If ownership is not available, an ownership request is made for the requested memory portion via the intermediate cache structure. While this request is being issued, a second determination is made regarding the availability of ownership for the second write request. A second ownership request is generated if ownership is again unavailable for the requested memory portion.
Eventually, ownership and any updated data associated with the first request will be provided to the requesting cache by main memory, or alternatively, by another cache memory. The first write request may then be completed to the requesting cache. After the completion of the first request, ownership for the second request is, in most cases, already available because of the concurrent request processing for the first and second ownership requests. The second write request is staged to the second write request logic and completed without delay. Thus, the time required to process the second request is, in most instances, "buried" by the processing of the first request, thereby reducing the processing time for the two requests by almost fifty percent.
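The saving of almost fifty percent may be checked with a simple timing model. The model below is an assumption made for illustration: it treats the ownership latency, the time to complete a write once ownership is held, and the gap between issuing the two ownership requests as fixed cycle counts. The function names and the example figures are hypothetical.

```python
def serial_time(ownership_latency: int, write_time: int, n: int = 2) -> int:
    """Each write waits for its own ownership request before the next starts."""
    return n * (ownership_latency + write_time)


def overlapped_time(ownership_latency: int, write_time: int, issue_gap: int) -> int:
    """Two ownership requests are in flight at once; writes complete in order."""
    first_done = ownership_latency + write_time
    second_owned = issue_gap + ownership_latency
    # The second write starts once the first has completed and its own
    # ownership has arrived, whichever is later.
    return max(first_done, second_owned) + write_time


# Hypothetical figures: 100-cycle ownership latency, 2-cycle write,
# second ownership request issued 1 cycle after the first.
serial = serial_time(100, 2)          # 204 cycles
overlapped = overlapped_time(100, 2, 1)  # 104 cycles
```

With these assumed figures the two writes take 104 cycles instead of 204, a reduction of roughly 49 percent; the saving approaches fifty percent as the ownership latency grows relative to the write time.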
In the system of the preferred embodiment, ownership grants are not necessarily provided in the order in which ownership requests are made. Therefore, in the above example, ownership for the second request may become available prior to that for the first request. The current invention includes control logic to ensure that requests are processed in the order issued by the respective instruction processor, regardless of the order in which ownership is granted. This is necessary to ensure newer data is not erroneously overwritten by an older request.
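The ordering requirement may be sketched as a queue that completes a write only when it is the oldest pending request and its ownership has been granted. The class and method names are hypothetical, and the sketch assumes that ownership, once granted, is retained until the corresponding write completes.

```python
from collections import deque
from typing import Deque, List, Set


class InOrderWriteCompleter:
    """Completes writes in issue order even if grants arrive out of order."""

    def __init__(self) -> None:
        self.pending: Deque[int] = deque()  # addresses, oldest first
        self.granted: Set[int] = set()      # addresses whose ownership is held
        self.completed: List[int] = []      # completion order, for inspection

    def issue(self, address: int) -> None:
        """Record a write request in the order the processor issued it."""
        self.pending.append(address)

    def grant(self, address: int) -> None:
        """Record an ownership grant, then drain every write that is ready."""
        self.granted.add(address)
        # Only the oldest pending write may complete; a younger write whose
        # ownership arrived early waits behind it.
        while self.pending and self.pending[0] in self.granted:
            self.completed.append(self.pending.popleft())
```

For example, if writes to addresses 0xA and 0xB are issued in that order and ownership for 0xB is granted first, nothing completes until ownership for 0xA arrives, at which point both writes complete in issue order.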
According to another aspect of the invention, a write request buffer coupled to the respective cache memory is provided to receive additional pending write requests issued by the processor. The processor may continue issuing write requests until the write request buffer is full. The pending requests are processed in the order they are issued. Therefore, after the cache completes processing of the older of two simultaneously-pending write requests in the above-described manner, a predetermined one of the requests stored in the write request buffer is removed from the buffer and provided to the first write request logic to be processed by the cache.
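The behavior of the write request buffer may be sketched as a bounded first-in, first-out queue. The names below are hypothetical, and the choice of the oldest buffered request as the "predetermined one" handed to the cache is an assumption consistent with processing requests in the order issued.

```python
from collections import deque
from typing import Deque, Optional


class WriteRequestBuffer:
    """Bounded buffer of pending write addresses (illustrative model only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._queue: Deque[int] = deque()

    def try_issue(self, address: int) -> bool:
        """Accept a write from the processor unless the buffer is full."""
        if len(self._queue) >= self.capacity:
            return False  # the processor must wait before issuing more writes
        self._queue.append(address)
        return True

    def next_for_cache(self) -> Optional[int]:
        """Hand the oldest buffered write to the cache, if one exists."""
        return self._queue.popleft() if self._queue else None
```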
The current invention further provides read request processing logic coupled to the respectively-coupled cache. A read request issued by the processor is received by the read request logic, and is processed, in most cases, before processing completes for any of the multiple pending write requests. An exception to this rule exists for a read request that requests access to the same addressable portion of memory as was requested by a previously-issued write request. In this case, the processing of the read request must be delayed until the previously-issued write operation is completed. The expedited handling of read requests is performed because, in the system of the preferred embodiment, an instruction processor cannot continue execution until a pending read request to memory has been completed. In contrast, outstanding write requests do not cause the processor to "stall" in this manner, and processor execution may continue even if multiple outstanding write requests are pending to memory.
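The conflict check that decides whether a read may by-pass the pending writes may be sketched as follows. The size of an addressable portion of memory is not stated here, so `LINE_BYTES` is a hypothetical value, as are the function names.

```python
from typing import Iterable

LINE_BYTES = 64  # hypothetical size of one addressable portion of memory


def line_of(address: int) -> int:
    """Map a byte address to the addressable portion that contains it."""
    return address // LINE_BYTES


def read_may_bypass(read_address: int, pending_write_addresses: Iterable[int]) -> bool:
    """Return True if the read targets no portion with a pending write."""
    target = line_of(read_address)
    return all(line_of(w) != target for w in pending_write_addresses)
```

A read for which this check returns False would be held until the conflicting write completes; all other reads proceed ahead of the pending writes.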
Still other objects and advantages of the present invention will become readily apparent to those skilled in the art from the following detailed description of the preferred embodiment and the drawings, wherein only the preferred embodiment of the invention is shown, simply by way of illustration of the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various respects, all without departing from the invention. Accordingly, the drawings and description are to be regarded to the extent of applicable law as illustrative in nature and not as restrictive.