In high-performance multiple-processor systems in which each processor contains one or more Scalar Execution Elements (SXE) and one or more Vector Execution Elements (VXE), it is highly advantageous for the SXEs to access memory through a first-level data cache and for the VXEs to fetch and store data directly to system memory, bypassing the data cache. This is particularly true in a "tightly coupled" multiprocessor (MP) system designed primarily for large high-end scientific and engineering applications and high-performance computers (supercomputers).
The above is true for several reasons. A cache is required for performance on scalar-only applications and for the scalar portions of primarily scientific applications. For highly parallel applications dominated by VXE operations, the data cache capacity, bandwidth, and cache blocking mechanisms can severely limit performance, and it is important that the VXEs "pipeline" operand requests directly to and from system memory. This introduces two significant system design requirements:
(a) Due to VXE traffic, the system memory design must accommodate a higher number of operand requests than in a design where all requests go through a first level data cache.
(b) In order to ensure "cache coherency" the system must provide a centralized Data Integrity mechanism capable of servicing a very high VXE request traffic rate.
1. EX ownership when the data unit is not found in any processor's copy directory.
2. EX ownership when the data unit is found changed with EX ownership in another processor's copy directory. The requested data unit is castout of the other processor's cache before it is fetched into the requesting processor's cache.
3. RO ownership when the data unit is found not changed with EX ownership in another processor's copy directory, and the new request is deemed not likely to change the data unit (fetch request). Also, the found data unit is left in its cache where its ownership is changed from EX to RO.
4. EX ownership when the data unit is found with RO ownership in one or more other processor's copy directories, and the new request is deemed to likely change the data unit (store interrogate request). The found data unit is invalidated in the other processor's cache. This XI operation uses a time-consuming process called "promote to exclusive".
5. RO ownership when the data unit is found with RO ownership in another processor's copy directory. Also, the found data unit is left in its processor's cache with its RO ownership.
6. RO ownership when the data unit is a page table entry found with RO public ownership set in the entry, regardless of the type of processor request.
Present multiple-processor supercomputer designs avoid the cache coherency problem either by omitting conventional scalar data caches from their design or, if they have caches, by imposing the coherency solution on the software operating system and/or application. If designed without a cache, the memory access time is minimized by using expensive high-performance static RAM memory chips. These approaches limit the range of applications for which these designs produce high-performance results and/or introduce considerable added complexity in software.
Prior multiple-processor systems have used processor-private store-in L1 caches, and they have maintained the coherence of data in the system by using a set of copy directories, which are copies of all L1 cache directories. Each processor's fetch request is cross-interrogated in the copy directories of all other processors to find whether any other processor has a copy of a requested data unit. This process assures that only one processor at a time can have exclusive (EX) ownership for writing in a data unit in the system. Only the one processor that has exclusive ownership of a data unit is allowed to write into the data unit. A data unit can also have public ownership (previously called read only (RO) authority), which allows all processors to read (fetch) the data unit but prohibits all processors from writing into the data unit.
The data coherence problem is simpler with a store-through type of cache, which requires all stores made in the L1 cache also be concurrently made in a backing memory. The memory backing the L1 private processor caches may be an L2 shared memory, or it may be the L3 main memory. The shared L2 cache may be store-in or store-through, but preferably is store-in to reduce the store bus traffic to main memory.
The store-in type of cache has been used in computer systems because it requires less bandwidth for its memory bus (between the memory and the cache) than is required by a store-through type of cache for the same frequency of processor accesses. Each cache location may be assigned to a processor request and receive a copy of a data unit fetched from system main memory or from another cache in the system. With a store-in cache, a processor stores into a data unit in a cache location without storing into the correspondingly addressed data unit in main memory, which causes the cache location to become the only location in the system containing the latest changed version of the data unit. The processor may make as many stores (changes) in the data unit as its executing program requires. The integrity of data in the system requires that the latest version of any data unit be used for any subsequent processing of the data unit.
Store-through caches are used only for fetching, and they maintain the latest version of their accessed data units by having every store access change both the processor's store-through cache and the same data unit in a memory (another cache or main storage) at the next level in the system storage hierarchy. But the store-through characteristic of such caches does not solve the coherence problem in the system, since another processor's store-through cache could contain an older version of the same data unit. Therefore, cross-interrogation of the contents of private processor caches in multiple-processor systems is needed, whether they are store-in or store-through, when a new request is being fetched into a processor cache.
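The difference between the two cache types can be illustrated with a minimal Python sketch. It is a toy model only: the class names (`Memory`, `StoreThroughCache`, `StoreInCache`), the addresses, and the write counter are hypothetical and do not come from any cited design. The first function counts how many stores reach the backing memory under each policy; the second shows how a second processor's store-through cache can be left holding an older version of a data unit.

```python
class Memory:
    """Hypothetical backing store (an L2 cache or main memory)."""

    def __init__(self):
        self.data = {}
        self.writes = 0  # number of stores that reached the memory bus

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value
        self.writes += 1


class StoreThroughCache:
    """Every store updates both the cache and the backing memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:  # cache miss: fetch from backing memory
            self.lines[addr] = self.memory.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value)


class StoreInCache:
    """Stores stay in the cache until the data unit is cast out."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.changed = set()  # addresses whose cached copy differs from memory

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.changed.add(addr)

    def castout(self, addr):
        if addr in self.changed:  # only changed units are written back
            self.memory.write(addr, self.lines[addr])
            self.changed.discard(addr)
        self.lines.pop(addr, None)


def bus_traffic_demo():
    mem_through, mem_in = Memory(), Memory()
    through, store_in = StoreThroughCache(mem_through), StoreInCache(mem_in)
    for value in range(1, 6):  # five stores to the same data unit
        through.store(0x100, value)
        store_in.store(0x100, value)
    stale = mem_in.read(0x100)  # memory has not yet seen the store-in changes
    store_in.castout(0x100)
    return mem_through.writes, stale, mem_in.writes, mem_in.read(0x100)


def stale_copy_demo():
    memory = Memory()
    cpu0, cpu1 = StoreThroughCache(memory), StoreThroughCache(memory)
    cpu0.load(0x200)      # cpu0 caches the old value
    cpu1.store(0x200, 7)  # cpu1 stores through to memory
    return cpu0.load(0x200), memory.read(0x200)
```

In this model `bus_traffic_demo()` returns `(5, 0, 1, 5)`: five memory writes for store-through against a single castout write for store-in. `stale_copy_demo()` returns `(0, 7)`, the stale copy that cross-interrogation is meant to catch.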
Exclusive ownership (authority to change a cache data unit) is assigned to any processor before it is allowed to perform its first store operation in a data unit. The assignment of processor ownership has conventionally been done by setting an exclusive (EX) flag bit in a cache directory (sometimes called a tag directory) associated with the respective data unit in the cache. The ON state of the EX flag bit typically indicates exclusive ownership, and the OFF state indicates public ownership (called "read-only authority"). Exclusive ownership by a processor allows only that processor to store into the data unit. Public (read-only) ownership of a data unit does not allow any processor to store into that data unit, but it allows all processors in the system to read it (which can result in multiple copies of the non-changeable data unit in different processor caches in the system).
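A minimal sketch of such a directory entry follows, assuming a hypothetical `DirectoryEntry` record with an `ex` flag and a `changed` flag. The field and function names are illustrative only and are not taken from the cited designs.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Hypothetical cache directory entry for one data unit."""
    address: int
    valid: bool = True
    ex: bool = False       # EX flag: True = exclusive, False = public (read-only)
    changed: bool = False  # set once the owning processor has stored into the unit


def may_read(entry):
    # Both exclusive and public ownership permit fetching.
    return entry.valid


def may_store(entry):
    # Only exclusive ownership permits storing.
    return entry.valid and entry.ex


def store(entry):
    if not may_store(entry):
        raise PermissionError("exclusive ownership required before the first store")
    entry.changed = True
```

A processor holding the entry with the EX flag off can read the data unit, but must obtain exclusive ownership before `store` succeeds.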
Typically, a cache fetches data units from its storage hierarchy on a demand basis, and a processor cache miss generates a fetch request which is sent to the next level in the storage hierarchy for fetching the data unit.
A store-in cache transmits its changed data units to main memory under control of cache replacement controls, sometimes called the LRU controls. Replacement of a data unit may occur when it has not been recently accessed in the cache and no other cache entry is available for the new request. This replacement process is sometimes called "aging out" when a least recently used (LRU) entry is chosen to be replaced with a new request. The replacement controls cause the data unit (whether changed or not) in the selected entry to be replaced by another data unit (fetched as a result of a cache miss). When the data unit to be replaced in the cache has been changed, it must be cast out of the cache and written into another place, such as main memory, before it is lost by being overwritten by the newly requested data unit being fetched from main memory. For example, a processor may request a data unit not currently in the cache, which must be fetched from main memory (or from another cache) using the requested address and stored in the newly assigned LRU cache location. The new data unit is assigned to a cache location not in current use if one can be found. If all of the usable cache locations are currently occupied with changed data units, then one of them must be reassigned for the new request, and the updated data unit in that location must be cast out to main memory before the new data unit is written into the cache. The castout data unit has its ownership changed from an exclusive processor ownership to a main memory ownership.
If a data unit has not been changed in the cache, it is merely overlaid by the new data unit without any castout, since its backing copy in main memory is identical.
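The replacement and castout behavior described above can be sketched as follows. This is a simplified model under stated assumptions: a fully associative cache with strict LRU ordering, main memory modelled as a plain dictionary, and a hypothetical class name `StoreInLRUCache`. Real replacement controls are typically set-associative and may only approximate LRU.

```python
from collections import OrderedDict


class StoreInLRUCache:
    """Toy store-in cache with LRU replacement and castout of changed units."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory          # dict modelling main memory
        self.entries = OrderedDict()  # addr -> [value, changed]; oldest first
        self.castouts = []            # addresses written back to memory

    def _fetch(self, addr):
        if addr in self.entries:
            self.entries.move_to_end(addr)  # mark as most recently used
            return
        if len(self.entries) >= self.capacity:
            # No free location: age out the least recently used entry.
            victim, (value, changed) = self.entries.popitem(last=False)
            if changed:
                # A changed unit must be cast out before it is overwritten.
                self.memory[victim] = value
                self.castouts.append(victim)
            # An unchanged unit is simply overlaid; memory already matches.
        self.entries[addr] = [self.memory.get(addr, 0), False]

    def load(self, addr):
        self._fetch(addr)
        return self.entries[addr][0]

    def store(self, addr, value):
        self._fetch(addr)
        self.entries[addr][0] = value
        self.entries[addr][1] = True


def castout_demo():
    memory = {1: 10, 2: 20, 3: 30}
    cache = StoreInLRUCache(capacity=2, memory=memory)
    cache.store(1, 11)  # unit 1 is changed in the cache only
    cache.load(2)
    cache.load(3)       # evicts unit 1 (changed): castout required
    cache.load(1)       # evicts unit 2 (unchanged): no castout
    return cache.castouts, memory[1], memory[2]
```

In this model `castout_demo()` returns `([1], 11, 20)`: only the changed data unit is written back, while the unchanged one is overlaid without any memory traffic.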
U.S. Pat. No. 4,394,731 to Flusche et al. teaches the use of an exclusive/read only (EX/RO) flag in each entry in each private processor store-in cache directory for data coherence control in a computer system. A copy directory was provided for each processor's private L1 directory to identify the respective processor ownership of all data units currently in its cache, and the set of all processor copy directories was used to recognize which processor owned, or was publicly using, a data unit being requested exclusively by another processor in the system. Cross-interrogation (XI) was the process used among the copy directories to identify which, if any, processor had exclusive or public ownership of any data unit, which was done by comparing the address of a requested data unit with addresses in all copy directories. If the requested address was found in a copy directory, it identified a processor cache having that data unit. Cross-invalidation signaling was then done from the identified processor's copy directory to its L1 cache to invalidate the entry for that data unit before passing the ownership of the data unit to another processor's cache.
This XI process assured exclusivity of a data unit to only one processor at a time by invalidating any copy of the data unit found in any other processor's private cache.
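The cross-interrogation and cross-invalidation steps can be sketched roughly as below. The data layout is an assumption made for illustration: each copy directory is a dictionary from address to `"EX"` or `"RO"`, each L1 cache is a set of resident addresses, and the function names are hypothetical. The castout of changed data that would precede invalidation is deliberately omitted.

```python
def cross_interrogate(copy_dirs, requester, addr):
    """Return (cpu, ownership) for every other processor holding addr."""
    return [
        (cpu, directory[addr])
        for cpu, directory in copy_dirs.items()
        if cpu != requester and addr in directory
    ]


def request_exclusive(copy_dirs, l1_caches, requester, addr):
    """Grant EX ownership to requester after invalidating all other copies."""
    invalidated = []
    for cpu, _ownership in cross_interrogate(copy_dirs, requester, addr):
        l1_caches[cpu].discard(addr)  # cross-invalidate signal to that L1 cache
        del copy_dirs[cpu][addr]      # keep the copy directory in step
        invalidated.append(cpu)
    copy_dirs[requester][addr] = "EX"
    l1_caches[requester].add(addr)
    return invalidated


def xi_demo():
    copy_dirs = {0: {0x40: "RO"}, 1: {0x40: "RO"}, 2: {}}
    l1_caches = {0: {0x40}, 1: {0x40}, 2: set()}
    invalidated = request_exclusive(copy_dirs, l1_caches, requester=2, addr=0x40)
    holders = [cpu for cpu, d in copy_dirs.items() if 0x40 in d]
    return invalidated, holders, copy_dirs[2][0x40]
```

Here `xi_demo()` returns `([0, 1], [2], 'EX')`: after the exchange, processor 2 is the only holder of the data unit and holds it exclusively.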
Hence, only one of the plural processors in a multiprocessing (MP) system can have exclusive ownership (write authority) at any one time over any data unit. The exclusive ownership over any data unit may be changed from one processor to another when a different processor requests exclusive ownership. The prior mechanism for indicating exclusive ownership for a processor was to provide an exclusive (EX) flag bit in each L1 directory entry in a processor's private L1 cache; the EX bit was set on to indicate which of the associated data units were "owned" by that processor. The reset state of the EX flag bit indicated public ownership, which was called "read only authority" for the associated data unit and which made it simultaneously available to all processors in the system. Thus, each valid data unit in any processor's private L1 cache had either exclusive ownership or public ownership.
The copy-directory XI technique of prior U.S. Pat. No. 4,394,731 automatically assigned ownership to a data unit fetched from main storage into a processor's private L1 store-in cache according to the six cases numbered 1 through 6 earlier in this section.
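The six ownership cases numbered 1 through 6 in this section can be condensed into a small decision function. The sketch below is illustrative only: the function name, parameter names, and action strings are hypothetical, and combinations that the text does not enumerate (for example, a store interrogate request that finds an unchanged EX copy) are rejected rather than guessed.

```python
def assign_ownership(request, found=None, changed=False, public_pte=False):
    """Return (ownership granted to requester, action on the found copy).

    request:    "fetch" or "store_interrogate"
    found:      None, "EX", or "RO" (state found in another copy directory)
    changed:    whether the found EX copy has been changed
    public_pte: data unit is a page table entry marked RO public
    """
    if public_pte:
        return ("RO", "none")                # case 6: regardless of request type
    if found is None:
        return ("EX", "none")                # case 1: not found anywhere
    if found == "EX" and changed:
        return ("EX", "castout")             # case 2: castout, then fetch
    if found == "EX" and request == "fetch":
        return ("RO", "demote EX to RO")     # case 3: unchanged, left in cache
    if found == "RO" and request == "store_interrogate":
        return ("EX", "invalidate")          # case 4: "promote to exclusive"
    if found == "RO" and request == "fetch":
        return ("RO", "none")                # case 5: shared read-only
    raise NotImplementedError("combination not enumerated in the text")
```

For example, `assign_ownership("store_interrogate", found="RO")` yields `("EX", "invalidate")`, the time-consuming case 4.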
Designs such as those illustrated in Gannon et al. U.S. Pat. No. 5,265,232, issued Nov. 23, 1993 for "Coherence Control by Data Invalidation in Selected Processor Caches Without Broadcasting to Processor Caches Not Having the Data", although logically solving the cache coherency problem, again limit performance by requiring all memory requests to go through the first level cache as explained above.
The patent application of Bean et al. (Application Ser. No. 07/680,176), filed Apr. 3, 1991, entitled "Ownership Interlock for Cache Data Units" and assigned to the same assignee, describes and claims an ownership interlock control for cache data units. It interlocks a change of ownership for an exclusively-owned data unit in a store-in cache with the completion of all stores to the data unit issued by its processor up to the time it responds to a received cross-invalidate (XI) signal caused by another processor requesting the data unit either exclusively or with public ownership.
The object of this invention is to provide a means and method of solving the cache coherency problem while still allowing "pipelined" VXE operand requests directly to system memory, thus preserving the advantages available from scalar data caches and also yielding the advantages of conventional supercomputer designs. With this approach, the resulting hardware system design is optimized over a much broader range of applications and does not impose on system software the task of ensuring data integrity.