Modern day computer systems frequently comprise a central processing unit and a memory hierarchy including a relatively large, but relatively slow main memory module and a relatively small, but relatively fast cache memory coupled between the central processing unit and main memory module. Data and instructions currently being processed by the central processing unit are temporarily stored in the cache memory to take advantage of the high speed of operation of the cache memory to thereby increase the overall speed of operation of the central processing unit.
The use of a cache memory is based upon the principles of temporal locality and spatial locality. More specifically, when a central processing unit is referring to data and instructions from a particular space within physical memory, it will most probably refer again to the data and instructions from that space (temporal locality), and also refer to data and instructions from contiguous space (spatial locality), for a certain period of time. Accordingly, data blocks including the contiguous space of physical memory where data being utilized by the central processing unit reside are placed in the cache memory to greatly decrease the time required to fetch data and instructions from those frequently referenced data blocks.
A cache memory scheme may be either a write-through cache or a write-back cache. In a write-through cache, a central processing unit writes through to main memory whenever it writes to an address in cache memory. Such write to main memory may not be instantaneous when a cache memory scheme places buffers in the path between main memory and the write-through cache. In a write-back cache, the central processing unit does not update main memory at the time of writing to its cache memory but updates memory at a later time. Thus, a write-back cache speeds up the operation of the central processing unit because it minimizes writes to main memory. For example, when the central processing unit is changing the contents of its cache memory, it will send the latest copy of written-to data to main memory before it refills the space within the cache occupied by the written-to data. In this manner, the speed of operation of the central processing unit is not slowed down by the time that would be required to update main memory after each write operation. Instead, main memory is typically updated after a number of write operations have been performed on the data block contained in the cache memory.
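The difference between the two write policies described above can be illustrated with a minimal sketch. The class names, the `evict` method, and the address `0x10` are hypothetical and are introduced only for illustration; buffers between the cache and main memory are omitted, and the write-back cache is assumed to update main memory only when the written-to data is displaced.

```python
class MainMemory:
    """Hypothetical main memory that counts how often it is written."""

    def __init__(self):
        self.data = {}
        self.write_count = 0

    def write(self, address, value):
        self.data[address] = value
        self.write_count += 1

    def read(self, address):
        return self.data.get(address, 0)


class WriteThroughCache:
    """Every write to the cache is also written through to main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, address, value):
        self.lines[address] = value
        self.memory.write(address, value)


class WriteBackCache:
    """Writes stay in the cache; main memory is updated only on eviction."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()  # addresses written to but not yet in main memory

    def write(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)

    def evict(self, address):
        # Send the latest copy to main memory before the space is refilled.
        if address in self.dirty:
            self.memory.write(address, self.lines[address])
            self.dirty.discard(address)
        self.lines.pop(address, None)


def count_memory_writes(cache_cls, values):
    """Write each value to one address, then evict if the cache supports it."""
    memory = MainMemory()
    cache = cache_cls(memory)
    for value in values:
        cache.write(0x10, value)
    if hasattr(cache, "evict"):
        cache.evict(0x10)
    return memory.write_count, memory.read(0x10)
```

Under these assumptions, three successive writes to the same address reach main memory three times through the write-through cache but only once through the write-back cache, and main memory holds the same final value in both cases.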
Many computer systems operate on the basis of the concept of a single, simple copy of data. In a multiprocessor system including several central processing units, each with its own write-back cache, incoherencies within the data arise when one of the central processing units writes to a data block in its cache memory. These incoherencies result when a particular central processing unit writes to its cache memory thereby causing main memory to have a copy of the data which is not correct until the central processing unit updates main memory.
If a particular central processing unit requests a data block currently in the cache of another central processing unit of the multiprocessor system and that data block has been written to by that other central processing unit on a write-back basis, as described above, a coherency scheme must be utilized to ensure that the latest correct copy of the data is sent to the requesting central processing unit. Typically, multiprocessor systems have implemented a so-called "snoopy" protocol in a shared or point-to-point bus configuration for the several central processing units of the system to assure that the latest copy of a data block is sent to a requesting central processing unit.
In a point-to-point bus arrangement, all of the central processing units of the multiprocessor system are coupled to main memory through a central memory controller. Each of the caches of the several central processing units and any other devices coupled to the memory controller "snoop" on (i.e., watch or monitor) all transactions with main memory by all of the other caches. Thus, each cache is aware of all data blocks transferred from main memory to the several other caches throughout the multiprocessor system. Inasmuch as the caches are coupled to main memory by a single memory controller, it is necessary to implement an arbitration mechanism to grant access to one of possibly several devices requesting access to specific blocks of main memory at any particular time. The arbitration mechanism will effectively serialize transactions with main memory and the snoopy protocol utilizes the serialization to impose a rule that only one cache at a time has permission to modify a data block.
After modification of the data block in the one cache, main memory no longer contains a valid copy of the data until it is updated by the cache having the written-to block, as described above. In accordance with the snoopy protocol, the copy of the written-to data block in the one cache is substituted for the main memory copy whenever another cache requests that data block prior to the update of main memory. An ownership model of the snoopy protocol includes the concept of "ownership" of a data block. A device must first request and obtain ownership of a data block in its cache before it can write to that data block. Furthermore, in accordance with the ownership model, ownership of a data block must be given up by the current owner when ownership of the data block is requested by another device. At most one device can own a data block at any one time and the owner always has the valid copy of that data block. The owner can update main memory before it relinquishes ownership of the block or it can transfer ownership from cache to cache without ever having to update main memory. In either situation, coherency is maintained since only one cache contains the valid data.
By definition, ownership means the right to modify a data block. When no system device owns a data block, it is said to be owned by main memory and copies of the data block may be "shared" by any of the system devices. A shared mode means the device has read-only access to a shared copy of a data block residing in its cache. Since main memory owns the data block in the shared mode, all shared copies exactly match the copy in main memory and are, hence, correct copies. Once any one device other than main memory obtains ownership of a data block, no other device may share the block and all copies of the data block which are shared are invalidated at the time ownership is obtained by the one device.
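The ownership rules described above can be sketched as a small bookkeeping model. The class `OwnershipTable`, its method names, and the device names used with it are hypothetical and serve only to illustrate the rules; an actual snoopy protocol distributes this state among the caches rather than holding it in one table, and the sketch does not model the data transfer itself.

```python
MAIN_MEMORY = "main_memory"


class OwnershipTable:
    """Hypothetical record of the owner and read-only sharers of each block."""

    def __init__(self):
        self.owner = {}    # block -> owning device; absent means main memory
        self.sharers = {}  # block -> set of devices holding a shared copy

    def owner_of(self, block):
        return self.owner.get(block, MAIN_MEMORY)

    def read_shared(self, device, block):
        """Grant a read-only copy only while main memory owns the block."""
        if self.owner_of(block) != MAIN_MEMORY:
            return False
        self.sharers.setdefault(block, set()).add(device)
        return True

    def read_for_ownership(self, device, block):
        """Give ownership to the requester and invalidate all shared copies.

        Returns the set of other devices whose shared copies were invalidated.
        Any previous owner gives up ownership, since ownership must be given
        when it is requested.
        """
        invalidated = self.sharers.pop(block, set()) - {device}
        self.owner[block] = device
        return invalidated

    def can_write(self, device, block):
        # Only the owner has the right to modify the block.
        return self.owner_of(block) == device

    def write_back(self, device, block):
        """The owner updates main memory and relinquishes ownership."""
        if self.owner_of(block) != device:
            raise PermissionError(f"{device} does not own block {block}")
        del self.owner[block]
```

For example, if two devices hold shared copies of a block and one of them performs a read for ownership, the other device's copy is invalidated and only the new owner may write. A later read for ownership by the second device transfers ownership from cache to cache without main memory being updated.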
It is implicit above that main memory does not respond to a request if some cache owns the data; instead, the cache that has ownership of the data will supply it. Typically, this is done by having each cache search itself for the data with each bus request. If a cache finds it owns the requested data, it suppresses main memory (which could otherwise respond), typically by transmitting a signal to the memory controller, and applies the data to the bus itself.
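The manner in which an owning cache responds in place of main memory can be sketched as follows. The names `SnoopingCache`, `snoop`, and `service_read` are hypothetical, and the suppression signal to the memory controller is represented simply by returning before main memory is consulted.

```python
class SnoopingCache:
    """Hypothetical cache that searches itself on every bus request."""

    def __init__(self, name):
        self.name = name
        self.owned = {}  # block -> data, for blocks this cache currently owns

    def snoop(self, block):
        """Return (True, data) if this cache owns the block, else (False, None)."""
        if block in self.owned:
            return True, self.owned[block]
        return False, None


def service_read(block, caches, main_memory, requester):
    """Return (responder name, data) for a read request placed on the bus."""
    for cache in caches:
        if cache is requester:
            continue
        owns, data = cache.snoop(block)
        if owns:
            # The owner suppresses main memory and applies the data itself.
            return cache.name, data
    # No cache owns the block, so main memory responds.
    return "main_memory", main_memory[block]
```

In this sketch, if one cache owns a block whose main memory copy is out of date, a read request from another cache is answered by the owner with the latest data. If no cache owns the block, main memory supplies it.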
A major drawback to the foregoing cache coherency scheme arises in multiprocessor systems where processors are grouped together as processor pairs. A common example of a processor pair is a vector processor and scalar processor. The vector processor is a specialized processor that performs floating point operations at high speeds. Such vector processor is paired off with a scalar processor which performs the standard data manipulating and handling operations of a general purpose central processing unit.
Typically, the scalar/vector processor pair is utilized to speed up the execution of program instructions. For example, all instructions for executing a program would be processed by the scalar processor. Upon detecting a vector instruction, the scalar processor would pass the instruction to the vector processor for processing and either wait for the results or continue processing. In this situation, the processor pair typically will operate on the same data. Since the vector processor is the "data hungry" processor of the processor pair, the common data will most likely reside in its cache. There is a need, however, for sharing the data stored in its cache at various stages of program execution with its scalar partner.
Known multiprocessor systems have accomplished the sharing of data between a scalar/vector processor pair by forcing the requesting processor to initiate a bus transaction, i.e., a read for ownership, to fetch the data from main memory since it is unaware that its partner has the data in its cache. Upon detecting that the requested data resides in its cache, the other processor of the processor pair would transfer over the requested block. This results in a number of extra bus cycles which will reduce the system throughput. The foregoing example of a cache coherency scheme provides an effective scheme for maintaining data coherency throughout a multiprocessor system including devices having write-back caches. However, a major drawback of the scheme is that processor pairs which operate on common data must follow the same procedures for accessing data as other processors. Thus, the maximum speed of operation theoretically possible in a system having processor pairs is diminished in practical applications since each time one of the processors of the pair needs data that is stored in the other processor's cache, it must perform a read for ownership and wait to have the data sent from its processor pair partner. Moreover, the requesting processor may not require an entire block of data but must nevertheless request the entire block, thereby preventing its processor pair partner from operating on the remaining portion of the data block that was not needed.