Modern-day computer systems frequently comprise a central processing unit and a memory hierarchy including a relatively large, but relatively slow main memory and a relatively fast, but relatively small cache memory coupled between the central processing unit and the main memory. The data and instructions currently being processed by the central processing unit are temporarily stored in the cache memory to take advantage of the high speed of operation of the cache memory and thereby increase the overall speed of operation of the central processing unit. The use of a cache memory is based upon the principles of temporal locality and spatial locality. More specifically, when a central processing unit is referring to data and instructions from a particular space within physical memory, it will most probably refer once again to the data and instructions from that space (temporal locality), and also to the data and instructions from contiguous space (spatial locality), for a certain period of time. Accordingly, data blocks within the contiguous space of physical memory where data being utilized by the central processing unit resides are placed in the cache memory to greatly decrease the time required to fetch data and instructions from those frequently referenced data blocks.
A cache memory scheme may be either a write-through cache or a write-back cache. In a write-through cache, a central processing unit writes through to main memory whenever it writes to an address in cache memory. In a write-back cache, the central processing unit does not update the main memory at the time of writing to its cache memory but updates the memory at a later time. For example, when the central processing unit is changing the contents of its cache memory, it will send the latest copy of written-to data to the main memory before it refills the space within the cache occupied by the written-to data. In this manner, the speed of operation of the central processing unit is not slowed down by the time that would be required to update the main memory after each write operation. Instead, the main memory is updated at the completion of all operations relating to the data block contained in the cache memory.
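The difference between the two write policies can be illustrated with a minimal sketch. The class and function names below (MainMemory, WriteThroughCache, WriteBackCache, demo_write_policies) are hypothetical and chosen only for illustration. The model tracks a single address and ignores cache capacity, block size and timing.

```python
class MainMemory:
    """Toy main memory that counts the write transactions reaching it."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def write(self, addr, value):
        self.data[addr] = value
        self.writes += 1

    def read(self, addr):
        return self.data.get(addr, 0)


class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value)  # every write also goes to main memory


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()  # addresses written to but not yet sent to memory

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)  # the main memory update is deferred

    def evict(self, addr):
        # Send the latest copy to main memory before the space is refilled.
        if addr in self.dirty:
            self.memory.write(addr, self.lines[addr])
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


def demo_write_policies():
    wt_mem, wb_mem = MainMemory(), MainMemory()
    wt, wb = WriteThroughCache(wt_mem), WriteBackCache(wb_mem)
    for value in range(10):
        wt.write(0x40, value)
        wb.write(0x40, value)
    wb.evict(0x40)
    return wt_mem.writes, wb_mem.writes, wt_mem.read(0x40), wb_mem.read(0x40)
```

In this sketch, ten writes to the same address produce ten main memory writes under the write-through policy and a single main memory write under the write-back policy, while both memories end up holding the same final value.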
Many computer systems operate on the basis of the concept of a single, simple copy of data. In a multi-processor system including several central processing units, each with its own write-back cache, incoherencies within the data arise when one of the central processing units writes to a data block in its cache memory. In other words, when a particular central processing unit writes to its cache, the main memory will not have a valid copy of the data until the central processing unit updates the main memory.
Some computer systems require that a central processing unit obtain the "privilege" to perform a write before modifying a data block. For purposes of the following discussion, this so-called write privilege will be considered the same as a concept termed "ownership" and both will be described under the term "ownership".
If a particular central processing unit requests a data block currently in the cache of another central processing unit of the multi-processor system and that data block has been written to by such other central processing unit on a write-back basis, as described above, a coherency scheme must be utilized to ensure that the latest copy of the data is sent to the requesting central processing unit. For example, some known multi-processor systems have implemented a so-called "snoopy" protocol in a shared bus configuration for the several central processing units of the system to assure that the latest copy of a data block is sent to a requesting central processing unit.
Pursuant to the snoopy protocol, all of the central processing units of the multi-processor system are coupled to the main memory through a single, shared bus. Each of the caches of the several central processing units and any other devices coupled to the shared bus "snoop" on (i.e., watch or monitor) all transactions with main memory by all of the other caches. Thus, each of the caches is aware of all data blocks transferred from main memory to the several other caches throughout the multi-processor system. Inasmuch as the caches are coupled to the main memory by a single, shared bus, it is necessary to implement an arbitration mechanism to grant access to the shared bus to one of possibly several devices requesting access at any particular time. The arbitration mechanism will effectively serialize transactions with the main memory and the snoopy protocol utilizes the serialization to impose a rule that only one cache at a time has permission to modify a data block.
After modification of the data block in the one cache, the main memory does not contain a valid copy of the data until it is updated by the cache having the written-to block, as described above. In accordance with the snoopy protocol, the copy of the written-to data block in the one cache is substituted for the main memory copy whenever another cache requests that data block prior to the update of the main memory.
An ownership model of the snoopy protocol includes the concept of "ownership" of a data block. A device must first request and obtain ownership of a data block in its cache before it can write to that data block. At most, one device can own a data block at any one time and the owner always has the valid copy of that data block. Moreover, the owner must update the main memory before it relinquishes ownership of the block to assure coherency of the data throughout the multi-processor system.
In the following description, ownership is defined as the right to modify (or write to) a data block. When no device of the system owns a data block, it is said to be owned by the main memory and copies of the data block may be "shared" by any of the devices of the system. A shared mode means the device has read-only access to a shared copy of a data block residing in its cache. Since the main memory owns the data block in the shared mode, all shared copies exactly match the main memory and are, hence, correct copies. Once any one device other than main memory obtains ownership of a data block, no other device may share the block and all copies of the data block which are shared are invalidated at the time ownership is obtained by the one device.
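The ownership rules described above can be sketched as a small simulation. All names below (Cache, SharedBus, read_only, read_for_ownership, demo_ownership) are hypothetical. As a simplifying assumption, the owner in this sketch always updates main memory and drops its copy before another cache is served, rather than supplying the block directly in place of the main memory copy; bus arbitration is modeled simply by running one method call at a time.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        # block address -> (state, value); state is "shared" or "owned"
        self.lines = {}


class SharedBus:
    """Serializes transactions among the attached caches (toy model)."""

    def __init__(self, caches):
        self.caches = caches
        self.memory = {}  # block address -> value held by main memory

    def _owner(self, addr):
        for cache in self.caches:
            if cache.lines.get(addr, (None, None))[0] == "owned":
                return cache
        return None  # no cache owns the block, so main memory owns it

    def _write_back(self, addr):
        owner = self._owner(addr)
        if owner is not None:
            # The owner updates main memory before relinquishing ownership.
            self.memory[addr] = owner.lines[addr][1]
            del owner.lines[addr]

    def read_only(self, requester, addr):
        self._write_back(addr)
        requester.lines[addr] = ("shared", self.memory.get(addr, 0))

    def read_for_ownership(self, requester, addr):
        self._write_back(addr)
        for cache in self.caches:
            if cache is not requester:
                cache.lines.pop(addr, None)  # invalidate shared copies
        requester.lines[addr] = ("owned", self.memory.get(addr, 0))

    def write(self, requester, addr, value):
        state, _ = requester.lines.get(addr, (None, None))
        if state != "owned":
            raise PermissionError("a write requires ownership of the block")
        requester.lines[addr] = ("owned", value)


def demo_ownership():
    a, b = Cache("A"), Cache("B")
    bus = SharedBus([a, b])
    bus.memory[0x100] = 7
    bus.read_only(a, 0x100)
    bus.read_only(b, 0x100)           # both caches hold shared copies
    bus.read_for_ownership(a, 0x100)  # B's shared copy is invalidated
    bus.write(a, 0x100, 8)
    b_invalidated = 0x100 not in b.lines
    bus.read_only(b, 0x100)           # A writes back; B receives the latest value
    return b_invalidated, b.lines[0x100], bus.memory[0x100]
```

The sketch enforces that at most one cache owns a block at any time, that shared copies always match main memory, and that a write attempted on a shared copy is rejected.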
In some coherency protocols, there are two kinds of read commands that support the protocol. The first command (hereinafter a "read-only") requests a read-only copy of a shared data block from memory. This copy cannot be modified. The second command (hereinafter a "read-for-ownership") requests a copy of a data block from memory that may be written or modified.
When a data block has been originally requested with the read-only command, and the central processing unit subsequently wants to write or modify the data block, the processing unit must re-request the data block with the read-for-ownership command. The processing unit can only modify the data block once it gains ownership of the data block.
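The re-request behavior can be illustrated from the point of view of a single cache line. This is a minimal sketch under stated assumptions: the names CacheLine, access and demo_upgrade are hypothetical, and the three states ("invalid", "shared", "owned") are a simplification introduced for illustration only.

```python
class CacheLine:
    def __init__(self):
        self.state = "invalid"  # "invalid", "shared" (read-only) or "owned"


def access(line, is_write, commands):
    """Append to `commands` the bus command, if any, needed for this access."""
    if is_write:
        if line.state != "owned":
            # A read-only copy cannot be modified; the block is re-requested.
            commands.append("read-for-ownership")
            line.state = "owned"
    elif line.state == "invalid":
        commands.append("read-only")
        line.state = "shared"


def demo_upgrade():
    line, commands = CacheLine(), []
    access(line, is_write=False, commands=commands)  # load
    access(line, is_write=True, commands=commands)   # store to the same block
    access(line, is_write=True, commands=commands)   # already owned: no command
    return commands
```

In this sketch, a load followed by a store to the same block issues two read commands, and further stores to the owned block issue none.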
Most vector architectures (and some scalar architectures) use a load/store execution model in which data must first be read ("loaded") from memory or the cache into registers in the central processing unit. The data is manipulated in these registers and then written back to memory ("stored").
Many vectorizable algorithms involve the reading of a vector from locations in memory, modification of the vector, and writing the modified vector back to the same memory locations. The load and store used to read and write the vector are separate operations. This makes it difficult for the hardware to generate a read-for-ownership command for the vector load. As a result, a vector algorithm as described above causes two reads for each vector. One read of data is done with the read-only command while another read is done with the read-for-ownership command in order to gain ownership of each block of the vector as it is written back to memory. This causes a significant performance penalty in the execution of such algorithms.
The generation of two reads is avoidable by reading all vector data blocks initially, at the time of the load operation, with the read-for-ownership command. This assumes that the data will be written later for all data blocks. However, reading for ownership obligates the reading central processing unit to write the data back to main memory at some later time, even if the data is not modified. Since a significant amount of vector data is never modified, reading all data with read-for-ownership commands trades the problem of increased read traffic for the problem of increased write-back traffic.
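The trade-off between the two approaches can be made concrete by counting transactions. The following is a rough sketch only: the function name vector_traffic and its parameters are hypothetical, each data block is assumed to be fetched at most once per command type, and cache capacity effects are ignored.

```python
def vector_traffic(num_blocks, modified_blocks, load_with_ownership):
    """Count memory transactions for a vector load followed by a partial store.

    num_blocks: data blocks touched by the vector load
    modified_blocks: blocks later written by a vector store
    load_with_ownership: True if the load uses the read-for-ownership command
    """
    if not 0 <= modified_blocks <= num_blocks:
        raise ValueError("modified_blocks must be between 0 and num_blocks")
    if load_with_ownership:
        reads = num_blocks        # one read-for-ownership per block
        write_backs = num_blocks  # every owned block must be written back
    else:
        # One read-only per block, then a second read (for ownership)
        # of each block that is written.
        reads = num_blocks + modified_blocks
        write_backs = modified_blocks
    return {"reads": reads, "write_backs": write_backs}
```

For a vector of 8 blocks of which 2 are modified, this model gives 10 reads and 2 write-backs when loading with the read-only command, versus 8 reads and 8 write-backs when loading with the read-for-ownership command. When all 8 blocks are modified, the counts are 16 reads and 8 write-backs versus 8 reads and 8 write-backs.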