Modern computer systems utilize various technologies and architectural features to achieve high performance operation. These technologies and architectural features include reduced instruction set computers, high speed cache memories and multiprocessor systems. Innovative arrangements of high performance components embodying one or more of the above can often result in significant improvements in the capabilities and processing power of a computer system.
A reduced instruction set computer (RISC) represents a "back to basics" approach to semiconductor chip design. An instruction set comprises a set of basic commands for fundamental computer operations, such as the addition of two data values to obtain a result. The instructions of an instruction set are typically embedded or hard wired into the circuitry of the chip embodying the central processing unit of the computer, and the various statements and commands of an application program running on the computer are each decoded into a relevant instruction or set of instructions of the instruction set for execution.
LOAD, ADD and STORE are examples of basic instructions that can be included in a computer's instruction set. Such instructions may be used to control, for example, the movement of data from memory to general purpose registers, addition of the data in the registers by the arithmetic and logic unit of the central processing unit, and return of the result to the memory for storing. In recent years, with significant advances in the miniaturization of silicon chips, chip designers began to etch more and more circuits into the chip circuitry so that instruction sets grew to include hundreds of instructions capable of executing, via hard wired circuitry, sophisticated and complex mathematical and logical operations.
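By way of illustration only, the LOAD, ADD and STORE sequence described above can be sketched as a small interpreter. The tuple encoding, the register names (r1, r2, r3) and the memory addresses are hypothetical and are not taken from any particular instruction set.

```python
# Hypothetical three-instruction machine; the encoding is illustrative only.
def run(program, memory):
    """Execute a list of (opcode, operands...) tuples against a memory dict."""
    registers = {}
    for op, *args in program:
        if op == "LOAD":      # LOAD reg, addr: move data from memory to a register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "ADD":     # ADD dst, src1, src2: ALU addition on register contents
            dst, a, b = args
            registers[dst] = registers[a] + registers[b]
        elif op == "STORE":   # STORE reg, addr: return the result to memory
            reg, addr = args
            memory[addr] = registers[reg]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


memory = {0x10: 7, 0x14: 5, 0x18: 0}
program = [
    ("LOAD", "r1", 0x10),
    ("LOAD", "r2", 0x14),
    ("ADD", "r3", "r1", "r2"),
    ("STORE", "r3", 0x18),
]
result = run(program, memory)  # the sum of the two operands ends up at 0x18
```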
A problem with the proliferation of instructions in an instruction set is that the increasing complexity of the circuitry required to implement a large number of instructions resulted in a slowdown in the processing speed of the computer. Moreover, it was determined that a relatively small percentage of the instructions of the instruction set were performing a large percentage of the processing tasks of the computer. Thus, many of the instructions became "expensive" options, whose relatively infrequent use did not make up for the slowdown caused by large instruction sets.
The objective of a RISC design is to identify the most frequently used instructions of the instruction set and delete the remaining instructions from the set. A chip can then be implemented with a reduced, but optimal number of instructions to simplify the circuitry of the chip for increased speed of execution for each instruction. While a complex operation previously performed by a single instruction may now have to be executed via several more basic instructions, each of those basic instructions can be executed at a higher speed than was possible before reduction of the instruction set. More significantly, when the instructions retained in the instruction set are carefully selected from among those instructions performing the bulk of the processing within the computer, the RISC system will achieve a significant increase in its overall speed of operation since that entire bulk of processing will be performed at increased speed.
By way of example, in some "large" instruction set systems, twenty percent of the instructions were performing eighty percent of the processing work. Thus a RISC system comprising the twenty percent of the instructions would achieve significantly higher speeds of operation during the performance of eighty percent of the workload.
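The trade-off in the above example can be worked through numerically. The following is a rough sketch only: the speedup of 2.0 for the retained instructions and the slowdown of 1.5 for operations that must now be built from several basic instructions are assumed figures chosen for illustration, not measurements.

```python
def relative_time(common_fraction, common_speedup, rare_slowdown):
    """Execution time relative to the original machine (1.0 = unchanged).

    common_fraction: share of the workload handled by the retained instructions
    common_speedup:  assumed factor by which those instructions run faster
    rare_slowdown:   assumed factor by which the remaining work runs slower,
                     since it is now executed via several basic instructions
    """
    common = common_fraction / common_speedup
    rare = (1.0 - common_fraction) * rare_slowdown
    return common + rare


# Eighty percent of the work at twice the speed, twenty percent 1.5x slower:
# 0.8 / 2.0 + 0.2 * 1.5, or roughly 0.7 of the original execution time.
t = relative_time(0.8, 2.0, 1.5)
```

Under these assumed figures the net result is still a gain, because the bulk of the workload benefits from the faster basic instructions.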
The high performance capabilities achieved in a RISC computer are further enhanced when a plurality of such RISC computers is arranged in a multiprocessor system utilizing cache memories. A multiprocessor system can comprise, e.g., a plurality of RISC computers, an I/O device and a main memory module or modules, all coupled to one another by a high performance backplane bus. The RISC computers can be utilized to perform co-operative or parallel processing as well as multi-tasking among them for execution of several applications running simultaneously, to thereby achieve dramatically improved processing power. The capabilities of the system can be further enhanced by providing a cache memory at each one of the RISC computers in the system.
A cache memory comprises a relatively small, yet relatively fast memory device arranged in close physical proximity to a processor. The utilization of cache memories is based upon the principle of locality. It has been found, for example, that when a processor accesses a location in memory, there is a high probability that the processor will continue to access memory locations surrounding the accessed location for at least a certain period of time. Thus, a preselected data block of a large, relatively slow access time memory, such as a main memory module coupled to the processor via a bus, is fetched from the main memory and stored in the relatively fast access cache memory. Accordingly, as long as the processor continues to access data from the cache memory, the overall speed of operation of the processor is maintained at a level significantly higher than would be possible if the processor had to arbitrate for control of the bus and then perform a memory read or write operation, with the main memory module, for each data access.
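The effect of locality on block fetches can be sketched with a toy model. The class below is a hypothetical direct-mapped cache; the line count, the block size and the sequential access pattern are assumptions chosen only to make the hit and miss counts easy to follow.

```python
class BlockCache:
    """Toy direct-mapped cache that fetches whole blocks from main memory."""

    def __init__(self, num_lines=4, block_size=8):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = {}   # line index -> number of the block currently held
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_lines
        if self.lines.get(index) == block:
            self.hits += 1            # served from the fast cache
        else:
            self.misses += 1          # requires a bus transaction to main memory
            self.lines[index] = block # the whole block is brought into the cache


cache = BlockCache()
for address in range(64):             # sequential accesses exhibit strong locality
    cache.access(address)
```

With sequential accesses, only the first access to each block goes to main memory; every other access in the block is served from the cache.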
While the above described cached, multi-processor RISC computer system represents a state-of-the-art model for a high performance computer system, the art has yet to achieve an optimal level of performance efficiency.
One problem associated with multiprocessor systems having a cache memory at each processor of the system is cache coherency. In a multiprocessor system, it is necessary that the system store a single, correct copy of data being processed by the various processors of the system. Thus, when a processor writes to a particular data item stored in its cache, that copy of the data item becomes the latest correct value for the data item. The corresponding data item stored in main memory, as well as copies of the data item stored in other caches of the system, becomes outdated or invalid.
In a write back cache scheme, the data item in main memory is not updated until the processor requires the corresponding cache location to store another data item. Accordingly, the cached data item that has been modified by the processor write remains the latest copy of the data item until the main memory is updated. It is, therefore, necessary to implement a scheme to monitor read and write transactions to make certain that the latest copy of a particular data item is properly identified whenever it is required for use by a processor.
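The write back behavior described above can be sketched as follows. The class is a hypothetical, minimal model: a direct-mapped cache holding one data item per line, with main memory represented as a dictionary. It is not intended to reflect any particular hardware design.

```python
class WriteBackCache:
    """Toy write back cache: main memory is updated only on eviction."""

    def __init__(self, main_memory, num_lines=2):
        self.main_memory = main_memory
        self.num_lines = num_lines
        self.lines = {}   # line index -> {"addr", "value", "dirty"}

    def _evict(self, index):
        line = self.lines.pop(index, None)
        if line and line["dirty"]:
            # The modified item reaches main memory only at this point.
            self.main_memory[line["addr"]] = line["value"]

    def write(self, addr, value):
        index = addr % self.num_lines
        line = self.lines.get(index)
        if line is None or line["addr"] != addr:
            self._evict(index)
        self.lines[index] = {"addr": addr, "value": value, "dirty": True}

    def read(self, addr):
        index = addr % self.num_lines
        line = self.lines.get(index)
        if line is None or line["addr"] != addr:
            self._evict(index)
            line = {"addr": addr, "value": self.main_memory[addr], "dirty": False}
            self.lines[index] = line
        return line["value"]


main_memory = {0: 10, 1: 11, 2: 12}
cache = WriteBackCache(main_memory)
cache.write(0, 99)
stale = main_memory[0]     # main memory still holds the old value
cache.read(2)              # address 2 maps to the same line, forcing eviction
updated = main_memory[0]   # main memory now holds the written value
```

Between the write and the eviction, the cached copy is the only correct copy of the data item, which is why read and write transactions must be monitored.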
One known method of providing the necessary coherency between the various cache memories and the main memory of the computer system is to implement a SNOOPING bus protocol, wherein a bus interface of each processor or other component in the multiprocessor computer system monitors the system backplane bus for bus activity involving addresses of data items that are currently stored in the processor's cache. Status bits are maintained in a TAG store associated with each cache to indicate the status of each data item currently stored in the cache. The three status bits associated with a particular data item stored in a cache memory can be, e.g., the following:
SHARED--Set if more than one cache in the system contains a copy of the data item. A cache entry will transition into this state if a different processor caches the same data item. That is, if, when SNOOPING on the system bus, a first interface determines that another cache on the bus is allocating a location for a data item that is already stored in the cache associated with the first interface, the first interface notifies the other interface by asserting a SHARED signal on the system bus, signaling the second interface to allocate the location in the shared state. When this occurs, the first interface will also update the state of its copy of the data item to indicate that it is now in the shared state.
DIRTY--Set if the data item held in the cache entry has been updated more recently than main memory. Thus, when a processor writes to a location in its cache, it sets the DIRTY bit to indicate that it is now the latest copy of the data item. A broadcast of each write is initiated whenever the SHARED bit is asserted.
VALID--Set if the cache entry has a copy of a valid data item in it. In other words, the stored data item is coherent with the latest version of the data item, as may have been written by one of the processors of the computer system.
In accordance with known SNOOPING bus protocols, when a processor writes to a data item in its cache and the data item is in the VALID, SHARED state, a write for the data item is broadcast on the system bus. Each processor having a copy of the SHARED data item in a VALID state must decide whether to accept the write from the bus to update its copy of the cached data item, or to change the state of its copy of the data item to NOT VALID.
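The status bits and the broadcast write described above can be sketched as a small simulation. All class and attribute names below (Entry, Bus, SnoopingCache, accept_updates) are hypothetical, and the model is deliberately simplified: bus arbitration and main memory updates are omitted, and a newly allocated entry simply copies the value held by another cache when one exists.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: int
    valid: bool = True
    shared: bool = False
    dirty: bool = False


class Bus:
    def __init__(self):
        self.caches = []
        self.broadcast_writes = 0


class SnoopingCache:
    def __init__(self, bus, memory, accept_updates=True):
        self.bus, self.memory = bus, memory
        self.accept_updates = accept_updates  # True: update; False: invalidate
        self.entries = {}
        bus.caches.append(self)

    def _holds(self, addr):
        entry = self.entries.get(addr)
        return entry is not None and entry.valid

    def allocate(self, addr):
        """Allocate a location; snoopers already holding addr assert SHARED."""
        value, shared = self.memory[addr], False
        for other in self.bus.caches:
            if other is not self and other._holds(addr):
                other.entries[addr].shared = True  # snooper marks its own copy
                value = other.entries[addr].value  # simplification: take latest copy
                shared = True
        self.entries[addr] = Entry(value, shared=shared)

    def write(self, addr, value):
        entry = self.entries[addr]
        entry.value, entry.dirty = value, True
        if entry.valid and entry.shared:           # VALID, SHARED: broadcast the write
            self.bus.broadcast_writes += 1
            for other in self.bus.caches:
                if other is not self and other._holds(addr):
                    other.snoop_write(addr, value)

    def snoop_write(self, addr, value):
        entry = self.entries[addr]
        if self.accept_updates:
            entry.value = value                    # accept the write from the bus
        else:
            entry.valid = False                    # mark the copy NOT VALID


bus, memory = Bus(), {0x40: 1}
a = SnoopingCache(bus, memory, accept_updates=True)
b = SnoopingCache(bus, memory, accept_updates=False)
c = SnoopingCache(bus, memory)
for cache in (a, b, c):
    cache.allocate(0x40)
c.write(0x40, 7)   # one broadcast: a updates its copy, b invalidates its copy
```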
Where several processors are on the same system bus, as in a multiprocessor computer system, the migration of processes, i.e., jobs, from one processor to another increases the number of cache memory locations that are held in a SHARED state in the caches of the various processors in the computer system. Whenever a cache entry is held in a SHARED state, any write to that entry must be broadcast over the system bus in order to give each processor holding a copy of the data item an opportunity to update the copy in its cache.
Thus, as the number of cache entries in a SHARED state increases, an excessive number of broadcast writes over the system bus may occur resulting in an overall decrease in system performance.
One known approach to the above problem of excessive broadcast writes, due to a large number of cache entries in a SHARED state, is to implement an invalidate policy. In accordance with one known invalidate policy, all writes on the system bus cause any cache entry with a copy of that memory location to be marked NOT VALID. Cache entries marked NOT VALID need not be updated in the future, since they have been invalidated. This reduces the number of cache entries marked SHARED and VALID and, in turn, the number of broadcast writes required to maintain the shared cache entries.
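The reduction in broadcast writes can be illustrated with a simple count. The function below is a hypothetical sketch of one processor writing repeatedly to a single shared cache entry. It assumes that once every other copy has been marked NOT VALID, the writer's entry is no longer treated as SHARED, and that no other processor re-reads the data item in the meantime.

```python
def count_broadcasts(num_writes, num_sharers, policy):
    """Count broadcast writes for one writer repeatedly writing a shared entry.

    policy is "update" (snoopers accept each write) or "invalidate"
    (snoopers mark their copies NOT VALID on the first write they see).
    """
    if policy not in ("update", "invalidate"):
        raise ValueError(f"unknown policy: {policy}")
    sharers_valid = num_sharers        # other caches holding a VALID copy
    broadcasts = 0
    for _ in range(num_writes):
        if sharers_valid > 0:          # entry is SHARED: the write is broadcast
            broadcasts += 1
            if policy == "invalidate":
                sharers_valid = 0      # all other copies become NOT VALID
    return broadcasts


update_cost = count_broadcasts(100, 3, "update")
invalidate_cost = count_broadcasts(100, 3, "invalidate")
```

Under these assumptions the update policy broadcasts every write, while the invalidate policy broadcasts only the first. The count ignores the cost of any later re-fetch by a processor whose copy was invalidated.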
Generally, the above policy of simply invalidating a cache entry, when a write to the same memory location as contained in the cache entry occurs over the system bus, improves system performance when applied to caches that are associated with processors.
However, for operating components that simply move data around in the computer system, such as processors contained on an I/O subsystem, an update policy is more beneficial to overall system performance because of the characteristic use of the data contained in such caches. In accordance with known update policies implemented for caches controlled by processing elements which are simply movers of data, a cache which contains a copy of a memory location being written to over the system bus, accepts the new data and updates the copy contained within the cache. Thus, when one of the processors contained in the computer system needs to use the data, a current copy of the data will be resident and available in the cache of the processors which serve as data movers in the computer system. Accordingly, overall system performance is increased when such an update policy is implemented.
As described above, in the known systems, update vs. invalidate determinations are based solely on the state and/or design of the processing element that is performing the SNOOP on the system bus. While such designs provide a measure of control over update vs. invalidate decisions, they fail to consider the characteristic behavior of the processing element that initiated the bus write broadcast. This failure leads to a series of unnecessary cache updates and invalidations, resulting in reduced system performance.