Modern computer systems utilize various technologies and architectural features to achieve high performance operation. These technologies and architectural features include reduced instruction set computers, high speed cache memories and multiprocessor systems. Innovative arrangements of high performance components embodying one or more of the above can often result in significant improvements in the capabilities and processing power of a computer system.
Reduced instruction set computer (RISC) technology represents a "back to basics" approach to semiconductor chip design. An instruction set comprises a set of basic commands for fundamental computer operations, such as the addition of two data values to obtain a result. The instructions of an instruction set are typically embedded or hard wired into the circuitry of the chip embodying the central processing unit of the computer, and the various statements and commands of an application program running on the computer are each decoded into a relevant instruction or set of instructions of the instruction set for execution.
LOAD, ADD and STORE are examples of basic instructions that can be included in a computer's instruction set. Such instructions may be used to control, for example, the movement of data from memory to general purpose registers, addition of the data in the registers by the arithmetic and logic unit of the central processing unit, and return of the result to the memory for storing. In recent years, with significant advances in the miniaturization of silicon chips, chip designers have etched more and more circuits into the chip circuitry, so that instruction sets have grown to include hundreds of instructions capable of executing, via hard wired circuitry, sophisticated and complex mathematical and logical operations.
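By way of illustration only, the LOAD, ADD and STORE sequence described above can be sketched as a small interpreter. The tuple encoding, the register names and the `run` function are hypothetical and are not taken from any particular instruction set.

```python
# Hypothetical three-instruction machine; the encoding is illustrative only.
def run(program, memory):
    """Execute a list of (opcode, operands...) tuples against a memory dict."""
    registers = {}
    for instruction in program:
        opcode = instruction[0]
        if opcode == "LOAD":        # LOAD reg, addr: memory -> register
            _, reg, addr = instruction
            registers[reg] = memory[addr]
        elif opcode == "ADD":       # ADD dst, src1, src2: addition on registers
            _, dst, src1, src2 = instruction
            registers[dst] = registers[src1] + registers[src2]
        elif opcode == "STORE":     # STORE reg, addr: register -> memory
            _, reg, addr = instruction
            memory[addr] = registers[reg]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


memory = {0x10: 7, 0x14: 5, 0x18: 0}
program = [
    ("LOAD", "r1", 0x10),       # move first operand into a register
    ("LOAD", "r2", 0x14),       # move second operand into a register
    ("ADD", "r3", "r1", "r2"),  # add the two registers
    ("STORE", "r3", 0x18),      # return the result to memory
]
result = run(program, memory)
```

After the program runs, the sum of the two operands is held at address 0x18 of the illustrative memory.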
A problem with the proliferation of instructions included in an instruction set is that the increasing complexity of the circuitry required to implement a large number of instructions has resulted in a slowdown in the processing speed of the computer. Moreover, it has been determined that a relatively small percentage of the instructions of the instruction set perform a large percentage of the processing tasks of the computer. Thus, many of the instructions have become "expensive" options, whose relatively infrequent use does not make up for the slowdown caused by large instruction sets.
The objective of a RISC design is to identify the most frequently used instructions of the instruction set and delete the remaining instructions from the set. A chip can then be implemented with a reduced, but optimal number of instructions to simplify the circuitry of the chip for increased speed of execution for each instruction. While a complex operation previously performed by a single instruction may now have to be executed via several more basic instructions, each of those basic instructions can be executed at a higher speed than was possible before reduction of the instruction set. More significantly, when the instructions retained in the instruction set are carefully selected from among those instructions performing the bulk of the processing within the computer, the RISC system will achieve a significant increase in its overall speed of operation since that entire bulk of processing will be performed at increased speed.
By way of example, in some "large" instruction set systems, twenty percent of the instructions were performing eighty percent of the processing work. Thus a RISC system comprising the twenty percent of the instructions would achieve significantly higher speeds of operation during the performance of eighty percent of the workload.
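The effect of the eighty/twenty split on overall speed can be estimated with a short calculation. The speedup factor of 2 for the retained instructions and the slowdown factor of 1.5 for operations that must now be composed from several basic instructions are assumed values chosen for illustration; they do not come from any measurement.

```python
def overall_speedup(common_fraction, common_speedup, rare_slowdown):
    """Estimate overall speedup when the common work runs faster and the
    remaining work runs slower. All arguments are illustrative assumptions."""
    # Execution time is normalized so that the original system takes 1.0.
    new_time = (common_fraction / common_speedup
                + (1.0 - common_fraction) * rare_slowdown)
    return 1.0 / new_time


# 80% of the work runs 2x faster; the other 20% runs 1.5x slower (assumed).
speedup = overall_speedup(common_fraction=0.8, common_speedup=2.0, rare_slowdown=1.5)
```

Under these assumed figures the normalized execution time falls from 1.0 to about 0.7, so the overall gain remains positive even though the infrequent operations become slower.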
The high performance capabilities achieved in a RISC computer are further enhanced when a plurality of such RISC computers is arranged in a multiprocessor system utilizing cache memories. A multiprocessor system can comprise, e.g., a plurality of RISC computers, an I/O device and a main memory module or modules, all coupled to one another by a high performance backplane bus. The RISC computers can be utilized to perform co-operative or parallel processing as well as multi-tasking among them for execution of several applications running simultaneously, to thereby achieve dramatically improved processing power. The capabilities of the system can be further enhanced by providing a cache memory at each one of the RISC computers in the system.
A cache memory comprises a relatively small, yet relatively fast memory device arranged in close physical proximity to a processor. The utilization of cache memories is based upon the principle of locality. It has been found, for example, that when a processor accesses a location in memory, there is a high probability that the processor will continue to access memory locations surrounding the accessed location for at least a certain period of time. Thus, a preselected data block of a large, relatively slow access time memory, such as a main memory module coupled to the processor via a bus, is fetched from the main memory and stored in the relatively fast access cache memory. Accordingly, as long as the processor continues to access data from the cache memory, the overall speed of operation of the processor is maintained at a level significantly higher than would be possible if the processor had to arbitrate for control of the bus and then perform a memory read or write operation, with the main memory module, for each data access.
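The principle of locality can be illustrated with a minimal direct-mapped cache model that fetches a whole block on each miss. The class name, the block size of eight addresses and the four-line capacity are assumptions made for this sketch only.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that counts hits and misses (illustrative only)."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines   # block number held by each line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size   # block containing the address
        index = block % self.num_lines       # line the block maps to
        if self.tags[index] == block:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = block         # model fetching the whole block


cache = DirectMappedCache(num_lines=4, block_size=8)
for address in range(64):                    # sequential, highly local accesses
    cache.access(address)
```

With sequential accesses, only the first access to each block misses; the remaining accesses to that block are served from the cache without a main memory fetch.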
While the above described cached, multi-processor RISC computer system represents a state-of-the-art model for a high performance computer system, the art has yet to achieve an optimal level of performance efficiency.
One problem associated with multiprocessor systems having a cache memory at each processor of the system is cache coherency. In a multiprocessor system, it is necessary that the system store a single, correct copy of data being processed by the various processors of the system. Thus, when a processor writes to a particular data item stored in its cache, that copy of the data item becomes the latest correct value for the data item. The corresponding data item stored in main memory, as well as copies of the data item stored in other caches of the system, becomes outdated or invalid.
In a write back cache scheme, the data item in main memory is not updated until the processor requires the corresponding cache location to store another data item. Accordingly, the cached data item that has been modified by the processor write remains the latest copy of the data item until the main memory is updated. It is, therefore, necessary to implement a scheme to monitor read and write transactions to make certain that the latest copy of a particular data item is properly identified whenever it is required for use by a processor.
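The write back behavior described above can be sketched as follows. The `WriteBackCache` class, its one-item-per-line layout and the use of a dict as main memory are simplifying assumptions, not a description of any particular hardware.

```python
class WriteBackCache:
    """Toy write back cache over a dict acting as main memory (illustrative)."""

    def __init__(self, main_memory, num_lines):
        self.main_memory = main_memory
        self.num_lines = num_lines
        self.lines = {}   # line index -> [address, value, dirty]

    def _evict(self, index):
        line = self.lines.pop(index, None)
        if line is not None and line[2]:
            # Main memory is updated only when a dirty line is displaced.
            self.main_memory[line[0]] = line[1]

    def write(self, address, value):
        index = address % self.num_lines
        line = self.lines.get(index)
        if line is None or line[0] != address:
            self._evict(index)
        self.lines[index] = [address, value, True]

    def read(self, address):
        index = address % self.num_lines
        line = self.lines.get(index)
        if line is None or line[0] != address:
            self._evict(index)
            self.lines[index] = [address, self.main_memory[address], False]
        return self.lines[index][1]


main_memory = {0: 1, 4: 2}
cache = WriteBackCache(main_memory, num_lines=4)
cache.write(0, 99)
stale = main_memory[0]      # main memory still holds the old value
cache.read(4)               # address 4 maps to the same line and displaces address 0
updated = main_memory[0]    # the modified value has now been written back
```

Between the write and the displacement, the cached copy is the only current copy of the data item, which is why read and write transactions must be monitored.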
One known method to provide the necessary coherency between the various cache memories and the main memory of the computer system is to implement a SNOOPING bus protocol wherein a bus interface of each processor or other component in the multiprocessor computer system monitors the system backplane bus for bus activity involving addresses of data items that are currently stored in the processor's cache. Status bits are maintained in a TAG store associated with each cache to indicate the status of each data item currently stored in the cache. The three possible status bits associated with a particular data item stored in a cache memory can be, e.g., the following:
SHARED--If more than one cache in the system contains a copy of the data item. A cache entry will transition into this state if a different processor caches the same data item. That is, if a first interface, when SNOOPING on the system bus, determines that another cache on the bus is allocating a location for a data item that is already stored in the cache associated with the first interface, the first interface notifies the other interface by asserting a SHARED signal on the system bus, signaling the other interface to allocate the location in the shared state. When this occurs, the first interface will also update the state of its copy of the data item to indicate that it is now in the shared state.
DIRTY--A cache entry is dirty if the data item held in that entry has been updated more recently than main memory. Thus, when a processor writes to a location in its cache, it sets the DIRTY bit to indicate that the cached copy is now the latest copy of the data item. A broadcast of each write is initiated whenever the SHARED bit is asserted.
VALID--If the cache entry has a copy of a valid data item in it. In other words, the stored data item is coherent with the latest version of the data item, as may have been written by one of the processors of the computer system.
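The three status bits and the transition into the shared state can be modeled with a small sketch. The `Status` flag values, the `SnoopingCache` class and its method names are hypothetical; the SHARED bus signal is modeled simply as a returned boolean.

```python
from enum import Flag


class Status(Flag):
    """Status bits kept in the TAG store for each cached data item."""
    VALID = 1
    SHARED = 2
    DIRTY = 4


class SnoopingCache:
    """Toy model of one cache and its bus interface (illustrative only)."""

    def __init__(self):
        self.tag_store = {}   # address -> Status

    def allocate(self, address, shared_signal):
        # Allocate in the shared state if another interface asserted SHARED.
        status = Status.VALID
        if shared_signal:
            status |= Status.SHARED
        self.tag_store[address] = status

    def snoop_allocate(self, address):
        """Another cache is allocating `address`; return True to model
        asserting the SHARED signal on the bus."""
        status = self.tag_store.get(address)
        if status is not None and Status.VALID in status:
            self.tag_store[address] = status | Status.SHARED
            return True
        return False

    def write(self, address):
        """Mark the local copy DIRTY; return True if the write must be broadcast."""
        self.tag_store[address] |= Status.DIRTY
        return Status.SHARED in self.tag_store[address]


cache_a = SnoopingCache()
cache_b = SnoopingCache()
cache_a.allocate(0x40, shared_signal=False)
shared_signal = cache_a.snoop_allocate(0x40)   # cache_b begins allocating 0x40
cache_b.allocate(0x40, shared_signal)
must_broadcast = cache_a.write(0x40)
```

In this sketch both caches end up holding the data item in the shared state, so a subsequent write by either processor is broadcast.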
In accordance with known SNOOPING bus protocols, when a processor writes to a data item in its cache and the data item is in the VALID, SHARED state, a write for the data item is broadcast on the system bus. Each processor having a copy of the SHARED data item in a VALID state must decide whether to accept the write from the bus to update its copy of the cached data item, or to change the state of its copy of the data item to NOT VALID.
Typically, the interface monitoring the backplane bus in accordance with the SNOOPING protocol ascertains whether a particular data item relating to a broadcast write is present in the cache memory system associated with the interface. This is accomplished by comparing the address for the data item of the broadcast with address TAGS for the data items stored in the cache. The address TAGS for data items currently in the cache are stored in the TAG store. The interface also probes the status bits of a data item in the cache having a TAG that matches the address TAG of the data item relating to the broadcast write to determine if a VALID copy is present in the cache.
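The probe of the TAG store on a broadcast write, followed by the choice between updating and invalidating the local copy, can be sketched as below. The function name, the dict-based TAG and data stores and the `accept_update` policy flag are assumptions introduced for illustration.

```python
def snoop_broadcast_write(tag_store, data_store, address, value, accept_update):
    """Handle a broadcast write seen on the bus (illustrative sketch).

    tag_store maps an address TAG to its status bits; data_store holds the
    cached values. accept_update selects between the two permitted responses.
    """
    entry = tag_store.get(address)            # compare address against stored TAGs
    if entry is None or not entry["valid"]:   # no VALID copy in this cache
        return "ignored"
    if accept_update:
        data_store[address] = value           # accept the write from the bus
        return "updated"
    entry["valid"] = False                    # change the copy to NOT VALID
    return "invalidated"


tag_store = {0x80: {"valid": True}}
data_store = {0x80: 1}
first = snoop_broadcast_write(tag_store, data_store, 0x80, 2, accept_update=True)
second = snoop_broadcast_write(tag_store, data_store, 0x80, 3, accept_update=False)
third = snoop_broadcast_write(tag_store, data_store, 0x80, 4, accept_update=True)
```

Once the copy has been marked NOT VALID, later broadcast writes to the same address are ignored by this cache.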
For simplification of data paths and fast and efficient operation, a duplicate TAG store can be arranged in close proximity to the interface. For example, in a hierarchical cache arrangement, a primary cache system is placed in close proximity to a processor. A backup cache is then arranged intermediate the primary cache and the interface. The primary cache can store a subset of the data items in the backup cache so that when a data item required by the processor is not presently in the primary cache, the backup cache is probed for the data item before a fetch from main memory. If the data item is in the backup cache, it is loaded into the primary cache and a main memory fetch is avoided.
The hierarchical cache arrangement permits the implementation of a smaller and, accordingly, relatively faster primary cache for primary use by the processor, yet reduces the number of main memory fetches by storing a superset of the data of the primary cache in the backup cache. The backup cache, which is often larger and therefore relatively slower than the primary cache, can return a data item, if present in the backup cache, much faster than a main memory fetch over the backplane bus. Thus, the hierarchical cache arrangement optimizes system performance by utilizing a relatively fast memory for primary use and providing an intermediate superset of data items which can be probed before resorting to a main memory fetch.
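The lookup order in a hierarchical cache arrangement can be sketched as follows. The `read` function and the dict-based caches are hypothetical, and eviction is omitted for brevity, so the sketch shows only the order in which the levels are probed.

```python
def read(address, primary, backup, main_memory, stats):
    """Probe the primary cache, then the backup cache, then main memory."""
    if address in primary:
        stats["primary_hits"] += 1
    elif address in backup:
        stats["backup_hits"] += 1
        primary[address] = backup[address]       # load into the primary cache
    else:
        stats["memory_fetches"] += 1
        backup[address] = main_memory[address]   # backup holds a superset
        primary[address] = backup[address]
    return primary[address]


main_memory = {0x100: 42}
primary, backup = {}, {}
stats = {"primary_hits": 0, "backup_hits": 0, "memory_fetches": 0}

value = read(0x100, primary, backup, main_memory, stats)   # main memory fetch
primary.clear()                                            # model a primary eviction
value = read(0x100, primary, backup, main_memory, stats)   # served by the backup cache
value = read(0x100, primary, backup, main_memory, stats)   # served by the primary cache
```

Only the first access reaches main memory; the access following the modeled eviction is satisfied from the backup cache.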
In a hierarchical arrangement, the interface must be able to probe both the primary cache and the backup cache for VALID copies of data items relating to broadcast writes. The use of a duplicate TAG store, located at the interface, for the TAG information of the data items stored in the primary cache, facilitates an efficient probe of all cache locations within the hierarchical cache memory system. However, as should be understood, the duplicate TAG store requires a certain amount of memory space, particularly when both TAGs and status bits are duplicated.
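The memory space consumed by a duplicate TAG store can be estimated with simple arithmetic. The figures used here (1024 entries, 20-bit TAGs and 3 status bits per entry) are assumed values chosen for illustration only.

```python
def tag_store_bits(num_entries, tag_bits, status_bits=0):
    """Return the number of bits needed to hold one copy of the TAG store."""
    return num_entries * (tag_bits + status_bits)


# Assumed geometry: 1024 entries, 20-bit TAGs, 3 status bits per entry.
tags_only = tag_store_bits(1024, 20)
tags_and_status = tag_store_bits(1024, 20, status_bits=3)
extra_bits = tags_and_status - tags_only
```

Under these assumptions, duplicating the status bits along with the TAGs adds 3072 bits to the duplicate store.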