1. Field of Invention
The present invention relates to the field of computer systems. More particularly, this invention relates to a system and method for enhancing the reliability of a computer system by combining a cache sync-flush engine with a replicated memory module. The cache sync-flush engine is the logic that facilitates the flushing operation from a cache to a memory complex.
2. Description of Related Art
A computer system typically includes a memory and a processor. The memory generally includes a main memory and a cache memory for storing data and instructions for the processor. The cache memories store blocks of data and/or instructions that are received from the main memory. Typically, instructions from the main memory that are used by the processor are stored in the instruction cache and the data for that particular instruction is stored in the data cache.
To execute a specific instruction in an application, the processor may issue read and write operations to the memory. During a read operation, the processor first checks its local cache for the address corresponding to the read operation. If the address is found, the processor retrieves the data from its local cache. If the address is not found, the processor searches for the data in the main memory. Once the data has been located, the processor makes a copy of the data from main memory and stores the copied data in its local cache. Since a read operation does not change the content of the data, the copied data in the cache is identical to the data in the main memory. The copying of data from the main memory results in several read-only copies of the same data existing in multiple caches. The cached copies of data are sometimes referred to as clean copies.
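The read path described above can be illustrated with a minimal sketch. The class names `MainMemory` and `Cache`, the dictionary-based storage, and the hit and miss counters are hypothetical and chosen only for illustration; they are not part of the described system.

```python
class MainMemory:
    """Hypothetical main memory: a mapping from address to data."""

    def __init__(self, contents):
        self.contents = dict(contents)


class Cache:
    """Hypothetical local cache that fills itself from main memory on a miss."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}   # address -> cached (clean) copy
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Check the local cache first.
        if address in self.lines:
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from main memory and keep a copy in the cache.
        self.misses += 1
        value = self.memory.contents[address]
        self.lines[address] = value
        return value


memory = MainMemory({0x10: 7})
cache_a = Cache(memory)
cache_b = Cache(memory)

first = cache_a.read(0x10)    # miss: copied from main memory
second = cache_a.read(0x10)   # hit: served from the local cache
third = cache_b.read(0x10)    # a second cache now holds its own clean copy
```

After these reads, both caches hold copies that are identical to the value in main memory, which is what the text calls clean copies.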
During a write operation, the processor first checks its local cache for the address corresponding to the write operation. If the address is found, the processor replaces the data in its local cache. If the address is not found, the processor searches for the data in the main memory. Once the data has been located, the processor retrieves the data from main memory, stores the data in its local cache, and invalidates all other cached copies of the data. The processor that retrieved the data is the owner of the data and has an exclusive and most recent copy of the data. This data may be modified when it is in the processor's local cache. The main memory now holds an obsolete value of the data.
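The write path can be sketched in the same spirit. The names `MainMemory`, `Cache`, `owned`, and `caches` are hypothetical. As a simplifying assumption not stated in the text, the sketch also invalidates other copies when the writer already holds a clean copy but does not yet own the data.

```python
class MainMemory:
    """Hypothetical main memory that also tracks the attached caches."""

    def __init__(self, contents):
        self.contents = dict(contents)
        self.caches = []


class Cache:
    """Hypothetical local cache with a simple ownership flag per address."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.owned = set()  # addresses for which this cache is the owner
        memory.caches.append(self)

    def read(self, address):
        if address not in self.lines:
            self.lines[address] = self.memory.contents[address]
        return self.lines[address]

    def write(self, address, value):
        if address not in self.owned:
            # Fetch the data on a miss, then invalidate every other copy
            # (assumption: also done when only a clean copy is held).
            if address not in self.lines:
                self.lines[address] = self.memory.contents[address]
            for other in self.memory.caches:
                if other is not self:
                    other.lines.pop(address, None)
                    other.owned.discard(address)
            self.owned.add(address)
        # The owner modifies its exclusive copy; main memory is not updated.
        self.lines[address] = value


memory = MainMemory({0x20: 1})
cache_a = Cache(memory)
cache_b = Cache(memory)

cache_b.read(0x20)        # cache_b holds a clean copy
cache_a.write(0x20, 99)   # cache_a becomes the owner; cache_b is invalidated
```

At this point only `cache_a` holds the most recent value, while main memory still holds the obsolete one. This is the state in which a failure of the owning processor or its cache loses data.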
A problem arises when the processor that owns the data, or the cache or main memory that holds the data, fails. These failures cause loss of the most recent value of the data and may significantly impact individuals and businesses. Furthermore, businesses can suffer significant monetary losses when processors and memories fail.
In order to avoid these failures, redundant systems have been developed. These systems are designed to have multiple redundancies to prevent or minimize the loss of information should the processor or memory fail. These redundant systems are also referred to as fault-tolerant systems. One type of redundant system duplicates the entire hardware system. That is, all the hardware components are mirrored such that the duplicate components perform the same functions as the main system but are transparent to the main system. Duplicating the hardware components is a practice that is used by many designers to further enhance the reliability of computer systems. For example, if the main computer system fails, the redundant computer system having the same hardware components continues to process the data, thus eliminating the loss of data and the disruption in processing the data. The redundant computer systems run directly in parallel and in sync with the main computer system. Hence, there are multiple processors and multiple memories executing the same instructions at the same time. These systems provide additional reliability, which minimizes the number of computer system failures. The duplication of all of the hardware components, however, significantly increases the costs associated with manufacturing the computer system.
It should therefore be appreciated that there remains a need for a computer system that can provide reliability equal to or better than that of prior systems without the cost of replicating entire hardware systems. The present invention fulfills this need.
The present invention is embodied in a computer system, and related method, for enhancing the reliability of a computer system by combining a cache sync-flush engine with a replicated memory module. Architecturally, the computer system includes a number of nodes coupled to a shared memory via an interconnect network. Each node has a number of processors and caches which are connected to a system control unit via a common bus. The shared memory has a number of replicated memory modules for storing identical copies of data.
The related method includes placing or issuing a "lock" command on the common bus. The lock protects or controls accesses to a number of memory locations in the memory modules designated by the programmer. At any point in time, only one processor can obtain the lock, and hence have access to the memory locations protected by the lock. Other processors may attempt to acquire or make a request for the same lock; however, they will fail until the processor that has the lock has released (i.e., "unlocked") the lock. The other processors will keep trying to get the lock. The processor that obtains the lock instructs the system control unit to begin logging or monitoring all subsequent memory addresses that appear on the common bus. After the processor gets the lock, it can start reading from and writing to the memory locations that are implemented as a number of replicated memory modules. A data value is then determined based on the data held by a majority of the replicated memory modules. The data value is transmitted to the cache of the processor. After the data is processed, an "unlock" command is transmitted from the processor to a system control unit that issues a write back request on the common bus that flushes the data value from the cache to the number of replicated memory modules.
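The sequence of lock, logged accesses, majority read, and flush on unlock can be sketched as follows. This is a minimal software model under stated assumptions, not the hardware described: the names `majority_value`, `ReplicatedMemory`, `SystemControlUnit`, `try_lock`, `observe`, and `unlock` are hypothetical, the cache is modeled as a plain dictionary, and a strict majority of the replicated copies is assumed to be required.

```python
import threading
from collections import Counter


def majority_value(copies):
    """Return the value held by a strict majority of the copies."""
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise ValueError("no majority among replicated copies")
    return value


class ReplicatedMemory:
    """Hypothetical shared memory built from identical memory modules."""

    def __init__(self, replicas, initial):
        self.modules = [dict(initial) for _ in range(replicas)]

    def read(self, address):
        return majority_value([m[address] for m in self.modules])

    def write(self, address, value):
        for module in self.modules:
            module[address] = value


class SystemControlUnit:
    """Hypothetical control unit: logs addresses while the lock is held."""

    def __init__(self, memory):
        self.memory = memory
        self.lock = threading.Lock()
        self.logged = set()

    def try_lock(self):
        # Non-blocking attempt; a caller that fails is expected to retry.
        acquired = self.lock.acquire(blocking=False)
        if acquired:
            self.logged.clear()   # start a fresh log for this lock holder
        return acquired

    def observe(self, address):
        # Record an address that appeared on the common bus.
        self.logged.add(address)

    def unlock(self, cache):
        # Flush every logged address from the cache to all replicas.
        for address in sorted(self.logged):
            if address in cache:
                self.memory.write(address, cache[address])
        self.logged.clear()
        self.lock.release()


memory = ReplicatedMemory(3, {0x40: 5})
memory.modules[1][0x40] = 999          # simulate one faulty module
scu = SystemControlUnit(memory)
cache = {}

got_lock = scu.try_lock()              # first requester obtains the lock
second_attempt = scu.try_lock()        # a competing request fails for now

scu.observe(0x40)
cache[0x40] = memory.read(0x40)        # majority vote masks the bad copy
cache[0x40] += 1                       # process the data in the cache
scu.unlock(cache)                      # write back to every replica
```

In this model the flush on unlock also overwrites the faulty copy, so all three modules agree again once the lock is released.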
Other features and advantages of the present invention will be apparent from the detailed description that follows.