1. Field of the Invention
This invention relates to the field of fault tolerance. More particularly, the present invention relates to maintaining data coherence in a fault tolerant computer system.
2. Description of the Related Art
In a computer system for a real-time application such as an on-demand video server, rapid recovery from the failure of an individual component is highly desirable in order to maximize the mean time between failures (MTBF) of the computer system. One method employed to increase system MTBF is the inclusion of redundant critical components such as memory controllers.
FIG. 1A is a block diagram of a computer system including a host processor 110, a primary memory controller 120, a backup memory controller 130 and a memory 140. Memory controllers 120, 130 include caches 125, 135, respectively. Host processor 110 is coupled to controllers 120, 130 via a system bus 190.
When data is transferred from host processor 110 to memory 140, duplicate copies of the data are maintained in caches 125, 135, so that should primary memory controller 120 fail during a data transfer, backup memory controller 130 can complete any outstanding data transfer to memory 140. Subsequently, backup memory controller 130 takes over control of memory 140 until the failed primary memory controller 120 is replaced. Duplication of the data in caches 125, 135 can be accomplished using several approaches.
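The failover behavior described above can be illustrated with a minimal sketch. It is a simplified software model and not the circuitry of FIG. 1A; the names MemoryController, receive, flush and write_with_failover are hypothetical, and the step that discards the standby controller's copy after a successful transfer is an assumption made only to keep the model tidy.

```python
class MemoryController:
    """Hypothetical model of a memory controller with a write cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}      # address -> data awaiting transfer to memory
        self.failed = False

    def receive(self, address, data):
        # Hold the data in the cache until it is transferred to memory.
        self.cache[address] = data

    def flush(self, memory):
        # Complete all outstanding transfers from the cache to memory.
        if self.failed:
            raise RuntimeError(f"{self.name} has failed")
        memory.update(self.cache)
        self.cache.clear()


def write_with_failover(primary, backup, memory, address, data,
                        primary_fails=False):
    """Duplicate the data in both caches, then let a working controller
    complete the transfer. Returns the name of the controller used."""
    primary.receive(address, data)
    backup.receive(address, data)

    # Simulate a failure of the primary controller during the transfer.
    primary.failed = primary_fails

    active = backup if primary.failed else primary
    standby = primary if active is backup else backup
    active.flush(memory)

    # Assumption: a working standby drops its duplicate once memory is updated.
    if not standby.failed:
        standby.cache.pop(address, None)
    return active.name
```

In this model, calling write_with_failover with primary_fails=True leaves the data in memory even though the primary controller never completes the transfer, because the backup controller's cache holds a duplicate copy.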
In one approach, as illustrated by FIG. 1A, a data packet is sent by processor 110 to primary memory controller 120 for eventual transfer to memory 140, followed by a duplicate data packet from processor 110 to backup memory controller 130. Disadvantages of this approach include the extra processing time and extra system bus utilization incurred by host processor 110 in sending the two consecutive data packets to controllers 120, 130.
FIG. 1B illustrates a second approach for maintaining duplicate data in caches 125, 135, which involves adding a dedicated data link 196 between caches 125, 135. In this approach, host processor 110 is responsible for sending a single copy of the data packet to primary memory controller 120. In turn, primary memory controller 120 is tasked with ensuring that a copy of the data packet is transferred from cache 125 to cache 135 of backup memory controller 130 before an acknowledgment is sent to host processor 110 indicating that the data in caches 125 and 135 is now coherent. However, one drawback of this approach is the extra cost of data link 196. Further, the time delay for first sending the data packet and then executing a cache-to-cache transfer is not an improvement over the first approach, in which consecutive duplicate data packets are transferred from host processor 110 to controllers 120, 130.
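The ordering constraint of this second approach, in which the acknowledgment is sent only after both caches hold the data packet, can be sketched as follows. This is an illustrative model only; Cache, primary_handle_write and the event strings are hypothetical names, and the dedicated data link 196 is represented simply as a direct copy between two dictionaries.

```python
class Cache:
    """Hypothetical stand-in for cache 125 or cache 135."""

    def __init__(self):
        self.lines = {}   # address -> data


def primary_handle_write(primary_cache, backup_cache, address, data):
    """Model the primary controller's handling of one data packet.

    Returns the ordered list of events, ending with the acknowledgment
    only if both caches hold identical data for the address.
    """
    events = []

    # Step 1: the single packet from the host lands in the primary cache.
    primary_cache.lines[address] = data
    events.append("primary cached")

    # Step 2: cache-to-cache transfer (stands in for data link 196).
    backup_cache.lines[address] = primary_cache.lines[address]
    events.append("backup cached")

    # Step 3: acknowledge the host only once the caches are coherent.
    if primary_cache.lines[address] == backup_cache.lines[address]:
        events.append("ack to host")
    return events
```

Because the acknowledgment is the last event, the host waits for two transfers in sequence, which is why this approach does not reduce the delay relative to sending two consecutive packets over the system bus.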
FIG. 1C illustrates a third and more expensive approach, which involves adding hardware to host processor 110 and dedicated connections 192, 194 between host processor 110 and controllers 120, 130, respectively, enabling host processor 110 to concurrently send duplicate data packets to both controllers 120, 130. In this example, since dedicated data paths 192, 194 provide independent connections between host processor 110 and controllers 120, 130, respectively, concurrent data packet transfers from processor 110 can be executed without incurring any time delay. The tradeoff in this approach is the extra hardware cost associated with duplicate data paths 192, 194.
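The tradeoffs among the three approaches can be summarized with a rough model. The sketch below is an assumption-laden illustration and not measured data: duplication_cost, its labels, and the unit transfer times t_bus and t_link are hypothetical, and each transfer is assumed to take a fixed time with no other overhead.

```python
def duplication_cost(approach, t_bus=1.0, t_link=1.0):
    """Return a rough cost summary for duplicating one data packet.

    t_bus  : assumed time for one host-to-controller transfer
    t_link : assumed time for one cache-to-cache transfer
    """
    if approach == "sequential":     # FIG. 1A: two consecutive bus transfers
        return {"latency": 2 * t_bus, "host_sends": 2, "extra_links": 0}
    if approach == "cache_link":     # FIG. 1B: bus transfer, then link 196
        return {"latency": t_bus + t_link, "host_sends": 1, "extra_links": 1}
    if approach == "dedicated":      # FIG. 1C: concurrent sends on 192, 194
        return {"latency": t_bus, "host_sends": 2, "extra_links": 2}
    raise ValueError(f"unknown approach: {approach}")


# Print one summary line per approach under the default unit times.
for name in ("sequential", "cache_link", "dedicated"):
    print(name, duplication_cost(name))
```

Under these assumed unit times, the first two approaches incur the same delay, while the third halves the delay at the price of two dedicated data paths.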
Hence, there is a need for an effective method of providing a fault-tolerant memory control system that unnecessarily burdens neither the host processor nor the memory controller(s), and that does so at minimal additional hardware cost.