Technical Field
The disclosure relates generally to high reliability, multiple computer systems and more particularly to high reliability, multiple computer systems in which write data is processed (compared or copied) outside of checkpoint operations.
Background Art
Currently, some high reliability computers use a process known as checkpointing to keep a second computer in software lockstep with a first computer. Periodically, the first computer is stopped and the Central Processing Unit (CPU) state and any changes to the first computer's memory since the last checkpoint are transferred to the second computer. In the event of a failure or unrecoverable error on the first computer, the second computer will continue execution from the last checkpoint. Through frequent checkpointing, a second computer can take over execution of a user's application with little noticeable impact to the user.
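The checkpoint and takeover sequence described above can be pictured with a minimal sketch. It is an illustration only, not the method of any particular system: the `Computer` class, its `changed` set, and the `checkpoint` function are hypothetical names, and the link between the two computers is modeled as a plain in-process copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Computer:
    """Toy model (assumption): registers and memory are plain dictionaries."""
    cpu_state: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    changed: set = field(default_factory=set)  # addresses written since last checkpoint

    def write(self, addr, value):
        self.memory[addr] = value
        self.changed.add(addr)


def checkpoint(first, second):
    """Transfer CPU state and changed memory from the first to the second computer.

    The first computer is assumed to be stopped while this runs.
    Returns the number of memory locations transferred.
    """
    second.cpu_state = copy.deepcopy(first.cpu_state)
    for addr in first.changed:
        second.memory[addr] = first.memory[addr]
    transferred = len(first.changed)
    first.changed.clear()
    return transferred
```

In this sketch, any write made by the first computer after the most recent call to `checkpoint` is not visible to the second computer, which is why the second computer resumes from the last checkpoint rather than from the point of failure.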
Memory controllers are included in computer CPUs to access a separate attached external system memory. In most high performance computer systems, the CPU includes an internal cache memory to cache a portion of the system memory and uses the internal cache memory for the majority of memory reads and writes. When the internal cache memory is full of changed data and the CPU needs to write additional changed data to the cache, the memory controller writes a copy of some of the cache content to external system memory.
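The cache behavior described above may be sketched as follows. This is a simplified model under stated assumptions: the `WriteBackCache` class is hypothetical, system memory is a dictionary, and the choice of which cache content to write out (least recently used) is an assumption, since the text does not specify a replacement policy.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy write-back cache; the least-recently-used policy is an assumption."""

    def __init__(self, capacity, system_memory):
        self.capacity = capacity
        self.system_memory = system_memory  # dict standing in for external memory
        self.lines = OrderedDict()          # addr -> (value, dirty flag)
        self.writebacks = []                # addresses written out to system memory

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            addr, (value, dirty) = self.lines.popitem(last=False)
            if dirty:
                # Only changed data needs to be copied to external memory.
                self.system_memory[addr] = value
                self.writebacks.append(addr)

    def write(self, addr, value):
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = (value, True)
        self.lines.move_to_end(addr)

    def read(self, addr):
        if addr not in self.lines:
            self._evict_if_full()
            self.lines[addr] = (self.system_memory.get(addr, 0), False)
        self.lines.move_to_end(addr)
        return self.lines[addr][0]
```

Note that in this model external memory only receives changed data when a cache line is evicted, so the external memory may lag behind the cache at any given moment.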
High reliability computers use mirrored memory. A computer may have memory configured to be in “mirror” mode. When memory is in mirror mode, the memory controller, which is responsible for reading the contents of external memory to the CPU and writing data from the CPU to the external memory, writes two copies of the data to two different memory locations, a primary and a secondary side of the mirror. When the memory controller is reading the data back into the CPU, it only needs to read one copy of the data from one memory location. If the data being read from the primary side has been corrupted and has uncorrectable errors, the memory controller reads the secondary location to get the other copy of the same data. Whenever the memory controller is performing a write operation (transaction), it writes a copy of the data to both the primary and secondary side of the mirror. The process of making two or more copies of data for enhanced reliability is referred to as mirroring and, for disk storage, as Redundant Array of Independent Disks level 1 (RAID 1). It is not necessary that the primary and secondary side of the mirror be on different physical memory devices.
FIG. 1 is a block diagram illustrating a prior art computer system with mirrored memory. Memory modules 100, 105, and 110 are the primary side of the memory in a computer system and memory modules 120, 125, and 130 are the secondary side of the memory. Other systems have a different number of memory modules. CPU 115 includes cores and cache memory 175 (as well as other components), a primary memory controller 135 coupled to the primary memory through interface 160, and a secondary memory controller 140 coupled to the secondary memory through interface 165. Different systems have different types and numbers of interfaces. Further, the primary and secondary memory controllers 135 and 140 could be two different memory controllers or two features of a single memory controller.
In mirroring, primary memory controller 135 and secondary memory controller 140 transfer the same data to the primary and secondary side of the memory so that the data is maintained in two copies in independent memory modules after each memory write operation. During a memory read operation 145, data is transferred from a memory module 100, 105, or 110 to primary memory controller 135. In the event that the data is determined to be correct, no further actions are necessary to complete the read operation. In the event that the data is determined to be corrupted, a read 170 may be performed by the secondary memory controller 140 from a memory module 120, 125, or 130 on the secondary side of the memory which contains a copy of the data stored on the primary side of the memory. This leads to higher reliability because even if data on the primary side of memory is corrupted, a copy may be read from the secondary side that is probably not corrupted.
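The mirrored write and the fallback read may be sketched as follows. The `MirroredMemory` class is hypothetical, and a CRC-32 checksum stored beside each entry stands in for the error detection that real memory controllers perform in hardware; that substitution is an assumption made only to keep the example runnable.

```python
import zlib


class MirroredMemory:
    """Toy mirror: every write goes to both sides, reads prefer the primary side."""

    def __init__(self):
        self.primary = {}    # addr -> (data, checksum)
        self.secondary = {}  # addr -> (data, checksum)

    @staticmethod
    def _pack(data):
        # CRC-32 is a stand-in (assumption) for hardware error detection.
        return (data, zlib.crc32(data))

    @staticmethod
    def _valid(entry):
        data, checksum = entry
        return zlib.crc32(data) == checksum

    def write(self, addr, data):
        # A write transaction stores a copy on both sides of the mirror.
        self.primary[addr] = self._pack(data)
        self.secondary[addr] = self._pack(data)

    def read(self, addr):
        # A read normally touches only the primary side.
        entry = self.primary[addr]
        if self._valid(entry):
            return entry[0]
        # Primary copy is corrupted: fall back to the secondary copy.
        entry = self.secondary[addr]
        if self._valid(entry):
            return entry[0]
        raise IOError("uncorrectable error on both sides of the mirror")
```

A read in this model fails only when both copies are corrupted, which mirrors the reliability argument made in the text.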
Checkpointing transfers or compares changed data between the first and the second computer. An interface such as InfiniBand, PCI-Express (PCIe), or a proprietary interface between the computers is used to transfer the CPU state and the system memory content during the checkpointing process. The first computer's CPU or Direct Memory Access (DMA) controller is usually used to transfer the contents of memory to the second computer. Various methods are used to save time transferring the content of memory from the first computer to the second computer. For example, a memory paging mechanism may set a “Dirty Bit” to indicate that a page of memory has been modified. During checkpointing, only the pages of memory with the Dirty Bit set will be transferred. A page could be 4 Kilobytes, 2 Megabytes, 1 Gigabyte or some other size. The DMA device or processor copies the entire page of memory that has been identified by a Dirty Bit regardless of whether the entire page has been changed or only a few bytes of data in the page have changed.
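The Dirty Bit approach may be sketched as follows, assuming a 4 Kilobyte page. The `PagedMemory` class and `checkpoint_dirty_pages` function are hypothetical, and the transfer between computers is modeled as a copy between two byte arrays rather than a real DMA or interconnect operation.

```python
PAGE_SIZE = 4096  # assumed page size; 2 Megabyte or 1 Gigabyte pages work the same way


class PagedMemory:
    """Toy memory with one dirty flag per page."""

    def __init__(self, num_pages):
        self.data = bytearray(num_pages * PAGE_SIZE)
        self.dirty = [False] * num_pages

    def write(self, addr, payload):
        """Write a non-empty payload that fits inside the memory."""
        self.data[addr:addr + len(payload)] = payload
        first_page = addr // PAGE_SIZE
        last_page = (addr + len(payload) - 1) // PAGE_SIZE
        for page in range(first_page, last_page + 1):
            self.dirty[page] = True


def checkpoint_dirty_pages(src, dst):
    """Copy every dirty page in full from src to dst; return bytes copied."""
    copied = 0
    for page, is_dirty in enumerate(src.dirty):
        if is_dirty:
            start = page * PAGE_SIZE
            dst.data[start:start + PAGE_SIZE] = src.data[start:start + PAGE_SIZE]
            src.dirty[page] = False
            copied += PAGE_SIZE
    return copied
```

In this sketch a three-byte write still causes a full 4096-byte page to be copied at the next checkpoint, which illustrates the inefficiency noted above.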
Checkpointing reduces computer performance. While the computer is performing the checkpointing task, it generally is not doing useful work for the user, so the user experiences reduced performance. There is always a tradeoff between the frequency of checkpoints, the complexity of the method used to transfer checkpoint data efficiently, and the latency that the user experiences. Minimum latency can be realized by only transferring the data that has been changed in the computer memory.
Checkpointing may be used when both the first computer and the second computer are executing the same instructions. When both computers are executing the same code at the same time, they may be periodically stopped and the contents of the CPU registers and memory contents compared with each other. If the computers have identical CPU register values and memory contents, they are allowed to continue processing. When both computers are comparing memory and register values, comparison latency is lowest when only the data that has been changed is compared between the two systems. Various methods have been used in the prior art to reduce the amount of time necessary to copy the contents of external memory to the second computer.
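The comparison of two computers executing the same instructions may be sketched as follows. The `lockstep_check` function and the dictionary keys `"regs"` and `"mem"` are hypothetical, and the set of changed addresses is assumed to be supplied by some tracking mechanism that is not modeled here.

```python
def lockstep_check(first, second, changed_addrs):
    """Compare CPU registers and only the memory addresses that have changed.

    `first` and `second` are dicts with hypothetical keys "regs" and "mem".
    Returns a list of mismatch descriptions; an empty list means both
    computers may continue processing.
    """
    mismatches = []
    if first["regs"] != second["regs"]:
        mismatches.append("registers")
    for addr in sorted(changed_addrs):
        if first["mem"].get(addr) != second["mem"].get(addr):
            mismatches.append(f"memory[{addr:#x}]")
    return mismatches
```

Because only the addresses in `changed_addrs` are examined, the cost of each check in this sketch grows with the amount of changed data rather than with the total size of memory.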