1. Field of the Invention
The present invention relates to reliability in computer systems. More specifically, the present invention relates to a method and an apparatus for facilitating reliable execution in a computer system by keeping track of modifications to main memory in order to enable a rollback if an error condition arises.
2. Related Art
Reliability is critically important for some computer systems, such as computer systems that process credit card transactions or computer systems that assist air traffic controllers. These types of computer systems often include circuitry to detect error conditions. For example, computer systems often include circuitry that uses error-correcting codes to detect and correct errors on the fly while a computer system is executing. However, providing additional circuitry to detect error conditions increases the complexity of a computer system, which can greatly increase the amount of time required to design and build the computer system.
Furthermore, even if an error is detected, it may not be possible to backtrack far enough to be able to resume execution from a prior error-free state.
In order to remedy this problem, some computer systems periodically perform checkpointing operations to save the state of a computation so that the computation can be rolled back to a prior state and restarted when an error occurs. Performing a checkpointing operation typically involves executing software to save the state of an application to an archival storage device.
Unfortunately, performing a checkpointing operation can seriously degrade the performance of a computer system because performing the checkpointing operation typically requires the application to be halted while the entire state of the application is copied to archival storage. Hence, checkpointing operations are often impractical to perform for applications that require a significant amount of computational performance.
What is needed is a method and an apparatus that facilitates rolling back a computation without spending a large amount of time performing checkpointing operations.
One embodiment of the present invention provides a system that facilitates reliable execution in a computer system by periodically checkpointing write operations to a main memory of the computer system. The system operates by receiving a write operation directed to the main memory at a memory controller, wherein the write operation includes data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written. Next, the system looks up the write address in a checkpoint store coupled to the memory controller. If the write address is not associated with any entry in the checkpoint store, the system creates an entry for the write address in the checkpoint store, and writes the data to be written to the entry. The system then periodically performs a checkpointing operation, which transfers the data to be written from the checkpoint store to the write address in the main memory.
In one embodiment of the present invention, the system additionally receives a read operation at the memory controller, wherein the read operation is directed to a read address specifying a location in the main memory to be read from. Next, the system looks up the read address in the checkpoint store. If the read address is associated with an entry in the checkpoint store, the system retrieves data from the entry in the checkpoint store to satisfy the read operation. Otherwise, if the read address is not associated with any entry in the checkpoint store, the system retrieves data from the read address in the main memory to satisfy the read operation.
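By way of illustration only, the write and read handling described above may be sketched in software as follows. The class name `BufferingMemoryController`, the use of a Python dictionary as the checkpoint store, and the word-addressed list standing in for main memory are all assumptions made for this sketch and do not appear in the description. The sketch also assumes that a write to an address that already has an entry simply updates that entry, a case the text above does not spell out.

```python
class BufferingMemoryController:
    """Toy model: writes are held in a checkpoint store until a checkpoint."""

    def __init__(self, size):
        self.main_memory = [0] * size   # state as of the last checkpoint
        self.checkpoint_store = {}      # write address -> data to be written

    def write(self, address, data):
        # Look up the write address; create an entry if none exists.
        # Assumption: an existing entry is overwritten with the newer data.
        self.checkpoint_store[address] = data

    def read(self, address):
        # Prefer the checkpoint store; otherwise fall back to main memory.
        if address in self.checkpoint_store:
            return self.checkpoint_store[address]
        return self.main_memory[address]

    def checkpoint(self):
        # Transfer the buffered data to the write addresses in main memory.
        for address, data in self.checkpoint_store.items():
            self.main_memory[address] = data
        self.checkpoint_store.clear()
```

In this sketch main memory always holds the state as of the most recent checkpoint, because no write reaches it between checkpointing operations.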
In one embodiment of the present invention, the checkpoint store is organized as a cache memory.
In one embodiment of the present invention, if a new entry is to be added to the checkpoint store and no room exists in the checkpoint store for the new entry, the system performs a checkpointing operation to transfer the contents of the checkpoint store to the main memory.
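The overflow behavior described above may be illustrated by the following minimal sketch. The class name `BoundedCheckpointStore`, the `capacity` parameter, and the `forced_checkpoints` counter are hypothetical names introduced only for this example; an actual checkpoint store would be a hardware structure rather than a Python dictionary.

```python
class BoundedCheckpointStore:
    """Toy checkpoint store with a fixed number of entries."""

    def __init__(self, capacity, main_memory):
        self.capacity = capacity
        self.main_memory = main_memory   # list standing in for main memory
        self.entries = {}                # write address -> data to be written
        self.forced_checkpoints = 0      # bookkeeping for this sketch only

    def checkpoint(self):
        # Transfer the contents of the checkpoint store to main memory.
        for address, data in self.entries.items():
            self.main_memory[address] = data
        self.entries.clear()

    def write(self, address, data):
        is_new_entry = address not in self.entries
        if is_new_entry and len(self.entries) >= self.capacity:
            # No room for the new entry: checkpoint first to free space.
            self.checkpoint()
            self.forced_checkpoints += 1
        self.entries[address] = data
```

Under these assumptions, a write to an address that already has an entry never forces a checkpoint, since it does not require a new entry.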
In one embodiment of the present invention, the system performs the checkpointing operation by: stopping execution of a central processing unit; storing an internal state of the central processing unit to the main memory; transferring the data to be written from the checkpoint store to the write address in the main memory; and recommencing execution of the central processing unit.
In one embodiment of the present invention, the internal state of the central processing unit includes contents of internal registers in the central processing unit, and dirty cache lines associated with the central processing unit.
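The four-step checkpointing sequence and the composition of the internal state may be sketched as follows. The `ToyCpu` class, the `saved_state` dictionary standing in for a reserved region of main memory, and the choice to route dirty cache lines through the checkpoint store before the transfer are all assumptions of this sketch; the description above states only that the internal state is stored to the main memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCpu:
    registers: dict = field(default_factory=dict)    # register name -> value
    dirty_lines: dict = field(default_factory=dict)  # address -> cached data
    running: bool = True


def perform_checkpoint(cpu, checkpoint_store, main_memory, saved_state):
    # Step 1: stop execution of the central processing unit.
    cpu.running = False

    # Step 2: store the internal state. Registers go to a reserved area
    # (modeled by saved_state); dirty cache lines are written back.
    # Assumption: write-backs pass through the checkpoint store so that
    # they supersede older buffered data for the same address.
    saved_state["registers"] = dict(cpu.registers)
    checkpoint_store.update(cpu.dirty_lines)
    cpu.dirty_lines.clear()

    # Step 3: transfer buffered data to the write addresses in main memory.
    for address, data in checkpoint_store.items():
        main_memory[address] = data
    checkpoint_store.clear()

    # Step 4: recommence execution.
    cpu.running = True
```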
In one embodiment of the present invention, the system additionally delays I/O operations so that the I/O operations are performed after a subsequent checkpointing operation.
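One possible way to delay I/O operations until after a checkpoint is to queue them and release the queue once the checkpoint completes, as in the sketch below. The class name `DeferredIoQueue`, its method names, and the representation of an I/O operation as a Python callable are hypothetical and are used only for illustration.

```python
class DeferredIoQueue:
    """Holds I/O operations until a checkpoint has completed."""

    def __init__(self):
        self.pending = []     # operations requested since the last checkpoint
        self.performed = []   # results of operations that were released

    def request(self, operation):
        # Do not perform the operation yet; only remember it.
        self.pending.append(operation)

    def on_checkpoint_complete(self):
        # Perform the delayed operations in the order they were requested.
        for operation in self.pending:
            self.performed.append(operation())
        self.pending.clear()
```

In this sketch, an operation requested before an error that forces a rollback would still be in `pending` and could be dropped without ever having been performed; how pending operations are treated on rollback is not specified above and is left out of the example.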
One embodiment of the present invention provides a system that facilitates reliable execution in a computer system by keeping track of write operations to a main memory of the computer system in order to undo the write operations if necessary. This system operates by receiving a write operation directed to the main memory at a memory controller, wherein the write operation includes data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written. Next, the system examines a log bit associated with the write address, wherein the log bit indicates whether an existing value from the write address in the main memory has been copied to a checkpoint store. If the log bit is not set, the system creates a new entry for the write address in the checkpoint store; retrieves the existing value from the write address in the main memory; and stores the existing value to the new entry in the checkpoint store. The system then stores the data to be written to the write address in the main memory. The system also periodically performs a checkpointing operation, which clears all entries from the checkpoint store.
In one embodiment of the present invention, upon receiving a read operation at the memory controller, the system retrieves data from the read address in the main memory to satisfy the read operation.
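The log-bit scheme described above may be illustrated by the following minimal sketch. The class name `UndoLoggingMemoryController` and the use of one log bit per word are assumptions made for the example. The sketch further assumes that the log bit is set when the existing value is copied and that all log bits are cleared at each checkpoint; neither step is stated explicitly above, although both appear to follow from the stated meaning of the log bit.

```python
class UndoLoggingMemoryController:
    """Toy model: writes go to main memory; old values are logged once."""

    def __init__(self, size):
        self.main_memory = [0] * size
        self.log_bits = [False] * size   # assumption: one log bit per word
        self.checkpoint_store = []       # (address, prior value), oldest first

    def write(self, address, data):
        if not self.log_bits[address]:
            # First write since the last checkpoint: save the existing value.
            self.checkpoint_store.append((address, self.main_memory[address]))
            self.log_bits[address] = True   # assumed bookkeeping step
        self.main_memory[address] = data

    def read(self, address):
        # Main memory always holds the current value in this scheme.
        return self.main_memory[address]

    def checkpoint(self):
        # Clear all entries; assumption: log bits are reset as well.
        self.checkpoint_store.clear()
        self.log_bits = [False] * len(self.log_bits)
```

Because only the first write to an address after a checkpoint is logged, repeated writes to the same address consume a single entry in the checkpoint store.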
In one embodiment of the present invention, the checkpoint store is organized as a first-in-first-out (FIFO) buffer.
In one embodiment of the present invention, if an error occurs during execution of the computer system, the system restores a state of the main memory to a preceding checkpoint by replacing values that have been modified with prior values retrieved from the checkpoint store. The system also restores the internal state of the central processing unit from the main memory.
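A rollback of this kind may be sketched as follows. The function name `rollback`, its parameters, and the representation of the checkpoint store as a list of (address, prior value) pairs are hypothetical. Replaying the entries newest first and clearing the log bits afterward are assumptions of the sketch rather than requirements stated above.

```python
def rollback(main_memory, checkpoint_store, log_bits, saved_registers):
    """Restore main memory to the preceding checkpoint.

    checkpoint_store is a list of (address, prior_value) pairs in the
    order they were logged; saved_registers models the CPU state that
    was stored to main memory at the preceding checkpoint.
    """
    # Replace modified values with their prior values, newest entry first.
    while checkpoint_store:
        address, prior_value = checkpoint_store.pop()
        main_memory[address] = prior_value
        log_bits[address] = False   # assumed: nothing is logged after rollback

    # Restore the internal state of the CPU from its saved copy.
    return dict(saved_registers)
```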