A database is made up of one or several tables of data. A table may be subdivided into fragments. The fragments are made up of records (also called tuples). In a parallel system the fragments may be stored in a number of different nodes that communicate over a local network. The nodes are managed by a centralized management system. A common way to protect the system from losing data when a node fails is to make replicas of each fragment in the system and store one replica in one node as a primary replica and at least one other replica in at least one other node as backup. By keeping backup copies of data the system may continue to function even when a node fails and a primary copy of data or a backup copy is lost. This is possible since the data that is lost through the failure is also contained in other nodes in the system.
After a node failure, it is desirable to recover the node by rebuilding the fragment or fragments that the node contained before the failure. An important part of the recovery is to make sure that the fragments of the recovered node are up to date. Transactions such as updates, inserts and deletes will have taken place while the node was down. Transactions are often also allowed to continue during the recovery process. It is necessary to take all such transactions into account when rebuilding the fragments. If a transaction that was performed while the node was down or while the recovery was in progress is missed, the recovered node will not be up to date.
There are several known methods for performing node recovery. One such method is the so-called “copy method”. The copy method is simple and straightforward. Since replicas of all data are held on different nodes in the system, the data to be recovered will exist on a node other than the failed node. A new fragment is built from scratch on the recovering node by copying the corresponding fragment from the other node on which the fragment exists. One way of performing the copying is to copy and transfer one record at a time to the recovering node. If all write transactions are stopped during the recovery, the new fragment will be an up-to-date version when all records have been copied from the existing fragment. If write transactions are allowed during the recovery, arrangements must be made so that both the node holding the existing fragment and the recovering node receive each requested write transaction. In order to avoid inconsistencies, each record is locked so that no write transactions may be performed on it while it is being copied and transferred to the recovering node. If the above is performed in a careful manner, the recovered node will be up to date when all records have been copied and transferred, without stopping write transactions during the recovery.
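The copy method with per-record locking can be illustrated with a minimal sketch. It assumes a fragment can be modeled as an in-memory mapping from record key to value; the names `Fragment`, `write` and `copy_fragment` are hypothetical and chosen only for illustration, and deletes are left out for brevity.

```python
import threading


class Fragment:
    """A fragment modeled as key -> value, with one lock per record (illustrative only)."""

    def __init__(self, records=None):
        self.records = dict(records or {})
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, key):
        # Create the per-record lock on first use.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())


def write(key, value, source, target, recovering):
    """Apply a write to the live fragment and, during recovery, to the recovering one as well."""
    with source.lock_for(key):
        source.records[key] = value
        if recovering:
            target.records[key] = value


def copy_fragment(source, target):
    """Copy one record at a time, holding that record's lock while it is transferred."""
    for key in list(source.records):
        with source.lock_for(key):
            if key in source.records:
                target.records[key] = source.records[key]


# A write before the copy and a write after it both end up on the recovering node.
primary = Fragment({1: "a", 2: "b", 3: "c"})
recovering_node = Fragment()
write(2, "b2", primary, recovering_node, recovering=True)
copy_fragment(primary, recovering_node)
write(4, "d", primary, recovering_node, recovering=True)
```

Because a writer and the copier take the same per-record lock, a record is never transferred while it is being modified, and every write issued during the recovery reaches both fragments.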
Another known method of performing node recovery is a method in which log records are executed. In this method, the nodes keep logs in which the log records are stored. The log records contain information regarding how corresponding transactions, such as inserts, deletes or updates, have changed the database fragments. When a node fails, log records corresponding to transactions made to the fragments of the failed node are stored in a log in at least one other functioning node, at least from the time the node failed until it has recovered. There are many different ways in which to generate log records. In order to be able to use the log records in the node recovery process, it must be possible for log records generated at a node that is alive to be executed on the recovering node. Instead of rebuilding the lost fragment from scratch as in the copy method, it is assumed in this method that an old version of the fragment is available in the recovering node. The old version of the fragment may for instance be a version that was stored on disk before the node failure. The old version of the fragment may lack a number of records that were inserted while the node was down. Further, it may still contain records that have been deleted, and it may contain a number of records that are out of date since they were updated while the recovering node was down. Since logs are kept of all transactions made, there will be an active node that contains the log records of the transactions that have been performed since the node failed. These log records will bring the recovering node up to date if they are executed on the recovering node and if no write transactions are allowed during the recovery process.
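As an illustration of the log method, the following sketch replays log records on a stale copy of a fragment. It assumes a log record can be reduced to an operation name, a record key and an optional new value; `LogRecord` and `apply_log` are hypothetical names, and real log formats differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    op: str                      # "insert", "update" or "delete"
    key: int
    value: Optional[str] = None  # unused for deletes


def apply_log(fragment, log):
    """Execute log records in order on an old version of a fragment (a dict)."""
    for rec in log:
        if rec.op in ("insert", "update"):
            fragment[rec.key] = rec.value
        elif rec.op == "delete":
            fragment.pop(rec.key, None)
        else:
            raise ValueError(f"unknown operation: {rec.op}")
    return fragment


# The old version lacks record 4, still holds deleted record 2,
# and has an out-of-date value for record 1.
old_version = {1: "a", 2: "b", 3: "c"}
log = [
    LogRecord("insert", 4, "d"),
    LogRecord("delete", 2),
    LogRecord("update", 1, "a2"),
]
recovered = apply_log(old_version, log)
```

The sketch only brings the fragment up to date if no further writes arrive while the log is being replayed, which matches the condition stated above.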
Disallowing write transactions during the recovery process is highly undesirable. The method can be made more attractive by allowing write transactions, such as insert, delete and update, during the recovery process, but this will increase the complexity of the method. After the log records that correspond to transactions made from the time the recovering node failed until the recovery process started have been executed on the recovering node, log records of transactions that have been performed from the time the recovery process started until a first point in time are executed. Thereafter another iteration may be made in which log records that were generated from the first point in time until a second point in time are executed. The process may continue with several iterations, and each iteration will hopefully bring the fragment of the recovering node closer to the real version of the fragment. It is however hard for the recovering node to catch up with the real version without stopping write transactions at least for a short time.
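The iterative catch-up described above can be sketched as a loop in which each round replays the log records that existed when the round began. This is a simplified model under stated assumptions: log records are plain `(op, key, value)` tuples, and `writes_during_round` is a hypothetical callback that simulates the transactions committed on the live node while a round is executing.

```python
def iterative_catch_up(fragment, log, writes_during_round, max_rounds=5):
    """Replay `log` in rounds while new log records keep arriving.

    Returns the number of log records still unapplied after the last round.
    """
    applied = 0
    for _ in range(max_rounds):
        round_end = len(log)  # the "point in time" that closes this round
        if applied == round_end:
            break             # nothing new arrived: the fragment has caught up
        for op, key, value in log[applied:round_end]:
            if op == "delete":
                fragment.pop(key, None)
            else:             # "insert" or "update"
                fragment[key] = value
        # Records generated on the live node while this round was running.
        log.extend(writes_during_round(round_end - applied))
        applied = round_end
    return len(log) - applied


backlog = [("update", k, "new") for k in range(8)]

# If fewer writes arrive than are replayed, the backlog shrinks to zero.
lag_shrinking = iterative_catch_up(
    {}, list(backlog), lambda n: [("update", 1, "v")] * (n // 2)
)

# If at least one write arrives in every round, the backlog never empties.
lag_steady = iterative_catch_up(
    {}, list(backlog), lambda n: [("update", 1, "v")]
)
```

In the second case the loop ends with a non-empty backlog however many rounds are allowed, which is why write transactions typically have to be stopped briefly to close the final gap.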
R. Agrawal and D. DeWitt, “Integrated Concurrency Control and Recovery Mechanisms: Design and Performance Evaluation”, ACM Transactions on Database Systems, Vol. 10, No. 4, Dec. 1985, pages 529-564, describes three basic recovery mechanisms using logs, shadows or differential files respectively. The recovery mechanism using logs corresponds to the log method described above and the recovery mechanism using shadows corresponds to the copy method described above. The use of differential files involves keeping a local differential file and a global differential file for storing updates. Before a transaction is committed, its updates go to the local differential file. When the transaction commits, the local differential file is appended to the global differential file and a timestamp of the committing transaction is written to a CommitList. In case of a recovery, only transactions with timestamps that appear in the CommitList are taken into account.
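The differential-file mechanism summarized above can be illustrated with a small in-memory sketch. The class and method names are hypothetical, and splitting the commit into two separate steps, so that a failure between them can be simulated, is an assumption of this sketch rather than a detail taken from the cited article.

```python
class DifferentialFiles:
    """In-memory model of local and global differential files plus a CommitList."""

    def __init__(self):
        self.local = {}           # timestamp -> pending (key, value) updates
        self.global_file = []     # (timestamp, key, value) entries
        self.commit_list = set()  # timestamps of committed transactions

    def update(self, ts, key, value):
        # Before commit, updates go to the transaction's local differential file.
        self.local.setdefault(ts, []).append((key, value))

    def append_to_global(self, ts):
        # Commit step 1 (assumed ordering): move local updates to the global file.
        for key, value in self.local.pop(ts, []):
            self.global_file.append((ts, key, value))

    def record_commit(self, ts):
        # Commit step 2: write the transaction's timestamp to the CommitList.
        self.commit_list.add(ts)

    def commit(self, ts):
        self.append_to_global(ts)
        self.record_commit(ts)

    def recover(self, base):
        """Rebuild state from `base`, using only transactions in the CommitList."""
        state = dict(base)
        for ts, key, value in self.global_file:
            if ts in self.commit_list:
                state[key] = value
        return state


files = DifferentialFiles()
files.update(1, "x", "1")
files.commit(1)              # fully committed
files.update(2, "x", "2")
files.append_to_global(2)    # simulated failure before the CommitList write
files.update(3, "y", "3")    # never committed
state = files.recover({"x": "0"})
```

Only transaction 1 appears in the CommitList, so the recovered state reflects its update and ignores the other two.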
The European patent application EP0758114A1 describes recovery processing using log records in a system in which a plurality of servers cooperatively perform distributed processing of transactions.
The known methods for node recovery described above suffer from a number of drawbacks. With the copy method, the time for recovery grows with the size of the fragments to be recovered. Thus the copy method may become too slow when the fragments to be recovered are very large. With the method using log records it will most likely be necessary to stop write transactions for at least a short time, which is highly undesirable. If write transactions are allowed during the recovery process, new log records will be generated while old log records are being executed. Handling this increases the complexity of the method. Another drawback of the method using log records is that the log may grow very large if the period between node failure and recovery is long. If the maximum size of the log is reached, the log will not be able to store any new log records, which means that write transactions will have to be stopped.
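The last drawback can be made concrete with a minimal sketch of a log with a fixed capacity. The name `BoundedLog` and the choice to count capacity in records rather than bytes are assumptions made for illustration.

```python
class BoundedLog:
    """A log that can hold at most `max_records` log records."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.records = []

    def try_append(self, record):
        """Return True if the record was logged; False means the write must be refused."""
        if len(self.records) >= self.max_records:
            return False
        self.records.append(record)
        return True


log = BoundedLog(max_records=2)
accepted = [log.try_append(("update", key, "v")) for key in range(3)]
```

Once the log is full, every further write has to be rejected or delayed until the failed node has recovered and the log can be truncated.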