A computer system configuration in which a plurality of computers all access a collection of shared memory (i.e., disks) is called a "shared disk system". Restated, in a shared disk environment, multiple nodes (computers) of a cluster concurrently write to one or more shared disks. A system that exploits such a shared environment, such that multiple nodes access common data, is called a "data sharing" system. A distinguishing feature of data sharing and shared disk systems from other systems is that a data item can reside in the caches of multiple computers (nodes) and can be updateable from these nodes. That is, there is no one computer used as a server or coordinator; all system computers are peers in their access to the shared data.
FIG. 1 illustrates a typical data sharing environment of a computer system 11. A memory storage 14 comprises persistent storage (p-store) and thus is the system storage whose contents persist when part or all of the computer system 11 crashes. Traditionally, this storage 14 has been magnetic disk. A "page" of persistent storage is an automically updateable entity. An integral number of pages of persistent storage forms a block 18. Blocks 18 are the recoverable objects/entities of the computer system 11. Each block 18 is identified by a unique block identifier (BID).
Computer nodes 10A, 10B, and 10C of computer system 11 access memory storage 14 via a storage interconnect subsystem 19. Each node 10A, B, C supports one respective local cache 12A, 12B, 12C. Each cache 12 is a simple shared memory and the corresponding computer/node 10 may have multiple processors. Each cache 12A, B, C is formed of volatile storage for holding certain data blocks 18. A node 10 can operate only on blocks 18 which have been read from memory storage 14 into respective local cache 12. Because the local cache 12 is formed of volatile storage, the contents (i.e., locally stored data blocks 18) of the cache 12 may be lost during a system 11 crash.
Each node 10A, 10B, 10C has a respective, dedicated transaction log 16A, 16B, 16C. Each transaction log 16A, B, C is formed of a series of records, one record for each operation (or action) performed by the respective node 10 on corresponding cache data.
In the case of prior art data base systems, each node 10 is a cooperating computer running software to implement a database system. To that end, a node 10 reads a desired data block 18 from memory storage 14 into the node's local cache 12. The node 10 writes into a respective transaction log 16 logical changes to (i.e., operations on) data blocks 18 stored in local cache 12. Each transaction log 16 is a sequential file used to record information that permits an operation of a transaction to be performed again (redone), should the system 11 crash prior to committed data being written to memory storage 14. Recovery involving the transaction log 16 assures the durability of transaction effects in memory storage 14.
That is, after normal operation of the nodes 10 writing entries to respective transaction logs 16, computer system 11 writes the data back to memory storage 14. A distributive lock manager coordinates who (which node 10) updates a data block of 18 for the purposes of serializing operations on the data blocks 18. If the system 11 crashes during the time of data being updated and written back to memory storage 14, a recovery mode of operation must be undertaken.
The transaction logs provide recovery of state of the database after a system crash or failure. The illustrated data sharing system 11 utilizes one transaction log 16 for each node/computer 10. However, single log recovery approaches also exist. In the case of a single/central transaction log, a synchronization mechanism is required to enable the multiple nodes to write to the single transaction log during operation of the nodes. An example of a single log system is Microsoft's NTFS.
However, better scalability and performance is achieved using a multiple log approach as disclosed by David B. Lomet in "Recovery for Shared Disk Systems Using Multiple Redo Logs," Digital Equipment Corporation Cambridge Research Lab, Report CRL 90/4, Oct. 1, 1990. In Lomet, a separate transaction log 16 for each node 10 is used. This avoids the synchronization issues of the single log approaches. Lomet describes the technique of storing so called "before state identifiers" (BSI's) in each transaction log record. This means that each transaction log record contains the precise identity of the data block state that is seen before the logged action is performed.
The process of recovery makes all blocks 18 "current" (i.e., all updates to the blocks have been applied). Transaction log records are applied to a block 18 until there are no more transaction log records for that block 18. At this point the block 18 is said to be current. Further, the block 18 is considered to be "One-Log" because all transaction log records that need to be applied to the block 18 are held in a single transaction log 16. More specifically, the block is required to be One-log and a certain protocol is used to achieve this. Namely, the block is written to disk when ownership moves from one node to another.
A state identifier is used in Lomet to determine whether or not to apply a particular transaction log record. A state identifier is a definitive description of the block 18 contents and is much smaller than the complete state. A state identifier is defined by the system 11 storing a defining state identifier (DSI) in block 18. The DSI denotes the block state of the block in which the DSI is included.
During recovery, the system 11 knows whether a transaction log record applies to a block state by testing for equality between the transaction log record BSI and the defining state identifier (DSI) in the block. If the transaction log record BSI is determined to equal the block DSI, then the operation recorded in the transaction log record is executed on the data to transform the subject data block of the transaction record from its present state to a successor state.
As such, Lomet's scheme avoids concurrency control and coordination required for N-log recovery, by enforcing one-log redo and maintaining the independent redo recovery assumption. That assumption provides that the state of a data block 18 is able to be transformed to a successor state during redo recovery based solely on the state of the block at the time the original operation was applied and the contents of the transaction log record that describes the operation.
Using BSI's in this manner means that it is possible to identify which transaction log record to apply to a block 18 in some state, independently of transaction log 16 ordering or the number of transaction logs 16. One does not rely on the relative ordering of records between transaction logs 16; the test of determining whether or not to apply a transaction log record to a block 18 is inherent in their (the transaction log records) states. With the uniqueness of state identifiers and the one-log condition, during recovery, only one transaction log will contain a transaction log record with a BSI that matches a block's DSI.
Of further note, in a typical database system, the content of every block 18 is directly managed by the database system 11. Therefore, it is practical to include a state identifier in the definition of every block. Each block may be initialized in memory storage 14 when it is first allocated to the database. Initialization is an expensive operation, but it need not occur during normal operation of the database. Alternatively, the initial random state found in the data block 18 could be used as its initial state identifier. This requires that the block be read from memory storage 14 before its first use.
When a free (not allocated for use) data block 18 is reused, the previous contents are of no interest. However, Lomet's scheme requires that the block be read from memory storage 14 in order to generate a new unique state identifier for the block. In some applications this may be a significant performance cost.
In addition to Lomet's recovery method for database systems, other recovery methods/approaches exist for file systems in shared disk/data environments. In particular, file system specific recovery approaches for a shared disk environment include Open VMS's Files-11XQP and Digital Equipment Corporation's Frangipani. See McCoy, Kirby, "VMS File System Internals", Digital Press, 1990; and Thekkath, Mann and Lee, "Frangipani: A Scalable Distributed File System", Proceedings of the 16th ACM Symposium on Operating Systems Principles, pp. 224-237, ACM Press (October 1997). The Files-11XQP design uses a technique called safe writing which avoids the use of transaction logs (generically termed "logging"). As a consequence, Files-11XQP loses the benefits of logging/transaction logs.
On the other hand, Frangipani uses a multiple log approach very similar to Lomet's scheme. Frangipani is the only previous shared disk file system known by applicant to use logging. It avoids the problems of page allocation and initialization by relying on an underlying storage manager called Petal. Because Petal is a special storage management system, use of Frangipani is limited (i.e., does not operate directly with standard disks). However, Petal can use standard disks.