1. Field of the Invention
The present invention relates generally to high reliability computer complexes. More specifically, the invention relates to a device and method for providing a computer system with dual file locking and audit trail sequence number generating capabilities.
2. Description of the Prior Art
High reliability computers and computer systems are used in demanding applications including on-line airline reservation and ticketing systems, credit card systems, and banking and automatic teller machine systems. Such applications require high throughput transaction processing and short response times. The applications require high reliability as users are unable to transact business while the computer system is down. The nature of the transactions requires access from geographically diverse sites to a central shared database, as the transactions act upon the same shared database and cannot be completely distributed.
The financial nature of the applications requires high reliability in that transactions cannot be lost, performed twice, or performed only in part. This requires rigorously designed hardware and software including database locking and audit trail capabilities. The requirements of operating upon a central database, high reliability, high availability, and fast response combine to require solutions not possible by simply adding more host processors ad hoc.
The use of computer clusters as a means for adding more host processors is becoming more common, particularly the use of multiple host computers in place of an equivalently sized single larger host computer. The use of multiple hosts provides incremental scalability in both performance and reliability. Specifically, additional throughput and storage capacity can be obtained by adding a host, rather than by removing one host and replacing it with an even larger host. The loss of one of several hosts for planned or unplanned reasons is more tolerable in a system with several hosts than in a system with a single host.
Access to files in a computer system requires coordinated access among multiple processes. Such coordination within a single computer is accomplished using well known methods such as file locking of varying granularities. Communication among several processes is ultimately directed by the operating system or database manager.
The use of sequence numbers in database management is well known. The relative order of actions upon a database is important at two levels. At a finer level, the relative ordering is important to allow a rollback of the individual actions upon a database. An example of such a rollback is the undoing of updates to a database when a transaction dies at a point that would leave the database in an inconsistent state. Such a rollback would be required when a disk drive died midway through a funds transfer, e.g., when one account has been debited but the other account has not yet been credited. The debit can be undone and the transaction tried again, perhaps on a mirrored copy of the failed disk. Timestamps are often used for this purpose, even though the relative ordering is more important than the wall clock time.
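By way of illustration only, the fine-grained rollback described above can be sketched as follows. The class name `Accounts`, the in-memory undo log, and the `fail_after_debit` flag that simulates a disk failure are all hypothetical and are not taken from any particular system. The sketch assumes each update is logged with a simple counter before it is applied, so that updates can be undone in reverse order.

```python
class Accounts:
    """Toy in-memory ledger with an undo log (hypothetical, for illustration)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.undo_log = []   # entries: (sequence number, account, previous balance)
        self.next_seq = 0    # a counter stands in for a timestamp

    def _write(self, account, new_balance):
        # Log the old value before overwriting it so the write can be undone.
        self.undo_log.append((self.next_seq, account, self.balances[account]))
        self.next_seq += 1
        self.balances[account] = new_balance

    def rollback(self):
        # Undo in reverse sequence order; only the relative order matters.
        while self.undo_log:
            _, account, old_balance = self.undo_log.pop()
            self.balances[account] = old_balance

    def transfer(self, src, dst, amount, fail_after_debit=False):
        self.undo_log.clear()
        try:
            self._write(src, self.balances[src] - amount)
            if fail_after_debit:
                # Assumption: a disk failure surfaces as an OSError.
                raise OSError("simulated disk failure")
            self._write(dst, self.balances[dst] + amount)
        except OSError:
            self.rollback()      # the debit is undone; the caller may retry
            return False
        self.undo_log.clear()    # transaction complete; nothing left to undo
        return True
```

In this sketch a transfer that fails after the debit leaves both balances unchanged, and the same transfer can then be retried.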
At a coarser level, sequence numbers are used as an audit trail to return a database to a recent state given an initial consistent state and a series of consistent transactions. An example of this would be bringing the disks of a downed host up to date, given an initial state and a list of account transfers that have occurred since the disks were taken down. This coarser use of sequence numbers is referred to in the present application as the use of "Commit Sequence Numbers" (CSNs).
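As a minimal sketch of this coarser use, the following shows a stale copy of a database being brought up to date by replaying committed transfers in CSN order. The function name `replay`, the tuple layout of an audit trail entry, and the `last_applied_csn` parameter are assumptions made for the example only.

```python
def replay(initial_state, audit_trail, last_applied_csn):
    """Return (state, csn) after applying audit trail entries newer than last_applied_csn.

    audit_trail: iterable of (csn, src, dst, amount) tuples, in any order.
    Entries with csn <= last_applied_csn are assumed to be reflected already
    in initial_state and are skipped.
    """
    state = dict(initial_state)
    pending = [entry for entry in audit_trail if entry[0] > last_applied_csn]
    # Apply strictly in commit order, regardless of arrival order.
    for csn, src, dst, amount in sorted(pending, key=lambda entry: entry[0]):
        state[src] -= amount
        state[dst] += amount
        last_applied_csn = csn
    return state, last_applied_csn
```

Because each entry is itself a consistent transaction, the state after every step of the replay is consistent, which is what allows recovery to start from any consistent initial state.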
In a multi-host environment, both file locking and sequence number generation become more difficult. If the hosts are peers on a network, there is no equivalent of an operating system overseeing all file access as there is among processes within a single computer. While a multiplicity of hosts increases system power and scalability, it complicates contention problems for shared files, as hosts run asynchronously of one another and can disagree, at the same instant in time, as to whether a file is locked.
This complication can be countered by using a central "outboard" device to control access to shared files, using file locking which resides in the central device. The term "outboard" refers to the fact that the device is located outside of the host input-output boundary. The design and use of an outboard file cache system is described in commonly assigned U.S. patent application Ser. No. 08/174,750, filed Dec. 23, 1993, now U.S. Pat. No. 5,809,527, entitled "Outboard File Cache System", which is herein incorporated by reference (hereafter referred to as the "OFC" application). The device as disclosed in the OFC application is used to cache files. The outboard device is a fully functional computing device and is used in the present invention to generate sequence numbers and perform inter-host messaging as well.
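The idea of keeping lock state in a single central device can be pictured with the following sketch. It is not the mechanism of the OFC application; the class `OutboardLockTable`, its methods, and the use of host identifier strings are hypothetical. The point is only that every host consults one table, so hosts cannot hold conflicting views of whether a file is locked.

```python
class OutboardLockTable:
    """Single authoritative lock table shared by all hosts (hypothetical sketch)."""

    def __init__(self):
        self._owners = {}  # file name -> identifier of the host holding the lock

    def lock(self, host, file_name):
        # Grant the lock if the file is free; a repeat request by the
        # current owner also succeeds. Any other host is refused.
        owner = self._owners.setdefault(file_name, host)
        return owner == host

    def unlock(self, host, file_name):
        # Only the owning host may release the lock.
        if self._owners.get(file_name) != host:
            return False
        del self._owners[file_name]
        return True
```

A real device would also need queuing of refused requests and recovery of locks held by a failed host, which this sketch omits.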
U.S. Pat. No. 5,140,685 describes a precursor outboard device to the XPC which included file locking, sequence number generation, and messaging. U.S. patent application Ser. No. 08/174,750, now U.S. Pat. No. 5,809,527, describes an outboard file cache device. U.S. patent application Ser. No. 08/779,681, filed Jan. 7, 1997, titled "Dual XPCS for Disaster Recovery" (RA-3431), describes the use of dual outboard file cache devices. All of the above commonly assigned patents and applications are herein incorporated by reference.
Sequence number generation from multiple hosts operating on the same shared files becomes a problem, as different hosts have slightly different internal times and different internal counters. Synchronizing either times or counters between hosts executing millions of instructions per second is problematic. Even where it can be done, the forced lock-step execution penalty is severe. A solution is to use a central device to issue sequence numbers upon request to any of the several hosts in a system. The central device can be an outboard device, residing away from the main bus of any of the hosts.
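A minimal sketch of such a central issuer is given below, with threads standing in for asynchronously running hosts. The names `SequenceNumberGenerator` and `run_hosts` are hypothetical, and the sketch assumes that a single counter protected by a mutex is an adequate model of the central device.

```python
import threading


class SequenceNumberGenerator:
    """Central counter issuing unique, increasing numbers on request (sketch)."""

    def __init__(self, start=0):
        self._next = start
        self._mutex = threading.Lock()

    def next_number(self):
        # The mutex serializes requests, so no two callers receive the same value.
        with self._mutex:
            value = self._next
            self._next += 1
            return value


def run_hosts(generator, host_count, requests_per_host):
    """Simulate hosts as threads; return a list of (number, host_id) pairs."""
    results = []
    results_mutex = threading.Lock()

    def host(host_id):
        for _ in range(requests_per_host):
            number = generator.next_number()
            with results_mutex:
                results.append((number, host_id))

    threads = [threading.Thread(target=host, args=(i,)) for i in range(host_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The hosts need no synchronized clocks or counters of their own; ordering is defined entirely by the central counter.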
In using a central outboard device for either file locking or sequence number generation, the central device is potentially a single point of failure. Physically adding a second outboard device in communication with each of the hosts is possible, but creates several problems that have not heretofore been solved.
One problem includes coordinating or dealing with the sequence numbers generated by asynchronously running outboard devices. Another problem includes reliably handling outboard devices dying and coming on line. Yet another problem relates to file locking in a system having a single outboard device, file locking in a system initially having two un-synchronized outboard devices, file locking in a system having two synchronized outboard devices, and transitions between these systems without losing locks. In particular, a high availability system can have some hosts aware of two outboard devices and other hosts unaware.