Computers have become an integral tool in a wide variety of applications, such as finance and commercial transactions, three-dimensional and real-time graphics, computer-aided design and manufacturing, healthcare, telecommunications, education, etc. Computers are finding new applications as their performance and speed continue to increase while costs decrease due to advances in hardware technology and rapid software development. Furthermore, a computer system's functionality and usefulness can be dramatically enhanced by coupling stand-alone computers together to form a computer network. In a computer network, users may readily exchange files, share information stored on a common database, pool resources, communicate via e-mail, and even video teleconference.
One popular type of network setup is known as "client/server" computing. Basically, users perform tasks through their own dedicated desktop computer (i.e., the "client"). The desktop computer is networked to a larger, more powerful central computer (i.e., the "server"). The server acts as an intermediary between a group of clients and a database stored in a mass storage device. An assortment of network and database software enables communication between the various clients and the server. Hence, in a client/server arrangement, the data is easily maintained because it is stored in one location and managed by the server; the data can be shared by a number of local or remote clients; the data is easily and quickly accessible; and clients may readily be added or removed.
Although client/server systems offer a great deal of flexibility and versatility, people are sometimes reluctant to use them because of their susceptibility to various types of failures. Furthermore, as computers take on more comprehensive and demanding tasks, the hardware and software become more complex, and hence the overall system becomes more prone to failures. A single server failure may detrimentally affect a large number of clients dependent on that particular server. In some mission-critical applications, computer downtime may have serious implications. For example, if a server fails in the middle of processing a financial application (e.g., payroll, securities, bank accounts, electronic money transfer, etc.), the consequences may be quite severe. Moreover, customer relations might be jeopardized (e.g., lost airline, car rental, or hotel reservations; delayed or mis-shipped orders; lost billing information; etc.).
Short of totally eliminating all failures which might disable the computer system, the goal is to ensure that data is not lost because of a failure. One prior art mechanism for accomplishing this goal is known as "checkpointing." Basically, checkpointing periodically writes changed data back to the database, so that the data can be recovered when the computer system becomes disabled. FIG. 1 is a diagram describing a typical prior art computer system having checkpointing. The system may incorporate a number of clients 101-109 (e.g., personal computers, workstations, portable computers, minicomputers, terminals, etc.) which are serviced by one or more servers 110 and 111. Each of the clients interacts with server nodes 110 and 111 through various client programs, known as "processes", "workers", "threads", etc. Each of the server nodes 110 and 111 has its own dedicated main memory 113 and 114, respectively. Data from a large commonly shared storage mechanism, such as a disk array 112, is read into the main memories 113 and 114. Thereby, vast amounts of data stored in the form of relational databases residing in the disk array 112 are accessible to either of the servers 110 and 111 for distribution to the various clients 101-109. As data is changed by the users, the modified data is stored back into the main memories 113 and 114. The data is then marked to indicate that it has been changed. Periodically, the marked data is checkpointed back to the database residing in disk array 112. This involves writing all marked data to its corresponding locations in disk array 112. In addition, all changes made after the most recent checkpoint are recorded into a separate log file 115.
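The marking, checkpointing, and logging steps described above can be illustrated with a minimal sketch. It is not taken from any actual prior art implementation: the class name `CheckpointingServer`, its methods, and the use of in-memory dictionaries and lists to stand in for disk array 112 and log file 115 are all assumptions made for illustration. In particular, emptying the log at each checkpoint is an assumed policy, chosen so that the log holds only the changes made after the most recent checkpoint.

```python
class CheckpointingServer:
    """Toy model of one server node (hypothetical names throughout)."""

    def __init__(self, disk, log):
        self.disk = disk          # dict standing in for the database on disk array 112
        self.log = log            # list standing in for log file 115
        self.memory = dict(disk)  # data read from the disk array into main memory
        self.dirty = set()        # keys marked as changed since the last checkpoint

    def update(self, key, value):
        """Apply a user's change in main memory, mark it, and log it."""
        self.memory[key] = value
        self.dirty.add(key)
        self.log.append((key, value))

    def checkpoint(self):
        """Write every marked item back to its location in the database."""
        written = 0
        for key in sorted(self.dirty):
            self.disk[key] = self.memory[key]
            written += 1
        self.dirty.clear()
        # Assumption: changes now on disk no longer need to stay in the log.
        self.log.clear()
        return written


disk = {"acct-1": 100, "acct-2": 250}
log = []
server = CheckpointingServer(disk, log)
server.update("acct-1", 75)    # only main memory and the log change here
print(disk["acct-1"], log)     # the database still holds the old value
server.checkpoint()
print(disk["acct-1"], log)     # the database now holds the new value
```

In this sketch the database lags behind main memory between checkpoints, and the log is the only durable record of the changes made during that interval.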
When one of the server nodes 110 or 111 crashes, it loses all data contained in its respective main memory. However, most of the changes to the data have already been copied over to the database during the last checkpoint. The database is stored in the nonvolatile disk array 112. Hence, the data is not lost, even though power is unexpectedly terminated. Upon recovery, this data can be read from the database and stored back into the main memory. Furthermore, the most recent changes to the data made since the last checkpoint are read back from the log file 115 and applied to the data in the main memory.
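The recovery sequence just described, reloading the checkpointed data and then reapplying the logged changes, might be sketched as follows. The function name `recover` and the representation of the database as a dictionary and the log as a list of (key, value) pairs are assumptions made for illustration only.

```python
def recover(disk, log):
    """Rebuild main memory after a crash (hypothetical helper).

    disk: dict standing in for the database on the nonvolatile disk array
    log:  list of (key, value) changes made since the last checkpoint,
          in the order in which they were made
    """
    memory = dict(disk)        # step 1: read checkpointed data back into memory
    for key, value in log:     # step 2: reapply logged changes, oldest first
        memory[key] = value
    return memory


# State on disk as of the last checkpoint, plus three later logged changes.
disk = {"acct-1": 75, "acct-2": 250}
log = [("acct-2", 300), ("acct-2", 310), ("acct-3", 40)]
print(recover(disk, log))
# prints {'acct-1': 75, 'acct-2': 310, 'acct-3': 40}
```

Replaying the log in its original order matters: when the same item was changed more than once after the checkpoint, the last logged value is the one that must survive.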
Although checkpointing addresses the main problems of data recovery and preservation, it nevertheless has several drawbacks. Namely, checkpointing is very costly to implement in terms of processing time. There is a severe performance penalty associated with checkpointing, primarily because the marked records have to be written back to various disk locations in the database. These locations are usually scattered throughout different physical locations of the disk array 112. Often, thousands of transactions need to be updated during each checkpoint, and each of these transactions typically requires its own separate input/output (I/O) operation to gain access to the desired location. Furthermore, if the page to which the data is to be written back is not currently in the main memory, the page must first be read off the disk; the data must then be merged with that page; and the page must then be written back to the disk. This sequence of events requires two synchronous I/O operations. Thus, it is not uncommon for checkpointing to take half an hour or more to complete. While checkpointing is in progress, the server is prevented from performing other functions.
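The I/O cost described above can be made concrete with a small counting sketch. The function name `checkpoint_io_count` and its parameters are hypothetical, and the model is deliberately simplified: it assumes one write per marked record whose page is already in main memory, and a read followed by a write (two synchronous operations) when the page is not.

```python
def checkpoint_io_count(marked_pages, resident_pages):
    """Count synchronous I/O operations for one checkpoint (simplified model).

    marked_pages:   page identifier for each marked record, one entry per record
    resident_pages: set of page identifiers currently held in main memory
    """
    ios = 0
    for page in marked_pages:
        if page in resident_pages:
            ios += 1   # page is in memory: a single write to its disk location
        else:
            ios += 2   # page must be read and merged, then written back
    return ios


# Four marked records; only pages 1 and 2 are currently in main memory.
print(checkpoint_io_count([1, 2, 3, 4], {1, 2}))  # prints 6
```

Under this model the operation count grows with the number of marked records, and it doubles for every record whose page has to be fetched from the disk first.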
As the above discussion illustrates, the prior art checkpointing method does not perform checkpointing in a wholly satisfactory manner. Therefore, a more efficient mechanism is needed to address the problems of data recovery and preservation.