Computing systems typically include volatile memory associated with a central processor, and persistent memory such as rotating magnetic disks or tape, for storing data. When the operating system restarts, the contents of volatile memory are lost, but the persistent memory should retain the data stored therein even when the computer system does not operate continuously.
The ability to persistently store data is useless if the data cannot be reliably retrieved at a later time. But even persistently stored data can become irretrievable due to failure in the power source, the physical hardware, or software. Such failures typically result in the loss of data that was intended to be persistently stored.
Successfully processing and managing persistent data in a computing system involves considerations of several issues including the performance of programs that process persistent data, the durability of changes to persistent data, the ability to make atomic updates to complex data structures, and the complexity of persistent data processing programs.
Disk storage devices typically store and retrieve data as a vector of fixed-size blocks (e.g., 512 8-bit bytes), where a block is itself a vector of bits. Such disk storage devices are connected to computing systems through an input/output (I/O) adaptor that permits data to be transferred in both directions between the disk device and the central processor's volatile memory. This reading and writing of disk blocks occurs under control of the central processor. The disk storage devices are addressed through bitfiles, or vectors of uniquely identified bitfile pages, where each bitfile page is a vector of disk blocks. Individual bitfile pages are accessed via a memory cache that is referenced by bitfile page, as opposed to disk block.
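The relationship between bitfile pages and disk blocks can be illustrated with a minimal sketch. It assumes, purely for illustration, that each bitfile page occupies a fixed number of contiguous disk blocks; the constant BLOCKS_PER_PAGE and the function locate are hypothetical names, not part of any described system.

```python
BLOCK_SIZE = 512       # bytes per disk block (the example size given in the text)
BLOCKS_PER_PAGE = 8    # hypothetical: disk blocks per bitfile page


def locate(page_no, byte_in_page):
    """Map a byte within a bitfile page to (disk block number, byte within block).

    Assumes pages are laid out contiguously, which is an illustrative
    simplification rather than a property stated in the text.
    """
    if not 0 <= byte_in_page < BLOCK_SIZE * BLOCKS_PER_PAGE:
        raise ValueError("byte offset lies outside the bitfile page")
    block_in_page, byte_in_block = divmod(byte_in_page, BLOCK_SIZE)
    return page_no * BLOCKS_PER_PAGE + block_in_page, byte_in_block
```

A cache keyed by page number would call such a mapping only when a page must actually be read from or written to the device.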
Consider how a computer system modifies data permanently stored on a magnetic disk. Since the central processor can only act directly on data within its processor memory, the following steps must occur to modify even a single bit of persistently stored data:
(1) A copy of the data block containing the bit to be modified is read into central processor volatile memory, a step that typically takes about 10 ms to 20 ms;
(2) The central processor modifies the copy of the bit in its volatile memory, this step typically requiring 1 µs or so. At this time, the disk data processing code regards the modification as having been "logically" written to the disk device;
(3) The copy of the disk block in the central processor volatile memory, including the now modified bit, is written back to the persistent disk storage device. As in step (1), this action will typically take 10 ms to 20 ms.
It is important to realize that until the modified data block is successfully written to the disk device in step (3), the modified bit is not persistently stored. If power failure or hardware or system failure prevents completion of step (3), the modified data will not be persistently stored. Note too that it takes thousands of times longer to persistently modify a single bit than to modify a bit in the central processor's volatile memory. Thus, performance is a major design concern for any program that manipulates persistent storage.
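The read, modify and write-back cycle can be sketched as follows. This is a minimal illustration that uses an ordinary file as a stand-in for a disk device; the function names set_bit_persistently and make_demo_disk are hypothetical, and os.fsync is used here as an approximation of "written to the device", whose actual guarantee depends on the operating system and hardware.

```python
import os
import tempfile

BLOCK_SIZE = 512  # bytes per block, matching the example block size in the text


def set_bit_persistently(path, bit_index, value):
    """Read-modify-write a single bit of a block-structured file."""
    block_no, bit_in_block = divmod(bit_index, BLOCK_SIZE * 8)
    byte_in_block, bit_in_byte = divmod(bit_in_block, 8)
    with open(path, "r+b") as f:
        # Step (1): read the whole block into volatile memory.
        f.seek(block_no * BLOCK_SIZE)
        block = bytearray(f.read(BLOCK_SIZE))
        # Step (2): modify the in-memory copy of the bit.
        mask = 1 << bit_in_byte
        if value:
            block[byte_in_block] |= mask
        else:
            block[byte_in_block] &= ~mask & 0xFF
        # Step (3): write the whole block back and ask the OS to push it out.
        f.seek(block_no * BLOCK_SIZE)
        f.write(block)
        f.flush()
        os.fsync(f.fileno())


def make_demo_disk(num_blocks=4):
    """Create a zero-filled temporary file that stands in for a small disk."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(BLOCK_SIZE * num_blocks))
    return path
```

Until the write and sync in step (3) complete, the change exists only in the in-memory bytearray, which is the window of vulnerability described above.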
To mitigate the slow access time associated with disk data, persistent data processing programs commonly use memory caching. After initially being read from the disk device, the data is cached in the central processor volatile memory for some length of time. Such caching avoids reads from the disk device when the desired data is already in the cache. Such caching also avoids some writes by making more than one change to data within a disk block before writing the block back to the disk device, a technique known as "deferred write-back caching". Unfortunately, the deferred write-back caching technique makes it impossible for an application program to know when changes to disk data are in fact persistently stored on disk.
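A minimal sketch of deferred write-back caching follows. The disk is modeled as a dictionary from block number to bytes, and the class name WriteBackCache and its counters are hypothetical, introduced only to show how several modifications collapse into one device write.

```python
class WriteBackCache:
    """Toy deferred write-back cache over a dict that stands in for a disk."""

    def __init__(self, disk):
        self.disk = disk        # "persistent": block number -> bytes
        self.cache = {}         # volatile: block number -> bytearray
        self.dirty = set()      # blocks modified in cache but not yet written
        self.disk_reads = 0
        self.disk_writes = 0

    def _load(self, block_no):
        # Read from the device only on a cache miss.
        if block_no not in self.cache:
            self.cache[block_no] = bytearray(self.disk[block_no])
            self.disk_reads += 1
        return self.cache[block_no]

    def read_byte(self, block_no, offset):
        return self._load(block_no)[offset]

    def write_byte(self, block_no, offset, value):
        # The change is only "logically" written; the device is untouched.
        self._load(block_no)[offset] = value
        self.dirty.add(block_no)

    def flush(self):
        # One device write per dirty block, however many changes it absorbed.
        for block_no in sorted(self.dirty):
            self.disk[block_no] = bytes(self.cache[block_no])
            self.disk_writes += 1
        self.dirty.clear()
```

In this sketch the caller of write_byte has no way of telling whether a change has reached the device unless it also controls when flush runs, which mirrors the uncertainty noted above.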
Although of extreme importance, the durability of persistent data can be diminished by hardware failure and by the presence of deferred write-back caching.
A hardware failure (e.g., damage to the magnetic media) can prevent the disk device itself from successfully reading stored data, effectively rendering the stored data irretrievable. Often data stored on one disk device will be associated with data stored on a second disk device. The inability to access a single portion of stored data (for whatever reason) may prevent logical access to the remaining data, even though such data remains physically available on its storage device. The dependency of persistent data durability upon hardware reliability can be reduced by replicating the data on several storage devices. In this manner, the probability of simultaneously losing all replicas of the data is very low, and durability is enhanced. "Shadowing" (or "mirroring") is a straightforward replication technique in which several disk devices are configured as exact replicas of each other. Thus, each disk block is stored in the same location, and in exactly the same format, on each disk device. Logging is a more complex form of replication, wherein a copy of modified disk data is written to a separate log file. The log file data is used in restoring disk data that cannot be retrieved from its normal disk device. The log may store an exact copy of changed disk data ("physical logging"), or may store information from which the disk data can be regenerated ("operation logging").
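The difference between physical logging and operation logging can be sketched as two recovery routines. The record layouts and the function names recover_physical and recover_operational are hypothetical, and the "set" and "add" operations are invented for illustration; the sketch also assumes that operation logging starts from a known earlier copy of the blocks.

```python
def recover_physical(log):
    """Rebuild blocks from a physical log of (block number, block image) records.

    Each record holds an exact copy of the changed block, so the most
    recently logged image of each block is simply taken as its contents.
    """
    blocks = {}
    for block_no, image in log:
        blocks[block_no] = image
    return blocks


def recover_operational(initial_blocks, log):
    """Regenerate blocks by replaying (op, block number, offset, value) records.

    The log holds operations rather than data, so replay needs a starting
    copy of the blocks (an assumption of this sketch).
    """
    blocks = {n: bytearray(b) for n, b in initial_blocks.items()}
    for op, block_no, offset, value in log:
        if op == "set":
            blocks[block_no][offset] = value
        elif op == "add":
            blocks[block_no][offset] = (blocks[block_no][offset] + value) % 256
        else:
            raise ValueError("unknown log operation: %r" % (op,))
    return {n: bytes(b) for n, b in blocks.items()}
```

Physical records are larger but trivial to apply, while operation records are compact but tie recovery to the exact semantics and ordering of each operation.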
Durability can also be diminished through the use of write-back caching because a current copy of modified disk data exists only in volatile central processor memory until written to the disk device. A hardware or software fault abnormally shutting down the operating system at this time typically erases the memory cache contents. This erasure occurs even though the disk data processing code regards the modified data as having been logically written to the disk device. Logging can help solve the durability problem of write-back caching while retaining most of its performance benefits. Modifications to data in the buffer cache are logged so that the modifications can be restored if the modified buffer contents are lost before writing to disk. Another way to improve durability is, of course, not to use such caching. Alternatively, "careful writing" or "write through" caching techniques can be used, wherein algorithms wait at certain steps until persistent storage changes are completely written from central processor volatile memory to the disk storage device. However, such algorithms cannot be used with deferred write-back caching, and cannot benefit from the time-performance gains that write-back caching provides.
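How logging can protect changes held in a write-back cache may be sketched as below. The disk and the log are modeled as a dictionary and a list that survive a simulated crash, while the cache does not; the names LoggedCache and recover are hypothetical, and a real log would itself have to be forced to persistent storage before each change is treated as durable.

```python
class LoggedCache:
    """Toy write-back cache that logs every change before caching it."""

    def __init__(self, disk, log):
        self.disk = disk    # "persistent": block number -> bytes
        self.log = log      # "persistent": list of (block number, offset, value)
        self.cache = {}     # volatile: lost if the system fails

    def write_byte(self, block_no, offset, value):
        # Record the change in the log first, then modify only the cache.
        self.log.append((block_no, offset, value))
        if block_no not in self.cache:
            self.cache[block_no] = bytearray(self.disk[block_no])
        self.cache[block_no][offset] = value

    def flush(self):
        # Write cached blocks to disk; the log entries are then no longer needed.
        for block_no, block in self.cache.items():
            self.disk[block_no] = bytes(block)
        self.log.clear()


def recover(disk, log):
    """Reapply logged changes that may not have reached the disk."""
    for block_no, offset, value in log:
        block = bytearray(disk[block_no])
        block[offset] = value
        disk[block_no] = bytes(block)
    log.clear()
```

Discarding a LoggedCache instance without calling flush simulates a crash; calling recover on the surviving disk and log then restores the lost modifications.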
While durability of modifications to persistent data can be enhanced by replication, the recovery of data from the replica when a fault occurs can be more complicated. Data recovery is relatively straightforward when shadowing is used because an exact duplicate of the data is already available. But data recovery from a log file is more complex and highly dependent on the particular logging technique used. Applications essentially must be custom designed to use logging techniques. Data recovery using a careful writing technique is simpler, but is limited to a small class of persistent data structure designs.
In addition to performance and durability considerations, the processing and management of persistent data must take into account the "atomicity" of persistent data updates. "Atomicity" refers to the "all or none" manner in which disk storage devices write the data contained in a disk block. This necessity to write all of the data to disk or to write none of it ensures that different fields in a data structure within a block can be modified consistently in a single action. By "consistently" it is meant that the modified data in volatile storage will coincide with the modified data as actually stored in persistent memory.
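Atomicity is especially important where persistently stored data is modified or updated in a multi-step fashion, often in different system locations, as required by a user program. Consider, for example, the steps required for the creation of a new file on persistent storage:
(1) A file creating program initially calls a first function that creates a directory in persistent storage;
(2) The file creating program then calls a second function that creates a name in the file directory portion of persistent storage;
(3) The file creating program then calls a third function that allocates storage space in persistent storage for the named new file.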
Updates to the directory creating, file name creating and storage allocation creating mechanisms must be linked together according to the file creating system's data structure. All changes resulting from the multi-step process must be made together, or not at all. Conventionally, higher level functions in a user program manage and control any side effects of lower level functions involving modification or updating of persistent storage. This supervisory ability must be operative even if an error or fault occurs during a multi-step, multiple storage location modification or update. Thus, during modification or storage of data in persistent storage, including multi-step, multiple location changes, there is an interrelationship between the various functions that complicates the design of the functions. In addition, user programs involving modification or updating of persistent storage often have functions dealing directly with the log file (e.g., log writing and log trimming, or purging log file data when it is no longer needed). The necessity to directly deal with a log file further complicates the design of the user program functions. A complex system structure that can successfully manage multi-step updates involving multiple locations may be decomposed into a number of sub-functions, or "agents", that know how to manage certain types of transactions required for the overall procedure. Alternatively, and less desirably, each top level function in the system must have imbedded therein knowledge of all possible changes and side effects.
The ability to atomically update a persistent data structure can be compromised when the data structure extends beyond a single disk block. In such an instance, it is possible that not all disk blocks involved in a single update will actually be written. Where singly linked lists are involved, the persistent data structures may be designed so that only a single atomic write is needed to logically change the state of the data structure. A logging technique can achieve the same result, in that an atomic write of a log record determines whether a given set of changes logically occurs or not.
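The idea that a single atomic log write can decide the fate of a multi-block update can be sketched as follows. The record formats ("update", transaction id, block number, image) and ("commit", transaction id), and the function name apply_committed, are hypothetical; the sketch assumes each log record is itself written atomically.

```python
def apply_committed(disk, log):
    """Apply only those logged block updates whose transaction has a commit record.

    An update group without a commit record is treated as if it never
    happened, so the commit record alone determines whether the whole
    set of changes logically occurs.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _kind, _txid, block_no, image = rec
            disk[block_no] = image
    return disk
```

For example, a file creation that touches a directory block, a name block and an allocation block could log three update records followed by one commit record; if a fault occurs before the commit record is written, none of the three changes is applied during recovery.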
Not surprisingly, persistent data processing programs that provide good performance, durability and the ability to make atomic updates can readily become overly complex, especially when the persistent data structures they manage become large and complex. Traditionally such programs must deal with recovery from system and storage faults in a program-specific manner. Frequently the mechanisms available to solve different problems are not integrated with each other. For example, certain logging techniques, such as the write-ahead logging protocol, must be coordinated with the writing of modified data from the buffer cache. Unfortunately, the inability to control the operating system buffer cache means the data processing program must either forgo the benefits of such techniques or implement its own buffer cache, thus adding complexity to the solution.
One method of reducing complexity is to subdivide large programs into smaller functions that implement only a part of the overall system. However, an appropriate mechanism is necessary for these independently specified parts to work together to implement a complex function with the desired atomicity, durability and performance characteristics. Distributed transaction processing models address some of these considerations by describing how independently specified systems can be used to implement complex functions atomically and durably. However, such models have complicated interfaces that require significant central processor time for implementation. Further, many persistent storage processing details, such as buffer management and fault recovery, are typically beyond the scope of the transaction management mechanisms providing the framework for the entire system. Known transaction processing systems that provide a useful conceptual structure and mechanisms addressing the above-described factors are typically complex to use and demand significant additional processing overhead, making them inappropriate for a large class of persistent storage processing systems.
In summary, there is a need for a method of providing a simple, efficient and high performance mechanism to make complex changes to the persistent storage of data within a computer system. Such method should permit data changes to persist durably, despite hardware and software faults, and should not impose significant overhead penalties in its implementation. The present invention provides such a method.