Application programs in a computer system typically need to manage data in a manner that permits frequent updating. Two broad examples of such application programs are a word processor and a database manager. Word processors need to be able to manipulate sections of text and other related information each time the user modifies a document, and a database program needs to insert, delete and modify entries in accordance with a user's requirements. Updating is an issue for computer manufacturers as well, for example to permit upgrades to operating system routines, including those which are provided in ROM.
The above two related patent applications set forth a variety of issues that often face software application developers. For example, as set forth in more detail in the above-mentioned STORAGE MANAGER FOR COMPUTER SYSTEM patent application, there is a trade-off between storage space and speed of execution. Reduction of the amount of wasted space in a database, for example, often detrimentally impacts the speed with which certain operations are performed, such as searching. Also as set forth in the STORAGE MANAGER FOR COMPUTER SYSTEM, for many types of application programs, the file structure offered by the operating system is not appropriate to the task. Since the smallest unit of information supported by the operating system typically is a file, and since file manipulation operating system calls are slow and inefficient for small pieces of data, many application programs tend to maintain their data in a proprietary format in only one or a few files each containing many small data items. The result is extensive duplication of effort to define and maintain such proprietary formats, efforts that could otherwise be directed toward enhanced functionality.
Also as set forth in the STORAGE MANAGER FOR COMPUTER SYSTEM patent application, software developers often face issues when data is stored in different parts of a data storage apparatus which have different protocols for access. There is a need in the industry to simplify the implementation of application programs by providing a common mechanism by which the application developer can access data regardless of how or where it is stored in the computer system's storage apparatus.
Many application program developers also face yet another issue if the data maintained by the program is intended to be accessible, and modifiable, by more than one user. As used herein, persistent storage of information refers to information which remains after the application program which references or creates it terminates. Persistent storage is often nonvolatile in that it also survives shutdown of the computer system, but in some situations can be partially or completely volatile. In a word processor, it is often desirable to support the ability of two or more different users to update a single document at the same time. In a database system, it is often desirable to permit different users to update the database data concurrently. Most application programs implement a technique known as "pessimistic concurrency" which, while permitting many users to read and view the data concurrently, permits only one user to modify the data at a time. The system "locks out" all other users from write accesses when one user has the data open for updating.
Pessimistic concurrency can be implemented at a file level or, in sophisticated database programs for example, at a record level. That is, for file level locking, only one user may have the file open at a time for writing. This is the typical manner in which word processors implement concurrency. A database program can implement record level locking if, for example, a backend process is the only process which has the data file open for writing, and all other users issue their commands and queries through the backend process.
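By way of illustration only, record level locking of the kind described above might be sketched as follows. The class `RecordLockManager` and its methods are hypothetical names introduced for this sketch and stand in for the backend process; they do not appear in any referenced system.

```python
class RecordLockManager:
    """Hypothetical backend: at most one writer per record at a time."""

    def __init__(self):
        self._owners = {}  # record id -> user currently holding the write lock

    def acquire(self, record_id, user):
        """Return True if `user` holds the write lock on `record_id` afterwards."""
        owner = self._owners.setdefault(record_id, user)
        return owner == user

    def release(self, record_id, user):
        """Release the lock only if `user` is the current holder."""
        if self._owners.get(record_id) == user:
            del self._owners[record_id]


locks = RecordLockManager()
got_a = locks.acquire("rec-1", "alice")  # alice opens record 1 for writing
got_b = locks.acquire("rec-1", "bob")    # bob is locked out of record 1
got_c = locks.acquire("rec-2", "bob")    # a different record is still available
locks.release("rec-1", "alice")
got_d = locks.acquire("rec-1", "bob")    # bob may now take record 1
```

File level locking corresponds to the degenerate case in which the whole file is treated as a single record identifier.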
Some database programs have implemented "optimistic concurrency", in which two or more users can update data at the same time. To the extent the concurrent updates conflict, only one can successfully be committed. Optimistic concurrency is different from pessimistic concurrency in that users are permitted to make updates concurrently, subject to subsequent detection of conflicts and resulting inability to commit the updates.
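One common way to detect such conflicts at commit time, shown here only as a minimal sketch, is to compare a version counter captured when the data was read against the current counter. The names `VersionedStore`, `begin` and `commit` are assumptions made for this illustration.

```python
class VersionedStore:
    """Hypothetical store that detects conflicting commits by version number."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def begin(self):
        """Return a snapshot (value, version) without taking any lock."""
        return self.value, self.version

    def commit(self, new_value, read_version):
        """Commit only if no other commit happened since the snapshot."""
        if read_version != self.version:
            return False  # conflict: the update cannot be committed
        self.value = new_value
        self.version += 1
        return True


store = VersionedStore("draft")
a_val, a_ver = store.begin()   # two users start from the same state
b_val, b_ver = store.begin()
a_ok = store.commit(a_val + " + alice", a_ver)  # first commit succeeds
b_ok = store.commit(b_val + " + bob", b_ver)    # second commit is rejected
```

Both users are allowed to edit concurrently; only the commit step decides which update survives.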
These update techniques produce serializability of updates--the characteristic that for a set of committed updates to the data, at least one sequence exists in which those specific updates could have been performed to achieve the resulting state of the information. Optimistic concurrency permits increased performance in some circumstances, but still produces serializable committed updates.
A few programs implement an update model known as "versioning," which does challenge the requirement of serializable updates. In a versioning mechanism, several concurrent updaters can each have an independent yet internally consistent view of the information. The views are known as "configurations". The different updaters modify the information concurrently, and can write their updated configurations to persistent storage, all subject to subsequent reconciliation. One example of a program implementing a versioning update model is the Macintosh® Programming Workshop (MPW) Projector available from Apple Computer, Inc., Cupertino, Calif. MPW Projector is described in the MPW 3.1 Reference Manual, and in H. Kanner, "Projector, An Informal Tutorial", available from Apple Computer, Inc. (1989), incorporated herein by reference.
While MPW Projector is a good first step toward reducing the constraints imposed by strict serializability, significant additional flexibility is highly desirable. For example, Projector's finest level of granularity is still represented by a "file". It would be desirable to support much finer degrees of granularity. As another example, MPW Projector's provisions for reconciling two conflicting versions of a document are limited to a single procedure in which the computer identifies strict text differences, and a user indicates how each text difference should be resolved. Significant additional intelligence would be desirable in the comparison procedure, as would significantly increased flexibility and automation in the resolution of conflicts, as well as support for comparisons between non-text information. Accordingly, there is a need for much greater flexibility in the support of non-serialized updates in the maintenance of data.
Increasingly, documents and other collections of stored information are made up of multiple content elements, such as text, tables, images, formatting information, mathematical equations and graphs. Often content is created using one application program and then included in documents created by other applications. Subsequently, content elements may be copied out of a document and used in yet another document, and so on.
In the past, different applications typically had no way to exchange multiple content elements, unless they had a "private contract" about the format to be used. Furthermore, one application typically had no way to find the content elements in another application's document, so typically it was not able to obtain content elements from the other application's documents even if it knew the format. Moreover, every application developer who wanted to store multiple content elements in a document typically had to develop a proprietary object storage mechanism.
The use of multiple content elements in a document implicates at least two difficult issues: where each element is located and what the format of the data is. Regarding the first of these issues, it would be desirable if the data in a particular element could be stored in memory, in a local persistent storage device, across the network, or even created dynamically, all in a manner which is transparent to the application program which is operating on the element. In this way the limited resources available to application program developers can be directed toward enhancement of functionality rather than dealing with multiple types of storage devices.
Similarly, with regard to the second issue, it would be desirable if each different content element could have stored in association with it all of the routines which are needed to manipulate it, again, transparently to the application program. This, too, would free up developers' resources for more useful purposes.
In a general way, an individual developer might obtain some of the transparency described above by programming the application using an object-oriented programming language such as C++. Object-oriented programming is described in many references, including, for example, G. Booch, "Object-Oriented Design With Applications" (Benjamin/Cummings Publishing Company: 1991), incorporated herein by reference. While these languages can be used to address the problems described above for handling multiple content elements, it is not clear how that can be done. Certainly the languages themselves do not provide guidance on how they can be used for such purposes. For example, the inheritance mechanism in C++ is a compile-time mechanism.
The issues that software developers face regarding update techniques are worthy of additional attention. In particular, it would be desirable to address the following more specific updating issues. First, in the situation where an operating system (or other program) is provided at one time and an update is distributed subsequently, the update mechanism needs to have a way of locating specific portions of the base version to be changed. The problem is referred to herein as patching. One way this was handled in the past was to provide a series of patches, each of which identifies a location in the base version, specifies something about its expected existing contents (to improve confidence that the update is correct), and specifies one or more operations to be performed at that location. Patches are very fragile and error prone, however. They are usually created by hand, thus requiring a person to generate them who knows the base operating system in extensive detail, often at the binary level. Patches will also fail, or even corrupt the user's system, if some aspect of the user's system configuration or prior update history was not taken into account when the patches were created. Patching can be performed also where the base version is provided in read-only memory, provided the system accesses such information using a level of indirection which is changeable. But this can be even more complicated than direct patches, and often creates additional problems of its own.
Accordingly, there is a need for an update mechanism which does not rely on the physical location of the base information which is to be updated.
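The conventional location-based patch described above can be sketched as follows, as an illustration only. The function `apply_patch` and its parameters are hypothetical; the sketch assumes the simplest operation, a same-length overwrite, and shows how dependence on a physical offset makes a patch fail once the base has already been altered.

```python
def apply_patch(image, offset, expected, replacement):
    """Overwrite bytes at `offset`, but only if the existing bytes match."""
    if len(replacement) != len(expected):
        raise ValueError("this sketch supports same-length replacement only")
    end = offset + len(expected)
    if bytes(image[offset:end]) != expected:
        raise ValueError(f"unexpected contents at offset {offset}")
    image[offset:end] = replacement


base = bytearray(b"\x01\x02\x03\x04\x05")
apply_patch(base, 1, b"\x02\x03", b"\xAA\xBB")  # base matches, patch applies

try:
    # The same patch now fails: prior update history changed those bytes.
    apply_patch(base, 1, b"\x02\x03", b"\xCC\xDD")
    second_failed = False
except ValueError:
    second_failed = True
```

The expected-contents check prevents silent corruption, but it cannot make the patch succeed on a system whose layout differs from what the patch author assumed.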
Second, as mentioned above, it is often desirable to support the ability of two or more different users to update information at the same time. Most application programs implement pessimistic concurrency, while only a few implement optimistic concurrency. Still fewer implement versioning. As mentioned above, MPW Projector is a good first step toward implementing versioning. MPW Projector is an integrated set of tools and scripts whose primary purpose is to maintain control of the development of source code. It preserves in an orderly manner the various revisions of a file, and through the versioning mechanism also prevents one programmer from inadvertently destroying changes made by another. If the underlying data is text, data compression is achieved by storing only one complete copy of a file and storing revisions only as files of differences. Different users of the same set of files can view them differently since each user is given independent control of the mapping between the user's local directory hierarchy, in which the user keeps the files, and the hierarchy used for their storage in the main Projector database. Projector also has a facility for associating a specific set of file revisions with a name, this name being usable as a designator for a particular version, or release, of a product. Thus the name alone can be used to trigger the selection of just those source files that are required to build the desired instance of the product.
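The idea of keeping one complete copy of a text file and storing each revision only as differences can be illustrated with the following sketch. It uses Python's standard `difflib` module; the functions `make_delta` and `apply_delta` and the delta format are assumptions made for this illustration and are not a description of Projector's actual storage format.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Record only what is needed to rebuild new_lines from old_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))              # reuse lines of the base
        else:
            delta.append(("insert", new_lines[j1:j2]))  # empty list for a deletion
    return delta


def apply_delta(old_lines, delta):
    """Rebuild the revision from the complete base copy plus the delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


base_lines = ["a", "b", "c"]
revision = ["a", "B", "c", "d"]
delta = make_delta(base_lines, revision)
rebuilt = apply_delta(base_lines, delta)
```

Only the changed lines are stored in the delta; unchanged regions are represented by index ranges into the base copy.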
MPW Projector maintains versions in a tree structure. When one user desires to modify a file in the main Projector database, the user "checks out" the file, thereby making a copy of the file in the user's own directory. The user can check out a file either as "read-only" or, if no one else has already done so, as "read/write". After modifying the file, the user can then "check in" the file back to the main Projector database, either as a new version in a new branch of the file version tree, or, only if the file was checked out as read/write, as a new version in the same branch of the version tree. When it is finally desirable to merge a branch of the revision tree back into the main trunk, MPW Projector performs a strict text-based comparison between the two versions of the file and displays the differences in a pair of windows on the computer system display. A user then cuts-and-pastes portions from one window into the other in order to merge them together.
As can be seen, MPW Projector's technique for reconciling different updates of a base version is useful only when the underlying data is text, and operates only by comparison of the two updates to be reconciled. No record is kept of the individual changes that were made from the base version to each of the updated versions. The reconciliation process for an implementation of version merging can be made significantly more intelligent if such a record were kept.
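To illustrate why a record of changes helps, the following minimal sketch merges two updates using per-updater change records instead of comparing the two results directly. The function `merge_with_change_records` and the representation of a change record as a mapping from element key to new value are assumptions introduced here for illustration only.

```python
def merge_with_change_records(base, changes_a, changes_b):
    """Merge two change records ({key: new_value}) made against one base."""
    merged = dict(base)
    conflicts = []
    for key in sorted(set(changes_a) | set(changes_b)):
        in_a, in_b = key in changes_a, key in changes_b
        if in_a and in_b and changes_a[key] != changes_b[key]:
            conflicts.append(key)           # both changed it differently
        elif in_a:
            merged[key] = changes_a[key]    # only A changed it, or both agree
        else:
            merged[key] = changes_b[key]    # only B changed it
    return merged, conflicts


base_doc = {"p1": "Intro", "p2": "Body", "p3": "End"}
alice = {"p1": "Introduction", "p2": "Main body"}
bob = {"p2": "Body text", "p3": "Conclusion"}
merged, conflicts = merge_with_change_records(base_doc, alice, bob)
```

Because each record states which elements an updater actually touched, non-overlapping changes can be merged automatically and only genuinely conflicting elements need to be referred to a user.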
Third, inherent in the desirability to support multiple concurrent updaters, each updater should be provided with an independent "view" of the information. That is, if a work includes a plurality of modules, each updater should have his or her "current" view of the information include all of the modules which have not been changed in his or her current version, plus only his or her current version of the modules which have been changed. MPW Projector accomplishes this, but only at the file level. Much finer granularity would be desirable. For example, if the information is a document, it would be desirable for the level of granularity to be as small as a paragraph, or even a sentence. If the information is in a database format, it would be desirable for the level of granularity to be possibly as small as a record or less.
Accordingly, it is desirable to have an update mechanism which supports fine degrees of granularity.
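An independent view of the kind described above, with paragraph-level granularity, might be sketched as an overlay of each updater's changed elements on the shared base. The sketch below uses `collections.ChainMap` from the Python standard library; the paragraph keys and user names are hypothetical.

```python
from collections import ChainMap

# Shared base version, one entry per paragraph.
base = {"para1": "Hello.", "para2": "World.", "para3": "Bye."}

# Each updater keeps only the paragraphs he or she has changed.
alice_changes = {"para2": "Wide world."}
bob_changes = {"para3": "Goodbye."}

# A view is the updater's changes layered over the unchanged base.
alice_view = ChainMap(alice_changes, base)
bob_view = ChainMap(bob_changes, base)

# Writes through a view land in that updater's own layer only.
alice_view["para1"] = "Hi."
```

Lookups fall through to the base for unchanged paragraphs, so neither updater sees the other's edits and the base itself is never modified.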
Fourth, when updates are made, the update mechanism should perform them atomically. That is, while the storage mechanism should be able to maintain partial updates in nonvolatile storage, so as to reduce memory requirements, the base document should always be recoverable in case of a system or power failure. Most application programs, such as word processors, open a temporary file in nonvolatile memory to store the partially edited version. Alternatively, in virtual memory machines, the partially edited version can remain in "virtual memory". When the user is ready to "commit" the updates, the application program copies any unedited portions of the base information into the temporary file so that the temporary file is complete, then renames the old base file to a second temporary name, then renames the first temporary file to the prior name of the base file, and then deletes the old file. Thus there is never a time when a complete, internally consistent version of the information does not exist in nonvolatile storage.
This technique becomes problematical, however, when the information file is extremely large. In particular, it can be seen that the technique requires twice as much available space in the nonvolatile storage medium as the ultimate file requires. In addition, the extensive amount of copying required by the mechanism can severely degrade performance.
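The temporary-file commit sequence described above can be sketched as follows. The function `commit_update` and the `.new` and `.old` suffixes are hypothetical choices for this illustration; the sketch assumes a single writer and ordinary local files.

```python
import os
import tempfile


def commit_update(base_path, new_contents):
    """Replace the base file so that a complete version always exists on disk."""
    tmp_path = base_path + ".new"   # first temporary file: the complete new version
    old_path = base_path + ".old"   # second temporary name for the old base file
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())        # force the new version to stable storage
    os.replace(base_path, old_path) # old base is still complete under its new name
    os.replace(tmp_path, base_path) # new version takes over the base name
    os.remove(old_path)             # finally discard the old version


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "document.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("base version")
    commit_update(path, "edited version")
    with open(path, encoding="utf-8") as f:
        result = f.read()
    leftovers = sorted(os.listdir(workdir))
```

Note that the old and new versions coexist in full until the final deletion, which is the source of the doubled space requirement and copying cost noted above.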
Another conventional technique for handling atomic updates is sometimes referred to as the "shadow page technique". In this technique, the base file is divided into pages, and an index to the current version of the pages is maintained. Updates are managed atomically at the page level rather than the file level, and are accomplished by writing the new version of a page, then updating the index to point to the new version rather than the old version of that page, and then deleting the old version of the page. In some implementations, the index itself may also be shadowed. The set of old pages, identified by the old index, and the set of new pages identified by the new index, completely describe the base and updated states of the information, respectively.
The shadow page technique avoids the large file problem mentioned above, but can still be inadequate in many situations. For example, the minimum granularity of a page may still be too coarse. Additionally, like other update mechanisms described above, reconciliation of two or more concurrently created updates is difficult in the shadow page technique in part because no record is kept of the changes which were made from the base version to each update version to be reconciled.
The shadow page technique has yet another problem, which arises from the fact that different pages may end up at different places in the file. The indexing mechanism permits random placement of these pages. A consecutive read of the information in the file therefore can cause extensive jumping around inside the file, thereby degrading performance.
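A minimal in-memory sketch of the shadow page technique, including the scattering of pages that it causes, is given below. The class `ShadowPagedFile` and the use of a dictionary of numbered slots to stand in for physical positions in the file are assumptions made for this illustration.

```python
class ShadowPagedFile:
    """Hypothetical paged file: an index maps logical pages to physical slots."""

    def __init__(self, pages):
        self._slots = dict(enumerate(pages))   # physical slot -> page data
        self._index = list(range(len(pages)))  # logical page -> physical slot
        self._next_slot = len(pages)

    def read(self):
        """Read pages in logical order by following the index."""
        return [self._slots[slot] for slot in self._index]

    def physical_order(self):
        """Physical slots visited by a consecutive logical read."""
        return list(self._index)

    def update_page(self, page_no, data):
        new_slot = self._next_slot
        self._next_slot += 1
        self._slots[new_slot] = data       # 1. write the new version of the page
        old_slot = self._index[page_no]
        self._index[page_no] = new_slot    # 2. repoint the index (the commit point)
        del self._slots[old_slot]          # 3. delete the old version of the page


paged = ShadowPagedFile(["A", "B", "C"])
paged.update_page(1, "B2")
contents = paged.read()
order = paged.physical_order()
```

After a single update the logical page order no longer matches the physical order, so a consecutive read visits slots 0, 3 and 2, which illustrates the jumping around described above.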
Yet another conventional update technique can combine shadow pages with the use of a log file. As changes are made in a temporary version of the information, for example in memory, the individual changes are recorded in a log which is written out to nonvolatile storage. Periodically, or at user-specified times, the current state of the information as represented in memory is written to persistent storage and a marker is written to the log file to indicate that persistent storage is consistent up to that point in the log file. The problem of randomly located pages is minimized in this technique since speed optimizations such as elevator algorithms can be used during the write.
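The change log and consistency marker described above might be sketched as follows. The class `LoggedStore` is hypothetical, shadow pages are omitted for brevity, and the `recover` method is an assumption about how such a log would typically be used after a failure rather than something stated above.

```python
class LoggedStore:
    """Hypothetical store: changes are logged, then checkpointed periodically."""

    def __init__(self):
        self.memory = {}      # temporary (volatile) version of the information
        self.persistent = {}  # last state written to persistent storage
        self.log = []         # change log kept in nonvolatile storage

    def change(self, key, value):
        self.log.append(("set", key, value))  # record the individual change
        self.memory[key] = value

    def checkpoint(self):
        self.persistent = dict(self.memory)   # write the current state out
        self.log.append(("checkpoint",))      # marker: consistent up to here

    def recover(self):
        """Rebuild state from persistent storage plus changes after the marker."""
        last = max(
            (i for i, entry in enumerate(self.log) if entry[0] == "checkpoint"),
            default=-1,
        )
        state = dict(self.persistent)
        for _, key, value in self.log[last + 1:]:
            state[key] = value
        return state


store = LoggedStore()
store.change("a", 1)
store.checkpoint()
store.change("b", 2)
store.memory = {}            # simulate losing the in-memory version
recovered = store.recover()
```

Only log entries after the most recent marker need to be replayed, since persistent storage already reflects everything before it.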
There are many variations of log-enhanced update techniques, including some in which the change log identifies base pages logically rather than physically. An extreme example of log-enhanced updating techniques is set forth in Douglis and Ousterhout, "Log Structured File Systems," Compcon '89 (1989), incorporated herein by reference. The Ousterhout paper describes a log-structured file system which maintains only the change log in persistent storage. No complete set of the information exists in persistent storage. Rather, it is always merely reconstructed in memory, to the extent needed, by traversing the log file. Persistent storage also maintains a pointer to the last known valid change specified in the log file, and atomicity of updates is accomplished by redirecting that pointer to point to a subsequent position in the log file at the time of each "commit".
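The exclusively log-based approach, as characterized above, can be illustrated with the following sketch, in which only the log and a commit pointer are kept and the state is rebuilt by traversing the log. The class `LogOnlyStore` and its methods are hypothetical and are not taken from the cited paper.

```python
class LogOnlyStore:
    """Hypothetical store holding only a change log and a commit pointer."""

    def __init__(self):
        self.log = []        # the only persistent copy of the information
        self.committed = 0   # pointer: log entries before this index are valid

    def append(self, key, value):
        self.log.append((key, value))  # not yet visible to readers

    def commit(self):
        self.committed = len(self.log)  # atomic step: redirect the pointer

    def load(self):
        """Reconstruct the current state by traversing the committed log."""
        state = {}
        for key, value in self.log[:self.committed]:
            state[key] = value
        return state


log_store = LogOnlyStore()
log_store.append("a", 1)
log_store.commit()
log_store.append("a", 2)          # written to the log but not committed
before = log_store.load()
log_store.commit()
after = log_store.load()
```

Because a single pointer defines the valid state, such a scheme has exactly one current version of the information at any time.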
Neither the log-enhanced techniques nor the exclusively log-based technique, however, supports any concept of more than one valid consistent state of the information, and therefore does not support nonserialized concurrency well. They also fail to adequately handle situations like the operating system patch example set forth above, and do not permit different users to have independent views of the current state of the information being updated.
Accordingly, it is desirable to provide an update mechanism which can operate at fine granularity, support independent views of the information by multiple updaters, facilitate reconciliation of concurrent updates, and maintain atomicity of updates without requiring large amounts of free disk space or substantially degrading performance. Moreover, it would be desirable if the same update mechanism could be used for all different types of information, including operating systems, databases, documents and so on. Still further, it would be desirable for the update mechanism to be integrated with other container manager mechanisms intended to reduce storage space without degrading performance, handle fine granularity units of information, support a wide variety of different types of content elements, and unify the methods by which the application programs access different kinds of storage media.