Application programs in a computer system typically need to manage data in a manner that permits frequent updating. Two broad examples of such application programs are a word processor and a database manager. Word processors need to be able to manipulate sections of text and other related information each time the user modifies a document, and a database program needs to insert, delete and modify entries in accordance with a user's requirements.
One issue that often faces developers of these types of application programs is the trade-off between storage space and speed of execution. For example, database programs typically manage data in the form of one or more tables. Each record in a table has the same number of fields, each of which is in the same format, and all of which are described in a separate data structure which the program maintains in conjunction with the table. In such a table, each row might represent a different record, and each column might represent a different field within the record. If the table is maintained as a simple array, then each field is assigned a predetermined maximum length, and storage is allocated for the maximum length of each field in each record. This clearly results in much wasted storage space, since the data in most fields will not occupy the full amount of space allotted. The developer can save space by storing each variable-length field merely as a fixed-length offset into an area of storage which contains the actual data, but this entails a level of indirection which must be followed every time it is necessary to access the particular data. This can detrimentally impact the speed with which certain operations, such as searching, are performed.
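The two layouts described above can be sketched as follows. This is a minimal illustration only; the two-field schema, the field widths, and the names `pack_fixed`, `read_fixed_name` and `OffsetTable` are assumptions made for the example and do not come from any particular database program.

```python
import struct

# Hypothetical two-field record: name (max 32 bytes) and city (max 24 bytes).
FIXED_FORMAT = "32s24s"
FIXED_SIZE = struct.calcsize(FIXED_FORMAT)  # 56 bytes for every record


def pack_fixed(name: str, city: str) -> bytes:
    """Store one record in a fixed-width slot; short values are NUL-padded."""
    return struct.pack(FIXED_FORMAT, name.encode(), city.encode())


def read_fixed_name(table: bytes, row: int) -> str:
    """Direct access: the field position is computed from the row number."""
    start = row * FIXED_SIZE
    return table[start:start + 32].rstrip(b"\0").decode()


class OffsetTable:
    """Each record holds fixed-size (offset, length) slots into a shared heap."""

    SLOT = struct.Struct("<IH")  # 4-byte offset, 2-byte length (assumed sizes)

    def __init__(self) -> None:
        self.slots = bytearray()  # fixed-size part, two slots per record
        self.heap = bytearray()   # variable-length data, packed end to end

    def append(self, name: str, city: str) -> None:
        for value in (name, city):
            data = value.encode()
            self.slots += self.SLOT.pack(len(self.heap), len(data))
            self.heap += data

    def read_name(self, row: int) -> str:
        # Extra step: read the slot first, then follow it into the heap.
        offset, length = self.SLOT.unpack_from(self.slots, row * 2 * self.SLOT.size)
        return bytes(self.heap[offset:offset + length]).decode()
```

With two short records, the fixed-width table occupies 112 bytes whatever the data, while the offset table occupies only the slots plus the bytes actually stored. Every read of the offset table, however, takes the additional lookup through the slot.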
Another space utilization issue which database program developers often face when data is stored and maintained as tables is that it is very often desirable to include a particular field in only one or a few of the records in the table, such field being unnecessary in the vast majority of the records. A large amount of unused space must be allocated to maintain such a field in all of the records, even if the developer seeks to minimize the wasted space through indirection. The database program developer can reduce the amount of wasted space by maintaining the data as a linked list rather than as an array, but again, only with the penalty of extensive additional overhead for operations such as searching. Additionally, linked list implementations often do not save very much space, since some storage must still be allocated to indicate that a particular field is empty in a particular record. The developer may be able to reduce the speed penalty by adding cross-pointers in the linked list, but this technique again increases the amount of storage space used.
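The contrast between the array and the linked list can be sketched as follows. The schema in `FIELDS`, the sample data, and the names `Node`, `make_record` and `get` are hypothetical and chosen only for illustration. In this sketch an absent field is simply omitted from the list, which is one possible design and not necessarily the one a given program would use.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FIELDS = ["name", "phone", "fax", "notes"]  # hypothetical schema

# Array layout: every record reserves a slot for every field, used or not.
dense = [
    ["Ann", "555-0100", None, None],
    ["Bo", "555-0101", None, None],
    ["Cy", "555-0102", "555-0199", None],
]


@dataclass
class Node:
    """One field of one record in the linked-list layout."""
    key: str
    value: str
    next: Optional["Node"] = None


def make_record(**present: str) -> Optional[Node]:
    """Build a list holding only the fields that are actually present."""
    head = None
    for key, value in reversed(list(present.items())):
        head = Node(key, value, head)
    return head


def get(head: Optional[Node], key: str) -> Tuple[Optional[str], int]:
    """Return (value, nodes visited); the list is walked from the front."""
    steps = 0
    node = head
    while node is not None:
        steps += 1
        if node.key == key:
            return node.value, steps
        node = node.next
    return None, steps
```

Reading a field from the array is a single indexed access, such as `dense[2][FIELDS.index("fax")]`, whereas `get` must visit each node in turn, and a search for an absent field walks the entire list before it can report that the field is missing.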
The trade-off between storage space usage and speed of access becomes more severe as the data being managed, if expressed as an array, becomes more sparse. Accordingly, there is a need for a method of managing data which minimizes both the usage of space and the time required to access the data.
Another issue faced by application program developers is that for many types of application programs, the file structure offered by the operating system is not appropriate to the task. Typical of the data storage managers offered by operating systems are those offered by MS-DOS®, Unix®, and the Apple Macintosh®. All of these operating systems store data in "files". A file is maintained in a "directory" of files, and directories may be maintained as parts of other directories, thereby creating a tree structure for the files. If the storage apparatus managed by the operating system contains more than one storage medium, such as different hard disks, floppy disks, CD-ROMs, remote storage media accessible via a network, or local volatile memory, then each such medium usually has its own root directory within which all of the files stored on the medium exist. Unix® and Macintosh® also support aliasing of files, whereby a file may appear in several different directories, although only one instance contains the data of the file. All the other instances merely refer to the one real instance.
In these file systems, for many of the frequently needed operations, the smallest unit of information supported by the operating system is a file. Since the speed penalty involved in operating system calls to open and close files is significant, application programs tend to maintain data in only one or a few files rather than attempt to take advantage of the file system structure supported by the operating system. For example, a database program developer may be able to avoid a large amount of data movement when a record is inserted or deleted, merely by maintaining each record in a separate file. As another example, a database application program may wish to maintain each field of a record in a separate file, thereby inherently implementing variable length fields. Neither of these techniques is practical, however, since they would require enormous numbers of operating system calls to open and close files, thereby imposing a substantial speed penalty.
Since many application programs maintain their data in only one or a few files, each such program requires the development and implementation of a proprietary data format which allows the application to quickly store and retrieve the data which the particular application program expects. Developers therefore often maintain large libraries of code for accessing their own proprietary file formats. One example is the MacWrite program, which maintains its own mechanism for moving data to and from memory. The mechanism is optimized for the particular file format used, and is not directly usable by other application programs. Other application programs have essentially similar mechanisms. The result is an immense duplication of effort that could otherwise be directed toward enhanced user functionality.
Accordingly, there is a significant need for operating system support of data storage in a form which is useful to a wide variety of application programs.
Another issue which application developers often face arises when data is stored in different parts of a data storage apparatus, which have different protocols for access. For example, storage apparatus in a computer system may include not only persistent storage such as a hard disk, but also volatile storage such as the computer system's main memory. That is, if an application program wishes to minimize the number of reads and writes to a hard disk, it may maintain some of the data in main memory for as long as possible before the space it occupies in main memory becomes necessary for another purpose. One frequent example is a word processor's need to maintain some portion of a document currently being edited in memory, and other portions of the document out on disk. Such a technique, known as caching, often requires the application program to keep track of which data is currently on which medium, and use the appropriate protocol for accessing that medium. For example, if the data is on disk, the application program typically uses the operating system's read and write calls to access the data. If the data is in main memory, then the application program can avoid all the overhead of the operating system merely by addressing the particular memory locations at which the data may be found. If the data is in ROM, yet a third data access protocol may be necessary. There is a need in the industry to simplify the implementation of application programs by providing a common mechanism by which the application developer can access data regardless of how or where it is stored in the computer system's storage apparatus.
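By way of illustration only, a common access mechanism of the kind called for above might take the following form. The interface `Store` and the classes `MemoryStore` and `DiskStore` are hypothetical names introduced for this sketch; they are not taken from any operating system or existing library, and the sketch does not address caching policy.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical uniform interface: callers never see where the data lives."""

    @abstractmethod
    def read(self, offset: int, size: int) -> bytes: ...

    @abstractmethod
    def write(self, offset: int, data: bytes) -> None: ...


class MemoryStore(Store):
    """Data held in main memory and accessed by direct addressing."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self.buf[offset:offset + size])

    def write(self, offset: int, data: bytes) -> None:
        self.buf[offset:offset + len(data)] = data


class DiskStore(Store):
    """Data held in a file and accessed through read and write calls."""

    def __init__(self, path: str, size: int) -> None:
        self.file = open(path, "w+b")
        self.file.write(b"\0" * size)  # pre-size the file with zero bytes
        self.file.flush()

    def read(self, offset: int, size: int) -> bytes:
        self.file.seek(offset)
        return self.file.read(size)

    def write(self, offset: int, data: bytes) -> None:
        self.file.seek(offset)
        self.file.write(data)
        self.file.flush()

    def close(self) -> None:
        self.file.close()
```

An application written against `Store` issues the same `read` and `write` calls whether a given portion of a document is currently in memory or on disk; only the code that constructs the store needs to know which medium is in use.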
Many application program developers also face yet another issue if the data maintained by the program is intended to be accessible, and modifiable, by more than one user. The term "shared structured storage" can be defined as a mechanism for making data persistent across sessions (invocations) of an application program, with the data being available for collaborative updating. For example, in a word processor, it is often desirable to support the ability of two or more different users to update a single document at the same time. In a database system, it is often desirable to permit different users to update the database data concurrently. Most application programs implement a technique known as "pessimistic concurrency" which, while permitting many users to read and view the data concurrently, permits only one user to modify the data at a time. The system "locks out" all other users from write accesses when one user has the data open for updating. Pessimistic concurrency can be implemented at a file level or, in sophisticated database programs for example, at a record level. With file level locking, only one user may have the file open for writing at a time. This is the typical manner in which word processors implement concurrency. A database program can implement record level locking if, for example, a backend process is the only process which has the data file open for writing, and all other users issue their commands and queries through the backend process.
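The bookkeeping behind pessimistic concurrency can be sketched as a table of write locks. The class `LockManager` and its methods are hypothetical names used only for this example. The sketch assumes that all requests pass through a single process, as with the backend process described above, and therefore does not itself guard against simultaneous calls.

```python
from typing import Dict, Hashable


class LockManager:
    """Grants at most one writer per resource; readers are never refused."""

    def __init__(self) -> None:
        self._writers: Dict[Hashable, str] = {}  # resource -> user holding it

    def open_for_write(self, resource: Hashable, user: str) -> bool:
        """Return True if the user now holds (or already held) the write lock."""
        holder = self._writers.setdefault(resource, user)
        return holder == user

    def open_for_read(self, resource: Hashable, user: str) -> bool:
        """Reading is always permitted, even while another user is writing."""
        return True

    def release(self, resource: Hashable, user: str) -> None:
        """Give up the write lock; a request from a non-holder is ignored."""
        if self._writers.get(resource) == user:
            del self._writers[resource]
```

The granularity of locking depends only on what is used as the resource key. Keying on a file name such as `"report.doc"` gives file level locking, while keying on a pair such as `("customers.db", 17)` gives record level locking, so that two users may update different records of the same file at the same time.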
Some application programs have attempted to implement "optimistic concurrency", in which two or more users can modify data at the same time, subject to subsequent reconciliation. One example is the Macintosh® Programming Workshop (MPW) Projector available from Apple Computer, Inc., Cupertino, Calif. MPW Projector is described in the MPW 3.1 Reference Manual, and in H. Kanner, "Projector, An Informal Tutorial", available from Apple Computer, Inc. (1989). MPW Projector is an integrated set of tools and scripts whose primary purpose is to maintain control of the development of source code. It preserves in an orderly manner the various revisions of a file, and through the versioning mechanism also prevents one programmer from inadvertently destroying changes made by another. If the underlying data is text, data compression is achieved by storing only one complete copy of a file and storing revisions only as files of differences. Different users of the same set of files can view them differently since each user is given independent control of the mapping between the user's local directory hierarchy, in which the user keeps the files, and the hierarchy used for their storage in the main Projector database. Projector also has a facility for associating a specific set of file revisions with a name, this name being usable as a designator for a particular version, or release, of a product. Thus the name alone can be used to trigger the selection of just those source files that are required to build the desired instance of the product.
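The general idea of storing a revision as a file of differences can be sketched as follows. This is not MPW Projector's actual storage format, which is not described here; the functions `make_delta` and `apply_delta` are hypothetical, and the sketch uses the `SequenceMatcher` class of Python's standard `difflib` module to find the changed lines.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# One change: replace old[start:end] with the given new lines.
Delta = List[Tuple[int, int, List[str]]]


def make_delta(old: List[str], new: List[str]) -> Delta:
    """Record only the regions of `old` that differ, with their replacements."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old: List[str], delta: Delta) -> List[str]:
    """Rebuild the revision from the complete copy plus the stored differences."""
    result: List[str] = []
    position = 0
    for start, end, lines in delta:
        result.extend(old[position:start])  # unchanged lines before the change
        result.extend(lines)                # replacement (empty for a deletion)
        position = end
    result.extend(old[position:])           # unchanged lines after the last change
    return result
```

Only the complete copy and the list returned by `make_delta` need to be kept; unchanged lines are never stored a second time, which is the source of the space saving for text files.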
MPW Projector maintains versions in a tree structure. When one user desires to modify a file in the main Projector database, the user "checks out" the file, thereby making a copy of the file in the user's own directory. The user can check out a file either as "read-only" or, if no one else has already done so, as "read/write". After modifying the file, the user can then "check in" the file back to the main Projector database, either as a new version in a new branch of the file version tree or, only if the file was checked out as read/write, as a new version in the same branch of the version tree. When it is finally desirable to merge a branch of the revision tree back into the main trunk, MPW Projector performs a strict text-based comparison between the two versions of the file and displays the differences in a pair of windows on the computer system display. A user then cuts and pastes portions from one window into the other in order to merge them together.
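The check-out and check-in rules just described can be sketched as follows. The classes `Revision` and `VersionTree`, the integer branch identifiers, and the use of `PermissionError` are assumptions made for the example; they model only the rules stated above and are not MPW Projector's implementation.

```python
import itertools
from typing import Dict, List, Optional


class Revision:
    """One stored version of a file, linked to its parent in the tree."""

    def __init__(self, text: str, branch: int, parent: Optional["Revision"] = None) -> None:
        self.text = text
        self.branch = branch
        self.parent = parent
        self.children: List["Revision"] = []


class VersionTree:
    def __init__(self, text: str) -> None:
        self._branch_ids = itertools.count(1)          # branch 0 is the trunk
        self.root = Revision(text, branch=0)
        self._writers: Dict[Revision, str] = {}        # revision -> read/write holder

    def check_out(self, rev: Revision, user: str, writable: bool = False) -> str:
        """Hand the user a private copy; only one user may hold it read/write."""
        if writable:
            if rev in self._writers:
                raise PermissionError("already checked out as read/write")
            self._writers[rev] = user
        return rev.text

    def check_in(self, rev: Revision, user: str, new_text: str) -> Revision:
        """Read/write holders extend the same branch; all others start a new one."""
        if self._writers.get(rev) == user:
            branch = rev.branch
            del self._writers[rev]
        else:
            branch = next(self._branch_ids)
        child = Revision(new_text, branch, parent=rev)
        rev.children.append(child)
        return child
```

If one user holds a revision as read/write and a second user checks in a change to the same revision, the second user's version lands on a new branch, leaving the first user free to continue the original branch. Merging the branches back together is a separate step which this sketch does not cover.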
While MPW Projector is a good first step toward optimistic concurrency, significant additional flexibility is highly desirable. For example, its finest level of granularity is still represented by a "file". It would be desirable to support much finer levels of granularity. As another example, MPW Projector's provisions for merging two versions of a single document together are limited to a single procedure in which the computer identifies strict text differences, and a user indicates how each text difference should be resolved. Significant additional intelligence would be desirable in the comparison procedure, as would significantly increased flexibility and automation in the resolution of conflicts, as well as support for comparisons between non-text files.
Some developers of application programs have attempted to use the Resource Manager, available from Apple Computer, to implement structured storage of data. The Resource Manager is described in Apple Computer, "Inside Macintosh: Overview", Chap. 3 (1992), incorporated herein by reference. The Resource Manager does not support concurrent updating of its data, however, and in any event was not designed for this purpose. The Resource Manager therefore fails to provide an adequate solution.
Accordingly, there is a need for much greater flexibility in the support of optimistic concurrency in the maintenance of data.