1. Field of the Invention
The present invention relates to a system and method for storing large data files on a computer or computer network. More specifically, the present invention is related to the formatting of large data files to promote efficient data storage and transmission.
2. Statement of the Prior Art
Data is typically maintained for storage and retrieval in computer file systems, wherein a file comprises a collection of data or a set of records. A file system provides a collection of files and file management structures on a physical or logical storage device such as a disk or other memory device. A file system stores data in files, and provides an application programming interface (API) to facilitate accessing data stored on a storage medium such as a disk or other memory medium. A file system API provides various functions that are invoked by an application program to access data. Application programs control the internal format of a file and determine which data to store in which files. A file system typically allows files to be grouped into directories. Each directory may contain many files and sub-directories. A file system that groups files into directories and sub-directories is referred to as a hierarchical file system.
There is a continuing need to improve the access and control over file systems storing large quantities of variable-sized data records used in a large variety of applications. Applications involving accessing and controlling large quantities of stored data are found in the public sector, E-commerce, financial/insurance industry, travel industry, publishing industry, graphic arts industry, advertising industry and any other industry which requires managing large data files.
Examples where large amounts of data are stored in files in a hierarchical file system include database, logistics, and enterprise solutions software used by the financial, health and distribution industries, among others. Database, logistics and enterprise solutions software include an API to access large quantities of data.
In another example, computer-aided design (CAD) drawings prepared by architects, engineers, designers, planners, and the like require large amounts of data to be stored in files in a hierarchical file system. CAD software includes an API to access the large quantities of data. Applications such as MicroStation® products, which are developed by Bentley Systems, Inc., Exton, Pa., U.S.A., and AutoCAD® products, which are developed by Autodesk, Inc., San Rafael, Calif., U.S.A., are typical of such CAD software, which may be used in the Engineering, Construction, and Operations (ECO) marketplace. U.S. Pat. No. 6,063,128 provides an example of a CAD system.
A typical CAD project employed in the engineering context is stored in numerous files. Each file typically contains one or more engineering models, each of which represents an engineering domain (e.g., structural, electrical, mechanical, plumbing). Moreover, each engineering model requires numerous items represented by a series of elements to support the complex and precise nature of each design. In this context, the term “element” is used to mean a record containing a variable number of bytes of data arranged in a format that can be interpreted by a program. The term “element” differs from the common notion of an “object” in that each element can have a variable number of bytes, whereas the size of an object is typically defined by its class. It is the variable-sized nature of elements that complicates their persistent storage, because they cannot be written in fixed-size records and arranged in tables, as is typically done in relational databases, for example.
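By way of illustration only, the following minimal Python sketch contrasts a fixed-size record with a variable-sized element. The names POINT, encode_polyline and element_offsets, and the byte layout itself, are hypothetical and are not taken from any particular CAD product. The sketch shows that the position of a fixed-size record follows directly from its index, whereas the position of a variable-sized element depends on the sizes of all elements that precede it.

```python
import struct

# Hypothetical fixed-size record: a point stored as two 8-byte doubles.
POINT = struct.Struct("<dd")

def fixed_record_offset(index):
    """Offset of the index-th fixed-size record: a simple multiplication."""
    return index * POINT.size

def encode_polyline(vertices):
    """Hypothetical variable-sized element: a vertex count followed by the vertices."""
    body = b"".join(POINT.pack(x, y) for x, y in vertices)
    return struct.pack("<I", len(vertices)) + body

def element_offsets(encoded_elements):
    """Offsets of variable-sized elements: each depends on all earlier sizes."""
    offsets, position = [], 0
    for item in encoded_elements:
        offsets.append(position)
        position += len(item)
    return offsets

polylines = [
    encode_polyline([(0.0, 0.0), (1.0, 1.0)]),
    encode_polyline([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]),
    encode_polyline([(5.0, 5.0)]),
]
sizes = [len(p) for p in polylines]
offsets = element_offsets(polylines)
```

In this sketch the three elements occupy 36, 68 and 20 bytes respectively, so no single record size can describe them.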
Each item in a model is represented by at least one element or an aggregation of elements. For example, a structural drawing can hold the column and beam layout for a floor plan, which are internally represented by lines, squares and rectangles and additional properties. In this example, an individual beam may be a collection of line, square and rectangle elements. The structure of the floor plan may be more complex and require many nested levels of elements to accurately provide a structural representation.
Accordingly, as the complexity of the project increases, the size of the CAD files also increases. As a result, CAD files become very large and efficient access and control of these large files is important. Conventionally, there are two approaches to storing these large data files.
In the first approach, the elements are stored and accessed as a sequential list, each element having a fixed header containing the element's size. Storing data in this manner requires that the file be read sequentially from the beginning to the end. Typically, a program will read the elements from the file into memory and, at the same time, also store the “file position” of each element in memory.
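The sequential approach described above may be sketched as follows. This is a minimal illustration under stated assumptions: the fixed header is assumed to be a single 4-byte little-endian size field, an in-memory stream stands in for the disk file, and the names HEADER, write_elements and read_elements are hypothetical.

```python
import io
import struct

# Assumed fixed header: a 4-byte little-endian field holding the element's size.
HEADER = struct.Struct("<I")

def write_elements(stream, payloads):
    """Write each element as a fixed header followed by its variable-sized data."""
    for payload in payloads:
        stream.write(HEADER.pack(len(payload)))
        stream.write(payload)

def read_elements(stream):
    """Read the file from beginning to end, recording each element's file position."""
    stream.seek(0)
    elements = []
    while True:
        position = stream.tell()
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of file
        (size,) = HEADER.unpack(header)
        elements.append((position, stream.read(size)))
    return elements

demo_file = io.BytesIO()  # stands in for a disk file
write_elements(demo_file, [b"line", b"rectangle", b"sq"])
loaded = read_elements(demo_file)
positions = [position for position, _ in loaded]
```

Because the size of each element is known only from its header, the third element cannot be located without first reading the headers of the first two.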
This approach is well suited for the common scenario where a large number of elements are read from the disk, while only a small minority of them are modified during a single editing session. In this case, modified elements can often be rewritten to the file by a simple seek-and-write to the appropriate file position on an element-by-element basis. Unfortunately, this only works for elements whose contents change, but whose size in bytes remains the same or becomes smaller. When elements become larger during an editing session, they must be deleted from their original file position and moved to the end of the file. This tends to leave “holes” (deleted elements occupying file space) in the file that can only be removed by rewriting the entire file. Further, the size of the disk file can grow quite large, because it is not possible to remove deleted entries from the file without rewriting the entire file and invalidating all in-memory element positions.
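The seek-and-write behavior and the resulting holes may be illustrated by the following minimal sketch. The header layout (slot capacity, bytes in use, deleted flag) and the names append_element and update_element are assumptions made for illustration only; the passage above does not specify how a smaller element or a deleted element is marked in the file.

```python
import io
import struct

# Hypothetical header: slot capacity, bytes currently in use, deleted flag.
HEADER = struct.Struct("<IIB")

def append_element(stream, payload):
    """Write a new element at the end of the file and return its file position."""
    position = stream.seek(0, io.SEEK_END)
    stream.write(HEADER.pack(len(payload), len(payload), 0))
    stream.write(payload)
    return position

def update_element(stream, position, payload):
    """Rewrite in place if the element still fits; otherwise move it to the end."""
    stream.seek(position)
    capacity, _, _ = HEADER.unpack(stream.read(HEADER.size))
    if len(payload) <= capacity:
        # Same size or smaller: a simple seek-and-write at the old position.
        stream.seek(position)
        stream.write(HEADER.pack(capacity, len(payload), 0))
        stream.write(payload)
        return position
    # Larger: mark the old slot deleted (leaving a hole) and append the new data.
    stream.seek(position)
    stream.write(HEADER.pack(capacity, 0, 1))
    return append_element(stream, payload)

demo_file = io.BytesIO()  # stands in for a disk file
first = append_element(demo_file, b"beam-A")
second = append_element(demo_file, b"column-B")
same_place = update_element(demo_file, first, b"beam")
moved = update_element(demo_file, first, b"beam-A-extended")
file_size = len(demo_file.getvalue())
```

After the second update the original 15-byte slot at position 0 remains in the file as a hole, and any in-memory position recorded for that element must be replaced by the new position at the end of the file.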
The second approach is to apply a compression algorithm to the element data before it is written to the file. This can often result in substantial savings in the resultant file size, because many applications have element data typically containing a great deal of redundancy. However, with this approach, element data cannot be saved incrementally, because a change to a single element can result in an entirely different compressed file.
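The trade-off of the compression approach may be illustrated with the following minimal sketch, which uses the zlib module of the Python standard library as a stand-in for whatever compression algorithm an application might employ. The element contents are invented for illustration. The sketch compresses a redundant set of elements, edits a single element, and compresses again.

```python
import zlib

# Invented, highly redundant element data.
elements = [f"LINE layer=1 x={i} y={i * 2};".encode() for i in range(1000)]
raw = b"".join(elements)
packed = zlib.compress(raw)

# Edit a single element and compress the whole set again.
edited = list(elements)
edited[10] = b"LINE layer=2 x=10 y=99;"
raw_edited = b"".join(edited)
packed_edited = zlib.compress(raw_edited)

def common_prefix_length(a, b):
    """Number of leading bytes shared by two byte strings."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

savings = len(raw) - len(packed)
shared = common_prefix_length(packed, packed_edited)
```

The value of shared indicates how early the two compressed outputs diverge. Since the compressed bytes after that point generally no longer correspond, the edited element cannot simply be written back to its old position, and the compressed data is rewritten as a whole.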
Another consideration for the file storage approach is the typical requirement to allow multiple users to simultaneously access models when collaborating on a project. Typically, a first user creates an original model, which multiple users may view and/or edit depending on the level of access granted to the user. Since communication between users typically occurs over a computer network, the CAD system must ensure that changes to the model or items in the model are properly coordinated and the models are kept in a consistent state at all times. It is understood that a computer network refers to any type of computer network including but not limited to a local area network, wide area network (e.g. Intranet), and the Internet. The Internet includes but is not limited to the World Wide Web.
Computer systems used in many industries (such as ECO) demand efficient use of network resources and have the further requirement that the file system must perform in a multi-user environment. Existing file storage approaches therefore require substantial improvement in order to support increasing data storage requirements efficiently. Accordingly, there is a continuing need for an enhanced file storage approach, which efficiently accesses and controls large quantities of data in both single-user and multi-user environments. Moreover, there is a continuing need for an enhanced file format that permits access to and control of large quantities of data and improves the efficiency with which such data is transferred and stored.