A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention pertains to electronic data processing, and more particularly concerns reducing the overhead associated with small databases, especially for computers having limited capacity.
Conventional relational databases such as Microsoft® Access® and SQL Server are flexible and powerful. However, they are large programs, and they are optimized for large databases, concurrent access by multiple users, and ease of modifying data.
One of the consequences of this optimization is that the performance overhead of each database is relatively high. In particular, conventional databases must store a schema that is developed anew by the database program for each database. A relational database is made up of one or (usually) more tables. Each table is a set of records or rows having data in a defined set of columns. The data in each column is defined to be of a certain type, and may also have value restrictions, such as uniqueness or not null. Indexes can be defined on certain table columns. This information about the database is its schema. Database programs employ a data definition language whereby a user can define and modify the schema of a database. Because the data definition language (DDL) is typically the only facility for manipulating the schema of a database, a user (or, more likely, a database administrator) must create every new database essentially by hand. Again, for large databases having multiple users, this is not a problem.
Another consequence is that the storage overhead of each database is high. Optimization for concurrent usage, especially concurrent updating, by many users imposes many restrictions upon the form of the data. Constant read and write operations make many compression techniques too time-consuming. Large numbers of write operations relative to read operations impose restrictions on reorganizing the data for storage efficiency.
An increasing range of applications, however, could advantageously employ the power of the relational model for a large number of smaller databases, especially those normally accessed only by single users who mostly read the data, and write new data infrequently. For example, component libraries containing class and type definitions for programming systems need to be widely distributed, and seldom have their data modified. As another example, address books in hand-held personal computers and similar applications are the antithesis of the databases for which conventional relational database programs are designed. These applications have many copies of similarly defined small, single-user, read-mostly databases.
Today, many such applications employ one-off database programs of very limited power and flexibility in order to allow the use of compression and other techniques for increasing efficiency. Consequently, there is a need for processing large numbers of relatively small databases without incurring the storage penalties of conventional relational database management systems or the limitations of individually written database programs.
The present invention provides a data file format optimized for small, single-user, read-mostly databases. The file has a signature, a number of data streams, and a header identifying the data streams. Compressed relational tables are represented as fixed-width arrays. The invention also provides a table format having a fixed width for easy access to individual records. Any table having a designated primary key has its records ordered by the values of the primary key, so that a simple binary search can find any desired value. An additional non-persisted record number column may identify each record. Another non-data hash column can optionally facilitate hash chaining by containing the number of the next record in each hash chain, so that a hash vector need only contain the record number of the first record in each hash chain.
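The table format summarized above can be illustrated with a minimal sketch. It is not the disclosed file format: the record layout (a 4-byte key, a 12-byte name, and a 4-byte next-record field), the bucket count, the toy hash function, and all function names are assumptions chosen only for illustration. The sketch shows fixed-width records ordered by primary key, a record number implied by position rather than stored, a binary search on the key, and a hash column that holds the number of the next record in each chain so that the hash vector holds only the first record number of each chain.

```python
import struct

# Hypothetical fixed-width layout: key (uint32), name (12 bytes), next record in hash chain (uint32).
RECORD = struct.Struct("<I12sI")
NO_NEXT = 0        # record numbers are 1-based in this sketch, so 0 ends a chain
NUM_BUCKETS = 4    # assumed size of the hash vector


def bucket_of(name: bytes) -> int:
    """Toy hash function (an assumption): byte sum modulo the bucket count."""
    return sum(name) % NUM_BUCKETS


def build_table(rows):
    """Pack (key, name) rows into a fixed-width table ordered by primary key."""
    ordered = sorted(rows)
    hash_vector = [NO_NEXT] * NUM_BUCKETS
    next_col = [NO_NEXT] * len(ordered)
    # Walk backwards so each chain visits records in ascending record number.
    for idx in range(len(ordered) - 1, -1, -1):
        b = bucket_of(ordered[idx][1])
        next_col[idx] = hash_vector[b]   # link to the previous head of this chain
        hash_vector[b] = idx + 1         # the vector keeps only the first record number
    table = b"".join(
        RECORD.pack(key, name, nxt) for (key, name), nxt in zip(ordered, next_col)
    )
    return table, hash_vector


def read_record(table: bytes, recno: int):
    """The record number is implied by position; it is never stored."""
    return RECORD.unpack_from(table, (recno - 1) * RECORD.size)


def find_by_key(table: bytes, key: int):
    """Binary search over records ordered by primary key; returns a record number or None."""
    lo, hi = 1, len(table) // RECORD.size
    while lo <= hi:
        mid = (lo + hi) // 2
        mid_key = read_record(table, mid)[0]
        if mid_key == key:
            return mid
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def find_by_name(table: bytes, hash_vector, name: bytes):
    """Follow the hash chain that starts at the hash vector entry for this name."""
    recno = hash_vector[bucket_of(name)]
    while recno != NO_NEXT:
        _, stored, nxt = read_record(table, recno)
        if stored.rstrip(b"\0") == name:   # struct pads short names with NUL bytes
            return recno
        recno = nxt
    return None


rows = [(30, b"carol"), (10, b"alice"), (20, b"bob")]
table, hash_vector = build_table(rows)
print(find_by_key(table, 20))                    # record number of key 20
print(find_by_name(table, hash_vector, b"carol"))
```

Because every record has the same width, locating record n is a single multiplication, which is what makes both the binary search and the chain traversal inexpensive in this sketch.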