The present invention relates to the field of computer database software used for the storage and retrieval of information, and more particularly to an adaptive multi-dimensional database capable of storing and retrieving information of any type and format to and from both persistent and non-persistent storage.
For nearly as long as computers have been used for the calculation of results, they have been used for the storage and retrieval of information. This task is one for which computers are well suited; the structure of the computing hardware itself (specifically, a processor controlling persistent and non-persistent storage) provides an excellent platform for the storage and retrieval of information.
Current database technologies are typically characterized by one or the other of two predominant data storage methodologies. The first of these methodologies is known generally as "relational" storage. While there are many characteristics of relational databases, perhaps the most significant is the requirement that every piece of information stored must be of a predetermined length. At the time the file is constructed, the length of each data field to be stored per record is determined, and all records added from that point forward must adhere to those restrictions on a field-by-field basis. While this methodology is certainly pragmatic, it provides several opportunities for improvement. First, if a field is defined to be x in length, then exactly x characters must be stored there. If information exceeding x characters must be stored, that information must be divided among multiple fields, disassembled at the time of storage, and reassembled at the time of retrieval. Such manipulation provides no practical benefit, other than to overcome an inherent weakness in the technology. On the other hand, if fewer than x characters are to be stored, storage space is wasted as the information is padded out with a predefined neutral character in order to fit the x character minimum for the field.
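The fixed-length behavior described above can be illustrated with a minimal sketch. The field width of 10, the space used as the neutral padding character, and the function names are all assumptions chosen for illustration; they do not come from any particular relational product.

```python
FIELD_WIDTH = 10  # hypothetical width x, fixed when the file is created
PAD = " "         # hypothetical neutral padding character

def store_fixed(value, width=FIELD_WIDTH):
    """Split a value into width-sized fields, padding the last one."""
    chunks = [value[i:i + width] for i in range(0, len(value), width)] or [""]
    return [chunk.ljust(width, PAD) for chunk in chunks]

def load_fixed(fields):
    """Reassemble the fields and strip the trailing padding."""
    return "".join(fields).rstrip(PAD)

short_fields = store_fixed("Ann")               # one field, 7 padding characters
long_fields = store_fixed("Bartholomew Smith")  # 17 characters need two fields

wasted = sum(len(f) for f in short_fields) - len("Ann")
```

Here the short value wastes 7 of its 10 characters, while the long value must be disassembled into two fields and reassembled on retrieval. Note that stripping the padding also removes any trailing spaces that were part of the original data, a further side effect of padding in this simplified model.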
Another characteristic of a relational database is that it is inherently two-dimensional. A relational database is essentially a table organized into columns and rows, which provides a single data element at the intersection of each column and row. While this is an easily understood storage model, it is highly restrictive. If multiple values are required at each intersection, the database designer has two options: either 1) add new columns, or 2) add new rows for each of the multiple values. Neither option is optimal. If a new column is added, each row must then also contain that new column, regardless of whether or not multiple values exist for that row, since the size of the record is fixed and must be known prior to allocating the record. If, on the other hand, a new row is added for each multiple value, each row must then store duplicate information to maintain the relationships. In either case, storage is unnecessarily allocated, resulting in inefficient storage use.
To illustrate this, consider a relational database file containing parent and child names. For each parent, the file supports the storage of one child, such as the following:
The file structure presents a problem if a parent has more than one child. Using the relational model, the database designer has one of two options: either 1) add new columns for each child, or 2) repeat the parent information for each child. If the designer opts to add new columns, the number of columns to add must then be determined. However, this also presents a problem. If columns are defined, for example, for up to ten children, the file will not accommodate information for parents with more than ten children, and records for those who have fewer than ten children will still require the same amount of storage. If, on the other hand, the parent information is repeated by adding more rows, storage is wasted for each duplicated parent value. Obviously, neither option provides a complete solution.
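The trade-off between the two relational options can be made concrete with a small sketch. The sample families, the ten-column limit, and the function names are illustrative assumptions; the sketch simply counts empty cells under the first option and duplicated parent values under the second.

```python
families = {
    "Joe Smith": ["Sally Smith"],
    "Bob Thomas": ["Jim Thomas", "Jack Thomas"],
}
MAX_CHILDREN = 10  # hypothetical number of child columns chosen by the designer

def wide_rows(families, max_children):
    """Option 1: one row per parent, with a fixed number of child columns."""
    rows = []
    for parent, children in families.items():
        if len(children) > max_children:
            raise ValueError(parent + " has too many children for this file")
        padding = [None] * (max_children - len(children))
        rows.append([parent] + children + padding)
    return rows

def repeated_rows(families):
    """Option 2: one row per child, repeating the parent each time."""
    return [(parent, child)
            for parent, children in families.items()
            for child in children]

wide = wide_rows(families, MAX_CHILDREN)
empty_cells = sum(cell is None for row in wide for cell in row)

tall = repeated_rows(families)
duplicate_parents = len(tall) - len(families)
```

With this sample data the first option leaves 17 of 22 cells empty and fails outright for an eleventh child, while the second option stores one parent name twice. Both costs grow with the size of the file.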
The other predominant data storage methodology is known generally as "Multivalue" storage. Multivalue database systems (formerly known as Pick-compatible systems; named after Richard Pick, the commonly accepted founder of the Multivalue technology) overcome the weaknesses inherent in the relational storage model. First, information stored in a Multivalue file is dynamic; that is, each record grows and/or shrinks based on the information to be stored. Unlike a relational file, which requires each record to be discretely defined at the time of file creation, a Multivalue file has no such restrictions. Instead, a file can be created, fields of any length can be added to records, and textual records of any length or structure can be added to the file at any time.
Also unlike the relational methodology, the Multivalue methodology allows data to be multivalued; that is, multiple values can be stored at each intersection of column and row. Additionally, each value in a multivalued field can contain any number of subvalues, thus allowing the construction of a three-dimensional record of fields (more commonly known as attributes) containing multivalues, each multivalue potentially containing multiple subvalues.
Using the parent/child example from above, this information could be stored using the Multivalue methodology with much less overhead than with the relational methodology. Records stored in a Multivalue file might appear something like this:
Joe Smith^Sally Smith
Bob Thomas^Jim Thomas]Jack Thomas
Fields in a Multivalue record have no specific starting and ending positions, nor specific length, as do their relational counterparts. Instead, the record contains certain characters that are used to separate, or delimit, each field. In the above example, the caret represents an attribute mark, which separates individual fields in the record. In the second example, the bracket character represents a value mark, which separates the individual multivalues in the field. Though not shown in this example, a subvalue mark could also be used to further divide each multivalued field.
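A delimited record of this kind can be taken apart with a few lines of code. The sketch below uses the caret and bracket from the example as the attribute and value marks; the backslash used as the subvalue mark and the function names are assumptions made for illustration.

```python
AM = "^"    # attribute mark, as in the example above
VM = "]"    # value mark, as in the example above
SVM = "\\"  # subvalue mark (assumed character; not shown in the example)

def parse_record(record):
    """Return a three-level list: attributes -> values -> subvalues."""
    return [
        [value.split(SVM) for value in attribute.split(VM)]
        for attribute in record.split(AM)
    ]

def build_record(attributes):
    """Inverse of parse_record: join the nested lists with the marks."""
    return AM.join(
        VM.join(SVM.join(subvalues) for subvalues in values)
        for values in attributes
    )

parsed = parse_record("Bob Thomas^Jim Thomas]Jack Thomas")
# parsed == [[["Bob Thomas"]], [["Jim Thomas"], ["Jack Thomas"]]]
```

The nesting depth is fixed at three levels because there are exactly three kinds of mark; a fourth level would need a fourth reserved character.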
Unlike the relational methodology, which stores information in memory and on persistent storage using virtually identical structures, the Multivalue methodology uses hashing and framing techniques when organizing the information on persistent storage. Essentially, each Multivalue file is divided into a series of groups, each group comprising any number of frames, or areas of persistent storage. In order for a record to be written to a particular group, a primary key is hashed (used in a calculation) to determine the appropriate group where the record should be stored. This particular combination of techniques is very effective in providing quick access to any record in the file, with certain limitations, discussed below.
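The idea of hashing a primary key to select a group can be sketched as follows. The CRC-32 hash, the group count of 7, the record keys, and the use of an in-memory dictionary per group are all assumptions for illustration; actual Multivalue implementations use their own hash functions and store each group as a chain of frames on persistent storage.

```python
import zlib

GROUP_COUNT = 7  # hypothetical number of groups in the file

def group_for(key, group_count=GROUP_COUNT):
    """Hash a primary key to a group number in range(group_count)."""
    return zlib.crc32(key.encode("utf-8")) % group_count

# Each group is modelled as a dict; a real file would chain frames instead.
groups = [dict() for _ in range(GROUP_COUNT)]

def write_record(key, record):
    groups[group_for(key)][key] = record

def read_record(key):
    # Only one group is searched, however many records the file holds.
    return groups[group_for(key)].get(key)

write_record("1001", "Joe Smith^Sally Smith")
write_record("1002", "Bob Thomas^Jim Thomas]Jack Thomas")
```

Because the same key always hashes to the same group, a read only has to examine the records in that one group rather than the whole file.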
While the Multivalue storage and retrieval methodology has advantages over the relational method, it is also problematic. First and foremost, because certain characters are used to delimit the attributes, values, and subvalues in a record, these characters cannot be contained in the data itself without compromising the structure of the record. Second, because there are no predefined field widths (as there would be with the relational model), there is no way to calculate the position of a given field in the record. Therefore, to extract a field from a record, the record must be scanned from the top, counting delimiters until the desired field is reached. As a result, fields near the bottom of the record take longer to retrieve than fields near the top, and the degradation becomes more significant as the record grows.
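Both problems can be demonstrated with a short sketch. The function name and the sample record are assumptions for illustration; the function returns the requested attribute together with the number of characters it had to examine, which stands in for the retrieval cost.

```python
def extract_attribute(record, index, mark="^"):
    """Return (attribute, characters examined) for a 0-based attribute index."""
    start = 0
    seen = 0
    for pos, ch in enumerate(record):
        if ch == mark:
            if seen == index:
                return record[start:pos], pos + 1
            seen += 1
            start = pos + 1
    if seen == index:
        # The last attribute has no trailing mark, so the scan reaches the end.
        return record[start:], len(record)
    raise IndexError("record has no attribute " + str(index))

record = "^".join("field%d" % i for i in range(100))
first, cost_first = extract_attribute(record, 0)
last, cost_last = extract_attribute(record, 99)

# Delimiter collision: a caret inside the data silently adds an attribute.
collided = "^".join(["power", "2^10"]).split("^")
```

Retrieving the first attribute examines 7 characters, whereas retrieving the last examines the entire record. In the collision case, two stored attributes come back as three, because the caret in the data is indistinguishable from an attribute mark.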
Additionally, while framing and hashing work effectively to provide quick access to records in the file, all known implementations of the Multivalue methodology force a frame to be a certain length, such as 512 bytes, 1K, 2K, or 4K. This introduces an inefficiency that is common to relational databases: potentially significant excess storage can be required to fill a frame to maintain frame alignment in persistent storage.
Perhaps the most significant shortcoming applies to both methodologies. Both relational and Multivalue methodologies are designed for the storage of text and numbers, typically those in the ASCII character set. While implementations of both methodologies provide ways of accessing non-textual information (such as graphics or audio), neither methodology directly supports the storage of these types of highly dynamic and variant data forms inside of a 'normal' record.
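The cost of fixed frame lengths is simple arithmetic, sketched below. The function name and the 600-byte data size are assumptions for illustration, and the model is simplified: it ignores any per-frame header or link fields that a real implementation would also store.

```python
import math

def frame_waste(data_bytes, frame_bytes):
    """Bytes of padding needed to fill whole frames of a fixed length."""
    frames = max(1, math.ceil(data_bytes / frame_bytes))
    return frames * frame_bytes - data_bytes

waste_512 = frame_waste(600, 512)    # 2 frames: 1024 - 600 = 424
waste_4096 = frame_waste(600, 4096)  # 1 frame: 4096 - 600 = 3496
```

For 600 bytes of data, 512-byte frames leave 424 bytes unused and a single 4K frame leaves 3496 bytes unused. Only data that happens to be an exact multiple of the frame length wastes nothing.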
In addition, due to the increase of text-based computing, many applications now require that computers be able to recognize and manipulate text in different languages. UNICODE is a unified character encoding system for handling any type of international character that is maintained and developed by the UNICODE Consortium, and which is identical to the International Standards Organization's (ISO) Basic Multilingual Plane (BMP) of ISO 10646. Unlike the 8-bit ASCII character set, UNICODE provides a unified 16-bit encoding scheme which allows systems to exchange information unambiguously, thus making it easier for application designers to create applications that are multi-language aware. In addition, many applications operate on non-textual data such as audio or video data.
Although UNICODE may be used to solve many of the problems of storing multi-lingual characters, there are some applications in which it is desirable to store information of varying type. For example, many software companies internationalize their software; thus, they must support installations in multiple countries. In this scenario, the company may wish to store the customer's address both in English (using standard ASCII code) and in the customer's local language (for example, using UNICODE). However, to support multiple character types, today's database software must allocate enough memory to store the largest character type (e.g., 2 bytes for UNICODE). Thus, if the data is stored using a character type that requires less space than the largest character type (e.g., 1 byte for ASCII), memory space is unnecessarily wasted. Accordingly, a need exists for a database technology that allows any character or data type to be stored while still achieving optimal memory usage.
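The wasted space can be measured directly. In the sketch below, the sample addresses are invented for illustration, and UTF-16 (big-endian, without a byte-order mark) stands in for the 16-bit UNICODE encoding described above.

```python
english = "123 Main Street"  # hypothetical address, ASCII characters only
local = "東京都"               # hypothetical address in the customer's language

ascii_size = len(english.encode("ascii"))           # 1 byte per character
wide_size = len(english.encode("utf-16-be"))        # 2 bytes per character
local_size = len(local.encode("utf-16-be"))         # 2 bytes per character

# Storing every field at the widest character size doubles the English text.
wasted_bytes = wide_size - ascii_size
```

The 15-character English address needs 15 bytes as ASCII but 30 bytes when every character is allocated 2 bytes, so half of that allocation carries no information. The local-language address genuinely needs its 2 bytes per character.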
The present invention is a novel adaptive multidimensional database methodology that significantly improves over the prior art.
Just as the Multivalue methodology solves many of the concerns with the relational methodology, the invention solves the concerns with the Multivalue methodology. Rather than limit a record to merely two dimensions as in relational databases or to three dimensions as in Multivalue databases, the invention provides a methodology whereby a structure of unlimited dimensions can be constructed, maintained, and utilized. Additionally, there are no restrictions as to the type of information stored in each dimension of the invention's record. While textual and numeric values can certainly be stored, the invention can also support audio, graphic, and any other type of information without compromising the n-dimensional structure.
This ability to store literally any type or structure of information means that the invention inherently supports a type of textual information which is of increasing value in the global internet community: international character sets. The invention, unlike any existing data storage methodology, can store information encoded in any number of different character sets all within the same record.
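As a rough illustration of the idea only, and not of the invention's actual record layout (which this section does not describe), an n-dimensional record can be modelled as arbitrarily nested lists whose leaves pair a type tag with raw bytes. The type tags, sample values, and function name below are all hypothetical.

```python
def depth(node):
    """Number of nested list levels (dimensions) beneath and including node."""
    if isinstance(node, list):
        return 1 + max((depth(child) for child in node), default=0)
    return 0  # a leaf: (type tag, raw bytes)

record = [
    ("text/ascii", b"Bob Thomas"),
    [
        [
            ("text/utf-16-be", "Jim".encode("utf-16-be")),
            [("audio/raw", b"\x00\x5e\xff")],
        ],
    ],
]

dimensions = depth(record)
```

Because the structure is carried by nesting rather than by reserved delimiter characters, a leaf may contain any byte value, including the caret (0x5E) in the audio sample above, and leaves with different encodings can sit side by side in one record.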
Also, while the fundamental design of the invention's persistent storage algorithm is rooted in Multivalue concepts, the invention provides additional features. Unlike its Multivalue roots, the invention provides user-defined variable length frames, which overcome the problem of wasting persistent storage simply to 'fill space'. In addition, the invention provides multiple hashing algorithms to allow more control over the distribution of records in persistent storage. Additionally, an automatic gap consolidation feature of the invention provides a methodology for reusing areas of the file where records have previously been written and deleted. The invention therefore extends all of the functionality of both the relational and Multivalue methodologies, without the problems inherent to either.