For nearly as long as computers have been used for the calculation of results, they have been used for the storage and retrieval of information. This task is one for which computers are well suited; the structure of the computing hardware itself (specifically, a processor controlling persistent and non-persistent storage) provides an excellent platform for the storage and retrieval of information.
Current database technologies are typically characterized by one or the other of two predominant data storage methodologies. The first of these methodologies is known generally as “relational” storage. While there are many characteristics of relational databases, perhaps the most significant is the requirement that every piece of information stored must be of a predetermined length. At the time the file is constructed, the length of each data field to be stored per record is determined, and all records added from that point forward must adhere to those restrictions on a field-by-field basis. While this methodology is certainly pragmatic, it leaves room for improvement in several respects. First, if a field is defined to be x characters in length, then exactly x characters must be stored there. If information exceeding x characters must be stored, that information must be divided among multiple fields, disassembled at the time of storage, and reassembled at the time of retrieval. Such manipulation provides no practical benefit, other than to overcome an inherent weakness in the technology. On the other hand, if fewer than x characters are to be stored, storage space is wasted as the information is padded out with a predefined neutral character in order to fill the x characters allotted to the field.
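The fixed-length behavior described above can be illustrated with a minimal sketch. The field width of 10, the space used as the neutral padding character, and the function names are all assumptions chosen for illustration; they are not taken from any particular database product.

```python
FIELD_WIDTH = 10   # hypothetical value of "x": every field holds exactly 10 characters
PAD = " "          # hypothetical neutral padding character


def store_fixed(value, width=FIELD_WIDTH):
    """Disassemble a value into fixed-width fields, padding the last field."""
    chunks = [value[i:i + width] for i in range(0, len(value), width)] or [""]
    return [chunk.ljust(width, PAD) for chunk in chunks]


def load_fixed(fields):
    """Reassemble a value from its fields and strip the padding.

    Trailing pad characters that were part of the data are lost as well.
    """
    return "".join(fields).rstrip(PAD)


def wasted_characters(value, width=FIELD_WIDTH):
    """Count the pad characters stored alongside the value."""
    return sum(len(field) for field in store_fixed(value, width)) - len(value)


short_fields = store_fixed("Joe")                # one field, seven pad characters
long_fields = store_fixed("Bartholomew Smith")   # 17 characters need two fields
```

In this sketch the three-character value occupies a full ten-character field, and the seventeen-character value must be split across two fields and rejoined on retrieval, which is the disassembly and reassembly overhead the paragraph describes.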
Another characteristic of a relational database is that it is inherently two-dimensional. A relational database is essentially a table organized into columns and rows, which provides a single data element at the intersection of each column and row. While this is an easily understood storage model, it is highly restrictive. If multiple values are required at each intersection, the database designer has two options: either 1) add new columns, or 2) add new rows for each of the multiple values. Neither option is optimal. If a new column is added, each row must then also contain that new column, regardless of whether or not multiple values exist for that row, since the size of the record is fixed and must be known prior to allocating the record. If, on the other hand, a new row is added for each multiple value, each row must then store duplicate information to maintain the relationships. In either case, storage is unnecessarily allocated, resulting in inefficient storage use.
To illustrate this, consider a relational database file containing parent and child names. For each parent, the file supports the storage of one child, such as the following:
Parent        Child
Joe Smith     Sally Smith
Bob Thomas    Jim Thomas
The file structure presents a problem if a parent has more than one child. Using the relational model, the database designer has one of two options: either 1) add new columns for each child, or 2) repeat the parent information for each child. If the designer opts to add new columns, the number of columns to add must then be determined. However, this also presents a problem. If columns are defined, for example, for up to ten children, the file will not accommodate information for parents with more than ten children, and records for those who have fewer than ten children will still require the same amount of storage. If, on the other hand, the parent information is repeated by adding more rows, storage is wasted for each duplicated parent value. Neither option provides a complete solution.
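The storage cost of the two options can be compared with a short sketch. The 20-character field width is an assumption, the ten-child limit follows the example in the text, and the sample data and function names are hypothetical.

```python
FIELD_WIDTH = 20    # hypothetical fixed width of every name field
MAX_CHILDREN = 10   # column limit used in the example above

# Hypothetical sample data: one parent with one child, one with two.
FAMILIES = {
    "Joe Smith": ["Sally Smith"],
    "Bob Thomas": ["Jim Thomas", "Jack Thomas"],
}


def size_with_child_columns(families, max_children=MAX_CHILDREN, width=FIELD_WIDTH):
    """Option 1: one row per parent, with a column reserved for every possible child."""
    for parent, children in families.items():
        if len(children) > max_children:
            raise ValueError(f"{parent} has more than {max_children} children")
    return len(families) * (1 + max_children) * width


def size_with_repeated_rows(families, width=FIELD_WIDTH):
    """Option 2: one (parent, child) row per child, repeating the parent each time."""
    rows = sum(max(1, len(children)) for children in families.values())
    return rows * 2 * width


def size_of_data(families):
    """Characters of actual information, with no padding or duplication."""
    return sum(
        len(parent) + sum(len(child) for child in children)
        for parent, children in families.items()
    )
```

Under these assumptions the sample data contains 51 characters of information, yet option 1 allocates 440 characters and option 2 allocates 120, and option 1 additionally rejects any parent with an eleventh child.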
The other predominant data storage methodology is known generally as “Multivalue” storage. Multivalue database systems (formerly known as Pick©-compatible systems; named after Richard Pick, the commonly accepted founder of the Multivalue technology) overcome the weaknesses inherent in the relational storage model. First, information stored in a Multivalue file is dynamic—that is, each record grows and/or shrinks based on the information to be stored. Unlike a relational file, which requires each record to be discretely defined at the time of file creation, a Multivalue file has no such restrictions. Instead, a file can be created, fields of any length can be added to records and textual records of any length or structure can be added to the file at any time.
Also unlike the relational methodology, the Multivalue methodology allows data to be multivalued—that is, multiple values can be stored at each intersection of column and row. Additionally, each value in a multivalued field can contain any number of subvalues, thus allowing the construction of a three-dimensional record of fields (more commonly known as attributes) containing multivalues, each multivalue potentially containing multiple subvalues.
Using the parent/child example from above, this information could be stored using the Multivalue methodology with much less overhead than with the relational methodology. Records stored in a Multivalue file might appear something like this:
Joe Smith^Sally Smith
Bob Thomas^Jim Thomas]Jack Thomas
Fields in a Multivalue record have no specific starting and ending positions, nor specific length, as do their relational counterparts. Instead, the record contains certain characters that are used to separate, or delimit, each field. In the above example, the caret represents an attribute mark, which separates individual fields in the record. In the second record, the bracket character represents a value mark, which separates the individual multivalues in the field. Though not shown in this example, a subvalue mark could also be used to further divide each multivalue.
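The delimiter scheme described above can be sketched as a small parser. The caret and bracket are the display characters used in the example records; the backslash chosen here for the subvalue mark is an assumption, and actual systems reserve special non-printing characters for all three marks. The function name is hypothetical.

```python
AM = "^"     # attribute mark, as displayed in the example records
VM = "]"     # value mark, as displayed in the example records
SVM = "\\"   # subvalue mark; this display character is an assumption


def parse_record(record):
    """Split a record into attributes, each a list of values, each a list of subvalues."""
    return [
        [value.split(SVM) for value in attribute.split(VM)]
        for attribute in record.split(AM)
    ]


joe = parse_record("Joe Smith^Sally Smith")
bob = parse_record("Bob Thomas^Jim Thomas]Jack Thomas")

# Second attribute of the second record holds two multivalues.
children_of_bob = [value[0] for value in bob[1]]
```

The nested lists mirror the three-dimensional structure of attributes, multivalues, and subvalues; no field width appears anywhere in the sketch, since the marks alone define where each element begins and ends.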
Unlike the relational methodology, which stores information in memory and on persistent storage using virtually identical structures, the Multivalue methodology uses hashing and framing techniques when organizing the information on persistent storage. Essentially, each Multivalue file is divided into a series of groups, each group comprising any number of frames, or areas of persistent storage. In order for a record to be written to a particular group, a primary key is hashed (used in a calculation) to determine the appropriate group where the record should be stored. This particular combination of techniques is very effective in providing quick access to any record in the file, with certain limitations, discussed below.
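The hashing step can be illustrated with a simplified sketch. The CRC-32 checksum is used here only as a stand-in hash function; actual Multivalue implementations use their own hashing algorithms, and the class and function names are hypothetical. Frames are not modeled: each group is represented as an in-memory dictionary.

```python
import zlib


def group_for_key(key, group_count):
    """Hash a primary key to a group number (CRC-32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % group_count


class HashedFile:
    """A file divided into a fixed number of groups, addressed by hashed key."""

    def __init__(self, group_count):
        self.groups = [{} for _ in range(group_count)]

    def write(self, key, record):
        """Store the record in the group selected by hashing its key."""
        self.groups[group_for_key(key, len(self.groups))][key] = record

    def read(self, key):
        """Look only in the group selected by the hash; return None if absent."""
        return self.groups[group_for_key(key, len(self.groups))].get(key)


parents = HashedFile(group_count=7)
parents.write("1001", "Joe Smith^Sally Smith")
parents.write("1002", "Bob Thomas^Jim Thomas]Jack Thomas")
```

Because the same calculation is applied on both write and read, a lookup examines a single group rather than the whole file, which is the source of the quick access noted above.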
While the Multivalue storage and retrieval methodology has advantages over the relational method, it is also problematic. First and foremost, because certain characters are used to delimit the attributes, values, and subvalues in a record, these characters cannot be contained in the data itself without compromising the structure of the record. Second, because there are no predefined field widths (as there would be with the relational model), there is no way to calculate the position of a given field in the record. Therefore, to extract a field from a record, the record must be scanned from the top, counting delimiters until the desired field is reached. As a result, fields near the bottom of the record take longer to access than fields near the top, and the degradation becomes more significant as the record grows.
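Both problems can be demonstrated with a brief sketch. The caret stands in for the attribute mark, as in the example records; the function name and the returned character count are hypothetical additions used to make the scanning cost visible.

```python
AM = "^"   # attribute mark, as displayed in the example records


def extract_attribute(record, n):
    """Return attribute n (1-based) and the number of characters examined."""
    current = 1   # attribute currently being scanned
    start = 0     # position where the current attribute begins
    for pos, ch in enumerate(record):
        if ch == AM:
            if current == n:
                return record[start:pos], pos + 1
            current += 1
            start = pos + 1
    if current == n:
        return record[start:], len(record)
    raise IndexError(f"record has no attribute {n}")


record = AM.join(["first", "second", "third", "fourth"])
first, cost_first = extract_attribute(record, 1)
fourth, cost_fourth = extract_attribute(record, 4)

# Delimiter collision: data containing the mark is split into an extra field.
collided = AM.join(["Joe Smith", "5^3"]).split(AM)
```

Reaching the first attribute requires examining 6 characters, while reaching the fourth requires examining all 25, and the two-field record whose data contains a caret is read back as three fields.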
Additionally, while framing and hashing work effectively to provide quick access to records in the file, all known implementations of the Multivalue methodology force a frame to be a certain length, such as 512 bytes, 1K, 2K, or 4K. This introduces an inefficiency that is common to relational databases: potentially significant excess storage can be required to fill a frame to maintain frame alignment in persistent storage.
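The excess storage caused by fixed frame lengths is simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that the data in a group always occupies a whole number of frames.

```python
def frames_needed(data_size, frame_size):
    """Number of whole frames required to hold data_size bytes."""
    return -(-data_size // frame_size)   # ceiling division


def padding_bytes(data_size, frame_size):
    """Bytes allocated but unused in order to keep frames aligned."""
    return frames_needed(data_size, frame_size) * frame_size - data_size


# 513 bytes of data is one byte too many for a single 512-byte frame.
small_frames = padding_bytes(513, 512)
large_frames = padding_bytes(513, 4096)
```

With 512-byte frames, 513 bytes of data occupies two frames and leaves 511 bytes unused; with 4K frames the same data fits in one frame but leaves 3583 bytes unused.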
Perhaps the most significant shortcoming applies to both methodologies. Both relational and Multivalue methodologies are designed for the storage of text and numbers, typically those in the ASCII character set. While implementations of both methodologies provide ways of accessing non-textual information (such as graphics or audio), neither methodology directly supports the storage of these types of highly dynamic and variant data forms inside of a ‘normal’ record.
In addition, due to the increase of text-based computing, many applications now require that computers be able to recognize and manipulate text in different languages. UNICODE is a unified character encoding system for handling any type of international character; it is maintained and developed by the UNICODE Consortium and is identical to the Basic Multilingual Plane (BMP) of ISO 10646, a standard of the International Organization for Standardization (ISO). Unlike the ASCII character set, which is typically stored using 8 bits per character, UNICODE provides a unified 16-bit encoding scheme which allows systems to exchange information unambiguously, thus making it easier for application designers to create applications that are multi-language aware. In addition, many applications operate on non-textual data such as audio or video data.
Although UNICODE may be used to solve many of the problems of storing multi-lingual characters, there are some applications in which it is desirable to store information of varying type. For example, many software companies internationalize their software; thus, they must support installations in multiple countries. In this scenario, the company may wish to store the customer's address both in English (using standard ASCII code) and in the customer's local language (for example, using UNICODE). However, to support multiple character types, today's database software must allocate enough memory to store the largest character type (e.g., 2 bytes for UNICODE). Thus, if the data is stored using a character type that requires less space than the largest character type (e.g., 1 byte for ASCII), memory space is unnecessarily wasted. Accordingly, a need exists for a database technology that allows any character or data type to be stored while still achieving optimal memory usage.
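The wasted space can be quantified with a short sketch. UTF-16 (little-endian, without a byte order mark) is used here as a stand-in for the 16-bit UNICODE encoding described above; that choice, the sample addresses, and the function name are assumptions.

```python
def stored_size(text, bytes_per_char):
    """Bytes used when every character is allotted bytes_per_char bytes."""
    return len(text) * bytes_per_char


english = "12 Main Street"         # hypothetical address, ASCII only
local = "\u6771\u4eac\u90fd"       # hypothetical local-language text (three BMP characters)

ascii_bytes = len(english.encode("ascii"))       # one byte per character
wide_bytes = len(english.encode("utf-16-le"))    # two bytes per character
local_bytes = len(local.encode("utf-16-le"))     # two bytes per character

# Bytes wasted when the ASCII text is stored at the width of the largest type.
wasted = stored_size(english, 2) - ascii_bytes
```

Under these assumptions the English address needs 14 bytes as ASCII but occupies 28 bytes when every character is allotted two, so half of the space allocated to it carries no information.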