Natural language text processed by computers is most commonly represented as a sequence of binary values each of which represents a character or symbol used in the visual representation of the language.
The widely used ASCII coding represents each of the commonly used letters, numbers and punctuation characters in English language text using 7-bit binary values which represent printable characters and control codes (“control characters”) such as linefeed (LF) and escape (ESC) characters. To provide adequate capacity for the much larger number of characters and symbols used by other languages, the Universal Character Set (UCS or ISO 10646) and the Unicode character set (a standard promulgated by the Unicode Consortium), both of which define tens of thousands of characters, have been defined. These more robust codes typically encode characters and symbols into 16-bit codes and, as a consequence, require even more storage space per character.
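The storage difference between a 7-bit code and a 16-bit code can be illustrated with a minimal sketch. The sample string is an arbitrary assumption, and Python's built-in `ascii` and `utf-16-be` codecs stand in for the coding schemes discussed above.

```python
# Illustrative only: the sample text is an assumption, not taken from the source.
text = "Hello, world"

ascii_bytes = text.encode("ascii")      # one byte per character, high bit unused
utf16_bytes = text.encode("utf-16-be")  # one 16-bit code unit per character here

# Every ASCII value fits in 7 bits.
fits_in_7_bits = all(b < 128 for b in ascii_bytes)

print(len(text), len(ascii_bytes), len(utf16_bytes))  # 12 12 24
```

For text limited to the ASCII repertoire, the 16-bit form occupies twice the space of the one-byte-per-character form.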
Such character-based coding schemes for representing natural language text data are notoriously inefficient. Thus, when large files of character text are to be stored or transmitted, text compression programs are frequently used to make better use of storage space or reduce the bandwidth needed for transmission. Using compression algorithms such as Huffman coding and Ziv-Lempel coding, it is possible to reduce the size of text character files to a small fraction of their original size. The compressed text must, of course, be decompressed before it can be processed or displayed for human consumption.
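As a rough sketch of this trade-off, the following uses Python's standard `zlib` module, whose DEFLATE format combines a Ziv-Lempel (LZ77) stage with Huffman coding. The sample text is an assumption chosen to be highly repetitive, so the ratio it yields is not representative of ordinary prose.

```python
import zlib

# Assumed sample input: a short sentence repeated many times.
text = ("the quick brown fox jumps over the lazy dog. " * 200).encode("ascii")

compressed = zlib.compress(text, 9)        # LZ77 + Huffman coding (DEFLATE)
ratio = len(compressed) / len(text)

# The compressed bytes cannot be searched or displayed as text directly;
# the original character data has to be restored first.
restored = zlib.decompress(compressed)
hits = restored.count(b"lazy")

print(f"compressed to {ratio:.1%} of original size; found {hits} matches")
```

The search runs only on the restored bytes, which is the decompression cost the paragraph above describes.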
It is accordingly desirable to employ a mechanism for compressing text character data into a more compact form that need not be decompressed before it can be processed.
For efficient processing, non-character data, such as boolean values, integers, floating point numbers, logical values, and the like, are typically represented by typed data structures which can be efficiently manipulated by computing machines. Difficulties are encountered, however, when such typed data must be used outside the environment of the particular computer program that formats the data in the first instance. While a given program running on a given computer can efficiently store a rich mixture of typed data in one or more files, and can retrieve and process those files with great efficiency, it is often extraordinarily difficult for a different program running on the same computing machine to interpret and successfully process such a file of typed data. Moreover, it is extremely difficult even for skilled humans to interpret the binary data in such a typed file without knowing the structure and data types used by the source program. The problem is made considerably more difficult when an attempt is made to process typed data created by one processor using a different processor having a different machine architecture, or to share data between like processors operating under the control of different operating systems.
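One concrete instance of the cross-architecture problem is byte order. The sketch below, using Python's standard `struct` module, writes a 32-bit integer in little-endian order and then reads the same bytes back under a big-endian assumption. The value 1 is an arbitrary example.

```python
import struct

# A writer on a little-endian machine stores the integer 1 as 4 bytes.
raw = struct.pack("<i", 1)          # b'\x01\x00\x00\x00'

# A reader that knows the layout recovers the value.
same_layout = struct.unpack("<i", raw)[0]

# A reader that assumes big-endian order misinterprets the same bytes.
wrong_layout = struct.unpack(">i", raw)[0]

print(raw.hex(), same_layout, wrong_layout)  # 01000000 1 16777216
```

Nothing in the four bytes themselves indicates which interpretation is correct, which is why a typed file is hard to interpret without knowledge of the source program's structure.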
Complex data are frequently stored and manipulated using a relational database management system. In such a system, information is organized into relational tables, each of which comprises a two-dimensional set of named columns and an arbitrary number of rows. All of the entries in each column are of the same data type and drawn from the same domain. For addressing efficiency, a fixed amount of space is typically reserved for the data stored in a given column, permitting the location of each column within a row to be predicted. When variable length data is to be placed in a given column (e.g., the characters in a city name), enough space is reserved to provide room for even the largest city names, with the result that most of the reserved space is unused. It would accordingly be desirable to provide a storage and addressing mechanism whereby variable length data elements may be efficiently stored without wasted space yet rapidly located and processed by an addressing mechanism that does not require scanning to identify imbedded delimiters or the like.
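The space trade-off can be sketched by comparing a fixed-width column with a packed layout addressed through an offset table. The offset table is an assumption used here purely for illustration, as one conventional way to avoid delimiter scanning; it is not presented as the mechanism this document goes on to describe. The city names and the helper names `fixed_get` and `var_get` are likewise hypothetical.

```python
cities = ["Rome", "Boston", "San Francisco"]   # assumed sample data

# Fixed-width column: every entry reserves room for the longest name.
WIDTH = max(len(c) for c in cities)
fixed = b"".join(c.encode("ascii").ljust(WIDTH, b"\x00") for c in cities)

def fixed_get(i):
    """Locate entry i by arithmetic alone, then strip the padding."""
    return fixed[i * WIDTH:(i + 1) * WIDTH].rstrip(b"\x00").decode("ascii")

# Packed layout: names stored back to back, located via an offset table.
heap = b"".join(c.encode("ascii") for c in cities)
offsets = [0]
for c in cities:
    offsets.append(offsets[-1] + len(c))

def var_get(i):
    """Locate entry i from two offsets, with no scan for delimiters."""
    return heap[offsets[i]:offsets[i + 1]].decode("ascii")

print(len(fixed), len(heap))  # 39 23
```

In this sketch the packed form stores 23 bytes of character data against 39 for the fixed-width form, at the cost of maintaining the separate offset table.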
While relational database systems are well suited for storing structured business data, many types of data, such as the information in hierarchical tree structures as well as the nested elements typically found in XML data, do not map well into relational tables. Object-oriented databases which support inheritance have notable advantages over relational databases for storing hierarchical data. It would accordingly be desirable to employ a database architecture which can efficiently store and process information organized and stored in relational, hierarchical and object-oriented databases, and particularly to employ a database architecture which efficiently handles variable length character data.
Notwithstanding the inherent storage and addressing inefficiencies normally associated with variable length character-based data representations, character data is increasingly chosen for communications on the Internet between heterogeneous machines and different operating systems. For example, the File Transfer Protocol (FTP), the Telnet protocol, the Simple Mail Transfer Protocol (SMTP) and the Hypertext Transfer Protocol (HTTP) used on the Internet all communicate using character data. Data representations such as the Hypertext Markup Language (HTML) and the Extensible Markup Language (XML) are both variable-length, character-based representations that have been widely adopted for sharing data among different computers via the Internet.
These character-based markup languages have admirably served their designers' goals, as demonstrated by their widespread adoption. These goals did not, however, include an attempt to make the data representation compact. Indeed, as expressed in the XML specification, “terseness in XML markup is of minimal importance.” When markup tags and other character text are added to the original text data in order to express that data's structure and meaning, substantially more storage space is consumed.
Character-based markup data is typically processed by parsing the data, character by character, to separate the data from the markup and to thereafter process the character data in accordance with the meaning given to it by the markup. In many applications, before XML character data is processed, it is parsed into a sequence of nodes, each of which is represented in memory by a tree hierarchy of allocated node objects. The widely used Document Object Model (DOM) interface provides a standard structure for representing and operating on XML objects in the computer's memory. Unfortunately, the parsing of character data and the allocation of memory for the nodes of a DOM's object tree structure consume substantial processing and memory resources, further exacerbating the coding inefficiency of the character-based data and markup representation.
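The parse-then-allocate pattern can be seen in a small sketch using Python's standard `xml.dom.minidom` module. The XML fragment and the helper names `count_nodes` and `text_length` are assumptions made for illustration.

```python
from xml.dom import minidom

# Assumed sample document: 49 characters carrying 8 characters of data.
doc_text = "<city><name>Boston</name><state>MA</state></city>"

# Parsing scans the characters and allocates one object per node.
dom = minidom.parseString(doc_text)

def count_nodes(node):
    """Count this node and all of its descendants in the DOM tree."""
    return 1 + sum(count_nodes(child) for child in node.childNodes)

def text_length(node):
    """Total number of characters held in text nodes under this node."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data)
    return sum(text_length(child) for child in node.childNodes)

nodes = count_nodes(dom.documentElement)
payload = text_length(dom.documentElement)

print(len(doc_text), payload, nodes)  # 49 8 5
```

Here eight characters of actual data require 49 characters of markup text and five allocated node objects before any processing of the data can begin.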