Within relational database systems, data tends to be organized in a highly structured manner. Specifically, data is often stored in relational tables, where each row represents a set of related data (e.g. information about a particular person), and each column corresponds to a particular attribute (e.g. name, address, age, etc.). However, using relational tables to store and access data that has less structure is less straightforward.
It is becoming increasingly important to be able to store and efficiently access data that is less structured (referred to as “semi-structured data”). Semi-structured data, such as XML, JSON etc. are the primary data model to support schema-less development paradigm where users do not have to specify schema in order to store data and query data.
Because a schema is not required ahead of time to store semi-structured data, users have freedom to store whatever original forms of the semi-structured data on disk. Unfortunately, this freedom comes at the price of performance. Query performance over the semi-structured data is reduced because some of the original forms of the semi-structure data are very inefficient for query processing. For example, storing textual format of XML or JSON on disk is simple and convenient for users, but very inefficient to answer queries over textual storage due to the expense of text parsing.
An example of a semi-structured document is illustrated in FIG. 1. Referring to FIG. 1, the depicted semi-structured hierarchical data object 100 is a JSON document. While JSON is used as an example of semi-structured data, the techniques described herein are applicable to any form of semi-structured data. In FIG. 1, a JSON object may be represented by data that is enclosed by the curly braces “{“and”}”. Thus, a JSON object may be a person and/or a location described in FIG. 1.
Hierarchical data object 100 may include field-names that are associated with field values. In the example of FIG. 1, the field-names include “person”, “id”, “birthdate”, “friends”, “location”, “city”, and “zip”. For JSON objects, field-names may precede a colon in a name-value pair. In the example of FIG. 1, the field values include ‘123’, ‘john’, ‘1970-01-02’, ‘456’, ‘Mary’, ‘1968-04-03’, ‘789’, ‘Henry’, ‘1972-03-03’, ‘Oakland’, and ‘94403’. For JSON objects, field values may be anything other than a field-name or a grouping symbol that follows a colon in a name-value pair. Field values may include a null value, a Boolean value, a string value, a number value, a float value, a double precision value, a date value, a timestamp value, a timestamp with a time zone value, a year-month interval, a day-second interval, a signed binary integer, and/or any other data type. Each field-name may be associated with one or more field values in hierarchical data object 100. For example, “person” may be associated with ‘456’, ‘Mary’, and ‘1968-04-03’.
Caching semi-structured data in volatile memory yields some query performance improvement because the need to access persistent storage during query processing is reduced. For example, if the text of object 100 is loaded into volatile memory before receiving a query that targets object 100, then the query can execute faster simply because no additional disk I/Os are necessary. However, since what is pre-loaded into volatile memory still reflects the original storage form of the semi-structured data, query performance still suffers due to the need to parse or otherwise process the semi-structured data during query execution.
One approach to address the query processing inefficiencies that results from the format of the semi-structured data is to convert the semi-structured data into another format prior to storing the semi-structured data in the database. For example, various database system vendors have developed proprietary binary formats to use to store semi-structure data within their respective database systems. For example, Oracle, IBM, and Microsoft SQL server have required users to store XML data as XMLType, which internally translates the XML into a vendor specific binary format for vendor-specific efficient query processing over that format. Similarly, MongoDB requires users to store JSON data as BSON for efficient JSON query processing.
While the strategy of converting the semi-structured data into a new format works for semi-structured data that is to be stored inside the database and fully controlled by the corresponding database system, for data stored in external file system and mounted to database as external table, the database-specific binary storage format strategy does not work. Specifically, under these circumstances, the database system has no control of the data stored outside the database. The database system cannot pro-actively convert the data to a database-specific binary storage format because that would render the data unusable by whatever other external systems need to access the data.
Based on the foregoing, it is clearly desirable to have a technique to speed up query processing, by a database system, over semi-structured data whose primary storage format may not be fully controllable by the database system, particularly when the primary storage format is not efficient for query processing.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.