The use of hierarchical mark-up languages for structuring and describing data has found wide acceptance in the computer industry. An example of a mark-up language is the Extensible Mark-up Language (XML). Another example is JavaScript Object Notation (JSON).
Data structured using a hierarchical mark-up language is composed of nodes. Nodes are delimited by a set of delimiters that mark the nodes, and may be tagged with names, referred to herein as tag names. In general, the syntax of a hierarchical mark-up language specifies that tag names are embedded, juxtaposed, or otherwise syntactically associated with the delimiters that delimit nodes. There may be two kinds of nodes, leaf nodes and non-leaf nodes.
XML
For XML data, a node is delimited by start and end tags that comprise tag names. For example, in the following XML fragment,
<ZIPCODE>  <CODE>95125</CODE>   <CITY>SAN JOSE</CITY>  <STATE>CA</STATE>  </ZIPCODE>
the start tag <ZIPCODE> and the end tag </ZIPCODE> delimit a node having the name ZIPCODE.
FIG. 1A depicts node tree 101, which represents the above XML fragment. Non-leaf nodes are depicted with double-line borders, while leaf nodes are depicted with single-line borders. In XML, a non-leaf node corresponds to an element node and a leaf node corresponds to a data node. The element nodes in the node tree are referred to herein by the node's name, which is the name of the element represented by the node. For convenience of exposition, the data nodes are referred to by the value the data nodes represent.
The data between the corresponding tags is referred to as a node's content. For a data node, the content can be a scalar value (e.g. integer, text string, date).
A non-leaf node, such as an element node, contains one or more other nodes. For an element node, the content can be a data node and/or one or more element nodes.
ZIPCODE is an element node that contains child nodes CODE, CITY, and STATE, which are also element nodes. Data nodes 95125, SAN JOSE, and CA are data nodes for element nodes CODE, CITY, and STATE, respectively.
The nodes contained by a particular node are referred to herein as descendant nodes of the particular node. CODE, CITY, and STATE are descendant nodes of ZIPCODE. 95125 is a descendant node of CODE and ZIPCODE, SAN JOSE is a descendant node of CITY and ZIPCODE, and CA is a descendant node of STATE and ZIPCODE.
A non-leaf node thus forms a hierarchy of nodes with multiple levels, the non-leaf node being at the top level. A node at each level is linked to one or more nodes at a different level. Any given node at a level below the top level is a child node of a parent node at the level immediately above the given node. Nodes having the same parent are sibling nodes. A parent node may have multiple child nodes. A node that has no parent node linked to it is a root node. A node that has no child nodes is a leaf node. A node that has one or more descendant nodes is a non-leaf node.
For example, in non-leaf node ZIPCODE, node ZIPCODE is a root node at the top level. Nodes 95125, SAN JOSE, and CA are leaf nodes.
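The node terminology above can be illustrated with a minimal sketch that parses the ZIPCODE fragment using Python's standard xml.etree.ElementTree module. The helper name descendants is hypothetical and is introduced only for illustration. Note that ElementTree exposes a data node as the text of its enclosing element rather than as a separate node object, so the sketch reports data nodes by their values.

```python
import xml.etree.ElementTree as ET

XML_FRAGMENT = (
    "<ZIPCODE>"
    "<CODE>95125</CODE>"
    "<CITY>SAN JOSE</CITY>"
    "<STATE>CA</STATE>"
    "</ZIPCODE>"
)


def descendants(element):
    """List descendant nodes in document order.

    Element nodes are reported by name, data nodes by value.
    (Hypothetical helper, for illustration only.)
    """
    found = []
    text = (element.text or "").strip()
    if text:
        found.append(text)  # data node contained by this element
    for child in element:
        found.append(child.tag)  # child element node
        found.extend(descendants(child))
    return found


root = ET.fromstring(XML_FRAGMENT)

# ZIPCODE has no parent, so it is the root node.
root_name = root.tag

# Every node contained by ZIPCODE, at any level.
zip_descendants = descendants(root)

# Leaf nodes: the data nodes held by elements that have no child elements.
leaf_nodes = [e.text for e in root.iter() if len(e) == 0 and e.text]

# Sibling nodes: element nodes that share the parent ZIPCODE.
siblings = [child.tag for child in root]
```

Under these assumptions, CODE, CITY, and STATE are reported as siblings, and 95125, SAN JOSE, and CA are reported as leaf nodes.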
The term “hierarchical data object” is used herein to refer to a sequence of one or more nodes, at least one of the nodes in the sequence being a non-leaf node having a child node. An XML document is an example of a hierarchical data object. Another example is a JSON object.
JSON
JSON is a lightweight hierarchical mark-up language. A JSON object comprises a collection of fields, each of which is a field name/value pair. A field name is in effect a tag name for a node in a JSON object. The name of the field is separated by a colon from the field's value. A JSON value may be:
An object, which is a list of fields enclosed in braces “{ }” and separated within the braces by commas.
An array, which is a list of comma separated JSON values enclosed in square brackets “[ ]”.
An atom, which is a string, number, true, false, or null.
The following JSON object J is used to illustrate JSON.
{  “FIRSTNAME”: “JACK”,  “LASTNAME”: “SMITH”,  “ADDRESS”: {   “STREETADDRESS”: “101 FIRST STREET”,   “CITY”: “SAN JOSE”,   “STATE”: “CA”,   “POSTALCODE”: “95110”  },  “PHONENUMBERS”: [   “408 555-1234”,   “650 123-5555”  ]}
Object J contains fields FIRSTNAME, LASTNAME, ADDRESS, STREETADDRESS, CITY, STATE, POSTALCODE, and PHONENUMBERS. FIRSTNAME and LASTNAME have atom string values “JACK” and “SMITH”, respectively. ADDRESS is an object containing member fields STREETADDRESS, CITY, STATE, and POSTALCODE, which have atom string values “101 FIRST STREET”, “SAN JOSE”, “CA”, and “95110”, respectively. PHONENUMBERS is an array comprising atom values “408 555-1234” and “650 123-5555”.
Each field in a JSON object is a non-leaf node and the name of the non-leaf node is the field name. Each array and object is a non-leaf node. Each data node corresponds to an atom value.
FIG. 1B depicts JSON object J as hierarchical data object 101 comprising nodes as described below. Referring to FIG. 1B, there are four root nodes, which are FIRSTNAME, LASTNAME, ADDRESS, and PHONENUMBERS. Each of FIRSTNAME, LASTNAME, ADDRESS, and PHONENUMBERS is a field node. ADDRESS has a descendant object node. From the object node four descendant field nodes descend, which are STREETADDRESS, CITY, STATE, and POSTALCODE.
Nodes FIRSTNAME, LASTNAME, STREETADDRESS, CITY, STATE, and POSTALCODE have descendant data nodes representing atom string values “JACK”, “SMITH”, “101 FIRST STREET”, “SAN JOSE”, “CA”, “95110”, respectively.
PHONENUMBERS has a descendant array node. The array node has two descendant data nodes representing atom string values “408 555-1234” and “650 123-5555”.
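The node structure described for object J can be sketched as a short traversal using Python's standard json module. The function names walk_fields and walk_value, and the (depth, kind, label) tuple layout, are hypothetical and serve only to illustrate how field nodes, object nodes, array nodes, and data nodes relate. Following the description above, the top-level fields are treated as root nodes.

```python
import json

J = json.loads("""
{
  "FIRSTNAME": "JACK",
  "LASTNAME": "SMITH",
  "ADDRESS": {
    "STREETADDRESS": "101 FIRST STREET",
    "CITY": "SAN JOSE",
    "STATE": "CA",
    "POSTALCODE": "95110"
  },
  "PHONENUMBERS": ["408 555-1234", "650 123-5555"]
}
""")


def walk_value(value, depth):
    """Yield (depth, kind, label) for a JSON value and its descendants."""
    if isinstance(value, dict):
        yield depth, "object", None  # object node (non-leaf)
        yield from walk_fields(value, depth + 1)
    elif isinstance(value, list):
        yield depth, "array", None  # array node (non-leaf)
        for item in value:
            yield from walk_value(item, depth + 1)
    else:
        yield depth, "data", value  # data node holding an atom value


def walk_fields(obj, depth=0):
    """Yield a field node for each field of obj, then the field's descendants."""
    for name, value in obj.items():
        yield depth, "field", name  # field node, named by the field name
        yield from walk_value(value, depth + 1)


nodes = list(walk_fields(J))
root_nodes = [label for depth, kind, label in nodes if depth == 0]
data_nodes = [label for depth, kind, label in nodes if kind == "data"]
non_leaf_kinds = {kind for depth, kind, label in nodes if kind != "data"}
```

In this sketch the eight data nodes are the only leaf nodes; every field, object, and array node is a non-leaf node.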
Schema-Based Approaches
Efficient querying is critically important to accessing hierarchical data objects. Effective approaches for querying hierarchical data objects include schema-based approaches.
One schema-based approach is the schema-based relational-storage approach. In this approach, collections of hierarchical data objects (“collection members”) are stored as schema instances within tables of a database managed by a Database Management System (DBMS). This approach leverages the power of object-relational DBMSs to index and query data. In general, the schema-based relational-storage approach involves registering a schema with a DBMS, which generates the tables and columns needed to store the attributes (e.g. elements, fields) defined by the schema.
Storing a collection of hierarchically marked-up documents or objects as instances of a schema may require developing a schema that defines many if not all attributes found in any member of a collection. Some or many of the attributes defined by the schema may only occur in a relatively small subset of the collection members. The number of attributes defined by a schema may be many times larger than the number of attributes of many collection members. Many attributes may be sparsely populated. Managing schemas with a relatively large number of attributes, some or many of which may be sparsely populated, can be burdensome to a DBMS and administrators and users of the DBMS.
Schema-Less Approaches
To avoid the pitfalls of schema-based approaches, schema-less approaches may be used. One schema-less approach is the partial projection approach. Under the partial projection approach, a set of commonly queried attributes of the collection is projected and copied into columns of additional tables; these tables exist to support DBMS indexing of the columns using, for example, binary tree or bitmap indexing. The approach works best when the query workload for the collection is known to follow a pattern, so that the commonly queried attributes can be determined. The approach works less well when the workload is ad-hoc and the number of attributes to project cannot easily be constrained to a relatively small number. In addition, many of the unprojected attributes must be searched using text search or functional evaluation against collection members.
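The partial projection approach can be sketched as follows, using Python's standard sqlite3 module with SQLite standing in for the DBMS. This is an illustration under stated assumptions rather than a description of any particular product: the table names docs and docs_projection, the choice of CITY as the projected attribute, the function names, and the second sample document are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Collection members are stored whole, with no schema for their contents.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
# Additional table holding only the projected (commonly queried) attribute.
conn.execute(
    "CREATE TABLE docs_projection ("
    "doc_id INTEGER PRIMARY KEY REFERENCES docs(id), city TEXT)"
)
conn.execute("CREATE INDEX docs_projection_city ON docs_projection (city)")


def insert_document(doc):
    """Store a collection member and copy its projected attribute."""
    cur = conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(doc),))
    city = doc.get("ADDRESS", {}).get("CITY")  # None when the attribute is absent
    conn.execute(
        "INSERT INTO docs_projection (doc_id, city) VALUES (?, ?)",
        (cur.lastrowid, city),
    )
    return cur.lastrowid


def find_by_city(city):
    """Query a projected attribute through the indexed column."""
    rows = conn.execute(
        "SELECT d.body FROM docs d "
        "JOIN docs_projection p ON p.doc_id = d.id "
        "WHERE p.city = ? ORDER BY d.id",
        (city,),
    )
    return [json.loads(body) for (body,) in rows]


def find_by_unprojected(field, value):
    """Query an unprojected top-level field by evaluating every member."""
    matches = []
    for (body,) in conn.execute("SELECT body FROM docs ORDER BY id"):
        doc = json.loads(body)
        if doc.get(field) == value:
            matches.append(doc)
    return matches


# Sample data: object J from the text plus one hypothetical member.
insert_document({
    "FIRSTNAME": "JACK",
    "LASTNAME": "SMITH",
    "ADDRESS": {"STREETADDRESS": "101 FIRST STREET", "CITY": "SAN JOSE",
                "STATE": "CA", "POSTALCODE": "95110"},
    "PHONENUMBERS": ["408 555-1234", "650 123-5555"],
})
insert_document({"FIRSTNAME": "JILL", "LASTNAME": "SMITH",
                 "ADDRESS": {"CITY": "PALO ALTO"}})

san_jose_members = find_by_city("SAN JOSE")
smith_members = find_by_unprojected("LASTNAME", "SMITH")
```

The contrast between find_by_city and find_by_unprojected mirrors the trade-off described above: a projected attribute can be looked up through an indexed column, whereas an unprojected attribute requires evaluating each collection member.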
Another schema-less approach is the inverted index approach. An inverted index is used to index values of a collection. The inverted index approach provides efficient ad-hoc querying based on key words.
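A keyword-based inverted index of the kind mentioned above can be sketched in a few lines of Python. The tokenization rule (upper-casing and splitting on whitespace), the function names, and the second sample document are assumptions made for illustration only.

```python
from collections import defaultdict


def keywords(value):
    """Yield the key words found in the atom values of a JSON value."""
    if isinstance(value, dict):
        for child in value.values():
            yield from keywords(child)
    elif isinstance(value, list):
        for item in value:
            yield from keywords(item)
    elif isinstance(value, str):
        yield from value.upper().split()
    elif value is not None:
        yield str(value).upper()  # numbers, true, and false


def build_inverted_index(collection):
    """Map each key word to the set of ids of the members that contain it."""
    index = defaultdict(set)
    for doc_id, doc in collection.items():
        for word in keywords(doc):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return ids of the members that contain every given key word."""
    postings = [index.get(word.upper(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


collection = {
    # Object J from the text.
    1: {
        "FIRSTNAME": "JACK",
        "LASTNAME": "SMITH",
        "ADDRESS": {"STREETADDRESS": "101 FIRST STREET", "CITY": "SAN JOSE",
                    "STATE": "CA", "POSTALCODE": "95110"},
        "PHONENUMBERS": ["408 555-1234", "650 123-5555"],
    },
    # Hypothetical second member, added for illustration.
    2: {"FIRSTNAME": "JOSE", "ADDRESS": {"CITY": "SAN FRANCISCO"}},
}

index = build_inverted_index(collection)
jack_hits = search(index, "JACK")
san_jose_hits = search(index, "SAN", "JOSE")
```

Because this sketch records only values, and no field names or paths, the search for SAN and JOSE matches both members: the second member contains JOSE as a first name and SAN as part of its city.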
Querying Based on Structural Features
When querying hierarchically marked-up data, it is important to be able to specify structural features of the data to return. Structural features of hierarchically marked-up data include element containment, field containment, and path-based and hierarchical relationships among nodes in hierarchically marked-up data. In general, schema-based approaches provide more efficient ad-hoc querying based on structural features.
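To make the notion of structural features concrete, the following minimal sketch evaluates field containment and path-based predicates directly against a JSON object. It is not the indexing approach described herein; it simply evaluates each predicate by traversing the object. The function names and the representation of a path as a tuple of field names are hypothetical.

```python
def paths(value, prefix=()):
    """Yield the path of field names leading to every field in a JSON value."""
    if isinstance(value, dict):
        for name, child in value.items():
            path = prefix + (name,)
            yield path
            yield from paths(child, path)
    elif isinstance(value, list):
        # Array elements are reached through the path of the enclosing field.
        for item in value:
            yield from paths(item, prefix)


def contains_field(doc, name):
    """Field containment: does a field with this name occur at any level?"""
    return any(path[-1] == name for path in paths(doc))


def contains_path(doc, *path):
    """Path-based containment: does this exact path from the top level occur?"""
    return tuple(path) in set(paths(doc))


# Object J from the text.
J = {
    "FIRSTNAME": "JACK",
    "LASTNAME": "SMITH",
    "ADDRESS": {"STREETADDRESS": "101 FIRST STREET", "CITY": "SAN JOSE",
                "STATE": "CA", "POSTALCODE": "95110"},
    "PHONENUMBERS": ["408 555-1234", "650 123-5555"],
}

has_city_field = contains_field(J, "CITY")
has_address_city_path = contains_path(J, "ADDRESS", "CITY")
has_top_level_city = contains_path(J, "CITY")
```

For object J, CITY is contained as a field and is reachable by the path ADDRESS, CITY, but it does not occur as a top-level field, so the last predicate is false.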
Described herein is a schema-less indexing approach for efficiently querying hierarchically marked-up data based on structural features of the hierarchically marked-up data.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualifies as prior art merely by virtue of their inclusion in this section.