Semi-structured data format, such as JSON and XML, is extensively used for storage of large volume data sets, particularly because the semi-structured data can be conveniently stored in one or more documents without losing the hierarchical information of the data. The hierarchical information of the data may include information relationships and arrangement of portions of data in reference to each other. For example, semi-structured data can be generally represented as a hierarchy of nodes, each node containing a portion of the data. In such a hierarchy, each node containing a portion of semi-structured data has hierarchical relationship with one or more other nodes in the hierarchy: the node may be a parent, a sibling or a child of another node in the semi-structured data.
Despite the convenience of storing large data sets of semi-structured data in large amount of documents, querying such large amounts of documents is challenging. In particular, a query may request data that matches query defined value constraints and is located in a particular level/branch/node of the hierarchy. To satisfy the query, each document matching the query criteria needs to be navigated according to the hierarchy to match for the data requested. Such operations may be very expensive, especially considering that thousands, if not hundreds of thousands, of documents may need to be traversed. The traversals of hierarchies are particularly costly operations, if the schemas of the hierarchies involved are unknown. To query unknown or “implicit” schema data, assumptions about the hierarchy of the semi-structured data must be made, and the resulting returned data is only correct, if the semi-structured data indeed conformed to the assumed hierarchy.
One solution is to normalize semi-structured data and store the data in relational tables of a relational database to leverage the powerful query capabilities of a database management system (DBMS). However, even when data is stored in a relational manner, the query performance is not optimal. For data sets that have one to many cardinality relationships, the data sets of semi-structured data are stored separately in a relational database to avoid repetition. For example, for customer data that contains addresses and telephone numbers of customers in addition to names, each customer name may correspond to multiple addresses as well as to multiple telephone numbers. Thus, if all the customer data was to be stored in a single table of a relational database in a DBMS, to preserve the relationship between customer names and addresses and telephone numbers, each customer name would be repeated for each address and then repeated again for each telephone number. Storing addresses and phone numbers in separate tables, with foreign key relationships to customer names in a separate table, avoids the repetition of the customer names. The term “normalization/normalizing” refers to the process of splitting data into data sets according to cardinality relationships and datatypes of the datasets.
While storage efficiency is improved by normalization of semi-structured data by avoiding repetition, query performance, yet improved compared to document storage, may still be disparaging. Queries that require data from multiple datasets need to join the datasets according to their respective relationships. For example, if a query is directed to select customer names along with addresses and phone numbers, then the query execution includes joining the customer name table with the customer address table and with the customer phone number table to produce the resulting data set. Such join operations may consume substantial resources especially when often executed and when the data set is large.
One solution is to produce a de-normalized “view” of data that is physically stored in a normalized manner. Instead of storing another copy of data in a de-normalized manner, the DBMS stores the pre-computed query that produces the de-normalized view. When a user query is received that references the view, the stored pre-computed query for the view is executed to produce the resulting data set for the view. Although the stored query is pre-computed, the DBMS still has to retrieve the data from the physical storage and arrange it together for the resulting data set. Therefore, even querying a view consumes more compute resources than a retrieval of data from a single table in a database.