Relational database systems have been used to store new types of data, such as Resource Description Framework (RDF) graphs or JavaScript Object Notation (JSON) documents. A key characteristic of these new data types is that they are loosely structured and heterogeneous. Unlike relational data, which fit into a predetermined set of tables, these new data types have schemas that are not known a priori and that evolve over time. Many relational representations have been developed to accommodate these new data types and to cope with their dynamic schemas. For example, triple-store relations, which store all triples in a single relation, have been developed for RDF data, and entity-type stores have been developed for the same purpose.
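As an illustration of the single-relation layout described above, the following is a minimal sketch using Python's standard sqlite3 module. The table name `triples`, its column names, the sample data, and the helper functions `build_triple_store` and `predicates_of` are all hypothetical and chosen only for this example; they are not taken from any particular triple-store system.

```python
import sqlite3

# Hypothetical sample data: every fact is one (subject, predicate, object) row,
# regardless of what kind of entity or value it describes.
SAMPLE_TRIPLES = [
    ("person:1", "name", "Alice"),
    ("person:1", "age", "34"),
    ("person:1", "worksFor", "org:1"),
    ("org:1", "name", "Acme"),
    ("org:1", "founded", "1999"),
]


def build_triple_store(triples):
    """Load all triples into one relation in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn


def predicates_of(conn, subject):
    """Return the predicates recorded for a subject, sorted by name."""
    rows = conn.execute(
        "SELECT predicate FROM triples WHERE subject = ? ORDER BY predicate",
        (subject,),
    )
    return [row[0] for row in rows]


conn = build_triple_store(SAMPLE_TRIPLES)
print(predicates_of(conn, "person:1"))
print(predicates_of(conn, "org:1"))
```

Note that the two subjects carry different sets of predicates, and that a new predicate can be added by inserting a row rather than altering the table, which is how such a layout tolerates schemas that are not fixed in advance.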
A key consideration when evaluating a relational representation for these new data types is performance during query evaluation. Existing representations achieve varying levels of performance across query workloads. The most straightforward way to improve the performance of problematic queries has been to create additional indexes over the underlying tables. Unfortunately, while creating indexes is a foolproof approach to improving query performance over relational data, its success is questionable when the indexes are created over representations of unstructured data. Indexes inherently work well on relational data because, for a given column, all the indexed data in that column are of the same type.
Unfortunately, when these new data types are represented in relational databases, a column seldom stores data that are similar. More often than not, columns are overloaded with multiple data types to save space and simplify the schema design. Indexes then fail to speed up queries efficiently, since the data indexed in, and retrieved from, a particular column are often irrelevant to the query at hand. The situation is even worse considering that the most effective indexes in relational databases, those that offer the greatest performance gains, are the ones in which the index key is unique for each record and the data on disk are clustered according to the indexed value. The new data types satisfy neither of these requirements. The indexed values are seldom unique, and it is impossible to settle on a single preferred clustering strategy for the indexed data, because the data can be accessed in a multitude of ways given the plurality of predicates available and the heterogeneity of the data elements having these predicates. Therefore, indexing techniques are desired that can be used with the new types of data currently stored in relational stores to improve query performance.
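The effect of an overloaded column on an index can be sketched with a small example, again using Python's standard sqlite3 module. The schema, the index name `idx_object`, the sample rows, and the helper `index_hits` are hypothetical. The first count models the rows that a single-column index on `object` would return for a key; the sketch does not inspect the query plan SQLite actually chooses.

```python
import sqlite3

# Hypothetical rows in which the overloaded "object" column holds a founding
# year, an employee count, and a postal code that share the same stored value.
ROWS = [
    ("org:1", "founded", "1999"),
    ("org:2", "employees", "1999"),
    ("addr:7", "postalCode", "1999"),
    ("org:3", "founded", "2005"),
]


def index_hits(rows, value, predicate):
    """Return (rows matching the key, rows also relevant to the predicate)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    # Index over the overloaded column only; its keys are not unique.
    conn.execute("CREATE INDEX idx_object ON triples (object)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    matched = conn.execute(
        "SELECT predicate FROM triples WHERE object = ?", (value,)
    ).fetchall()
    conn.close()
    # Only rows whose predicate matches are useful to the query at hand.
    relevant = [p for (p,) in matched if p == predicate]
    return len(matched), len(relevant)


matched, relevant = index_hits(ROWS, "1999", "founded")
print(matched, relevant)
```

With this sample data, a lookup on the key "1999" matches three rows, of which only one concerns the `founded` predicate. The same rows could equally be clustered by subject, by predicate, or by object, and each choice favors a different set of queries.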