In a relational DBMS, data is contained in one or more relations, where a relation comprises a relation schema and a relation instance. The relation instance is a table and the relation schema describes properties of the table. Specifically, the relation schema may specify a name of the relation, a name of each field (or column), and a domain of each field, where the domain is a data type and/or a set of values. The relation instance is a set of tuples with each tuple having fields meeting the specifications in the relation schema.
For example, suppose the database includes a table called “person”. The relation schema for the table is                personSchema:={(id; number), (name, varchar2(50)), (weight; number), (height; number), (shoesize; number), (creditscore; number)},        
where the first item in each pair of the set is the name of the field and the second item is the domain of the field. Thus, the table, shown below, is the set of tuples with values for the fields whose names and domains were specified by the relation schema.
TABLE 1idnameweightheightshoesizecreditscoreid1name1weight1height1shoesize1creditscore1id2name2weight2height2shoesize2creditscore2
Databases often store a relation instance or table as logical collections of tuples or records, where each collection is a data block in a data storage structure, such as a file. There are several ways to store the tuples in a data block. One way is “row” format, in which the values for the fields of a particular tuple are stored sequentially within an address space of the data block. Another way is “columnar” format, in which values in a particular column for all of the tuples in a data block are stored sequentially in the address space of the data block.
Sparse columns in tables are columns having few non-NULL entries. Such tables create inefficiencies both in storing the table and in accessing the table to answer a query. The kind of inefficiency depends on the format in which the table is stored. For example, if a table with sparse columns is stored in row format, the format wastes storage space because the records have columns filled with mostly NULL values. In addition, in a particular implementation, the row format may place a limit on the number of columns permitted for a table, thus preventing the table from representing all the data. If a table with sparse columns is stored in columnar format, the records waste less storage, but require more computation to retrieve and collect together the several columns for a particular row.
Approaches to address the limitations and inefficiencies of storing and accessing tables with sparse columns have been implemented on the user side of a DBMS. These user-side approaches include a flex-field approach and a name-value pair approach. In the flex-field approach, a set of spare columns is overloaded with different attributes for different rows. For example, a spare column X may hold a “weight” attribute for one row and a “shoesize” attribute for a different row. A database application maintains logic to map the attributes to the physical spare columns depending on the particular row and to generate a proper database statement to query the correct spare columns when a query involves a sparse column. In addition, spare columns of different built-in data types, such as number, varchar, data, timestamp, interval, are needed because data for sparse columns may have different types. In this approach, the main drawback is writing queries for common attributes and different queries for sparse attributes.
In the name-value pair approach, each non-NULL column of a tuple in a table is stored as a row of a table with three columns, (1) row id, (2) attribute name, and (3) attribute value, as shown below.
row_idattribute_nameattribute_valuerow1attr1val1row1attr2val2row1attr3val3row1attr4val4row2attr1val1row2attr2val2row2attr5val5row2attr6val6row3attr1val31row3attr6val36row3attr7val7row3attr8val8
Querying relations in this approach require a potentially large number of self-joins. For example, a multi-column-value search query, such as the one below, needs to return rows that satisfy the predicate in the WHERE clause, which requires two self-joins of the table, one for attr1=‘val1’ and one for attr2=‘val2’.
SELECT row_idFROM triple t1, triple t2WHEREt1.row_id = t2.row_id ANDt1.name = ‘attr1’ AND t1.value = ‘val1’ ANDt2.name = ‘attr2’ AND t2.value = ‘val2’
In general, the number of self-joins is equal to the number of column-value search pairs in the query. A query with many of these self-joins causes the query execution time to increase substantially.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.