The present invention relates to querying data in a database and more specifically to a method of selecting appropriate query parameters to efficiently perform a query.
Databases commonly store data in one or more fact tables representing instance data of some nature, along with dimension tables defining attributes for data in the fact tables. As shown in FIG. 1, the fact tables have columns which each represent an element of a particular dimension associated with the instance in question, and one or more measure columns containing data relating to that particular instance. Often, the measures will be values which can be aggregated in some way if records in the fact data are grouped together. For example, the entries might be summed or averaged. However, this might not be the case, and "measure data" in certain measure columns in the fact tables might just be an arbitrary string or other type of value which cannot be aggregated. This invention operates on any type of fact table as long as it is related to one or more dimensions with attributes.
Dimensions represent a bounded set of entities within a class, each with one or more attributes which can be classified in a common manner. A dimension is normally represented in a dimension table with one or more or columns in the table uniquely indenting each entity in the dimension, known as the key value or values. Key values will often be arbitrary identifiers given to the entities in the dimension to uniquely identify them. Other columns in the table, possibly including one of the key columns, provide different attributes of each of the entities. These attribute values can be used to group the entries in the dimension table at different levels, and to extract or aggregate data from the fact tables according to their attributes.
For space efficiency, and to avoid redundancy that could lead to inconsistencies in the database, only the key values for each dimension are stored in the fact data tables. It should be noted that internal representations with a one to one correspondence with the key values in the dimension in question could be stored in the fact tables if more efficient. These would never be seen by the user. In such a case the internal representation can be considered the key value for the purposes of this discussion.
Query examples in this patent application will be given in the SQL language as this is by far the most prevalent query language used today. However, it will be apparent that the invention described herein can be implemented equally effectively with other query languages.
If a query is to be performed on the data in the database using the attribute values requested by the user, the dimension tables have to be included in the query, as in the following example:
SELECT fact.key1, fact.key2, fact.key3, fact.m1, fact.2 PA0 FROM fact, dim1, dim2, dim3 PA0 WHERE PA0 dim1.key1=fact.key1 AND PA0 dim2.key2=fact.key2 AND PA0 dim3.key3=fact.key3 AND PA0 dim1.attr1 in (attrval11, attrval12, . . . , attrval1N) AND PA0 dim2.attr2 in (attrval21, attrval22, . . . , attrval2P) AND PA0 dim3.attr3 in (attrval31, attrval32, . . . , attrval3Q); PA0 SELECT fact.key1, fact.key2, fact.key3, fact.m1, fact.2 PA0 FROM fact PA0 WHERE PA0 fact.key1 in (val11, val12, . . . , val1N) AND PA0 fact.key2 in (val21, val22, . . . , val2P) AND PA0 fact.key3 in (val31, val32, . . . , val3Q); PA0 SELECT fact.key1, fact.key2, fact.key3, fact.m1, fact.2 PA0 FROM fact, dim1, dim2, dim3 PA0 WHERE PA0 dim1.key1=fact.key1 AND PA0 dim2.key2=fact.key2 AND PA0 dim3.key3=fact.key3 AND PA0 fact.key1 in (val11, val12, . . . , val1N) AND PA0 fact.key2 in (val21, val22, . . . , val2P) AND PA0 fact.key3 in (val31, val32, . . . , val3Q);
In this type of query, the tables specified in the query are joined together on any common fields in each of the tables. In the above example, the key1 field is common to both the dim1 table and the fact table. The key2 field is common to both the dim2 table and the fact table. The key3 field is common to both the dim3 table and the fact table.
Using such a join, an entry is created in the output table for all combinations of entries in both of the two joined tables where the joined field is the same in each table. The joined field only appears once in the output joined table. For example, joining the two tables shown in FIGS. 2A and 2B results in the output table shown in FIG. 2C.
However, if an initial mapping of the attribute values selected in each dimension is made onto the key values in that dimension, it is not necessary to join the dimension tables with the fact table in the query, as all the necessary information is in the fact table. For example, the query might read as follows:
This will often be much more efficient than the equivalent attribute logic query, depending on how the database handles the query, because the dimension tables do not need to be included in the query. Furthermore, a database engine might be able to handle key values much more efficiently than attributes due to various optimizations.
Databases might be optimized for example by indexing the fact table by key value in each dimension. The appropriate fact data to include in the resultant data set can then be very quickly found by scanning through the entries for each key in the index, as the indexes associated with a particular key are arranged consecutively.
If such an index based optimizing scheme is used by the database, the following type of query will often be more advantageous to use, even when querying on key values:
Regardless of what query logic is used by current database querying tools, they all use the same transformations from the request into the appropriate SQL query, regardless of the number of entries which will be searched for using the query logic selected. For example, in some situations, selections made, especially at higher levels in dimensions, result in a large number of records from a dimension table matching selection criteria. For example, one might ask for all the stock funds of all the mutual funds on the market. In fact, there might be hundreds of dimension entries matching the selection criterion. If the database querying tool is automatically set up to transform the selected dimension attribute entries into key values in the fund dimension, the dimensions would be searched to find the key values matching the selection criteria. The entries in question would then be added to the query using an "IN" list, as shown above. A problem arises when the number of key values gets very large. Most database systems impose a limit on the number of values in a single "IN" list, and the query tool therefore has to split the query into a number of queries, searched for in the database. Furthermore, even if the dimensions are not indexed, it might actually take longer to perform a query with a significant number of key values using only the fact table than to perform the query using the few attribute values queried and introducing the dimensions into the query. For example, it might be almost as quick for a database engine to look up an attribute value the associated dimension table and check to see if it matches as it would to see if a key value matches. If there are significantly more key values in the equivalent query than the number of attribute values it would almost certainly be quicker to look up the attribute values in the associated dimension, than to check all the key values to see if one of them matches.
It can be seen that a query tool is required that can choose between alternative query structures depending on the actual attributes queried.