The present invention relates to answering relational database queries. In particular, the present invention provides a system and method which relates to answering relational database queries by rewriting the query using hierarchical relationships between data.
A database is a collection of data, usually pertaining to some reasonably well defined purpose. In order to say that data has been stored in a database, as opposed to just being stored, certain conditions are typically satisfied. The data typically has a known format which is defined by metadata. Metadata is generally understood as data about data. The data in a database is typically stored, retrieved, and modified by a special type of computer program.
This special type of computer program is typically a database management system. A database management system (DBMS) is a special computer program built to store data into, retrieve data from, and modify data stored in a database.
A relational database management system is a type of database management system that stores information in tables, in rows and columns of data, and conducts searches. In a relational database, the rows of a table typically represent records (collections of information about separate items) and the columns typically represent fields (particular attributes of a record). In conducting searches, a relational database may match information from a field in one table with information in a corresponding field of another table to produce a third table that combines requested data from both tables. For example, if one table contains the fields EMPLOYEE-ID, LAST-NAME, FIRST-NAME, DEPT-ID, SALARY and HIRE-DATE, and another table contains the fields DEPT-ID, DEPT-NAME, and LOCATION, a relational database can match the DEPT-ID fields in the two tables to find such information as the names of all employees working in a department in a specified location. A relational database may use matching values in two tables to relate information in one to information in the other.
When querying a relational data base, a data sublanguage known as SQL is commonly used. SQL is a data sublanguage used in querying, updating, and managing relational databases. A data sublanguage is one that may be used in association with another computer language for the specialized purpose of accessing data. Examples of languages which SQL can be associated with include COBOL, Fortran, and C. In most cases, a computer application has a mixture of requirements. It is usually helpful if the application writer can build portions of the application by using a language well suited for that portion of the application, and reverting to a specialized data language only for those parts of the application that require it. In this way, the data language is often viewed as a data sublanguage with respect to the primary programming language, or host language. Although SQL is not a programming language in the same sense as C or Pascal, SQL can either be used in formulating interactive queries or be embedded in an application as instructions for handling data. The SQL standard also contains components for defining, altering, controlling, and securing data. SQL is designed for both technical and nontechnical users.
There are many relational data sublanguages. Examples of these relational data sublanguages include QUEL from Relational Technology, Inc. (ASK/Ingres), and RDML from Digital Equipment Corporation. However, SQL has been formally standardized for access to relational databases and is widely implemented and used. There are many variations of SQL.
One example of such a variation is RISQL(trademark) from Red Brick Systems.
In the relational database model, the basic unit of data is typically considered the relation. In SQL, the relation is represented by a table, with a primary difference between a relation and a table being that a relation typically does not have duplicate rows while a table can. A relation is typically made up of one or more attributes. These attributes are represented in column form in SQL. Each column has associated with it a data type. A data type is a characteristic of a piece of data, such as a character string, an integer, or a floating point number. Data is typically stored in a table in tuples, referred to in SQL as rows. As a result, a SQL table contains columns identifying the individual data elements, typically by name, and rows representing records of related data.
FIGS. 1A-1D show examples of relational database tables with attributes written in columns and data related to those attributes written in rows. For example, in FIG. 1A, the attributes include product ID, product type, and bar code. FIGS. 1B-1D show similar examples of tables related to sales, time, and results.
Specific operations can be performed on these tables, rows, and on individual data elements. One such operation is selection, which identifies a specific row or rows in a table. Selection is typically done by specifying one or more predicates that are used to filter a table to identify rows for which the predicate is true. Predicates are typically found in the xe2x80x9cwherexe2x80x9d clause of an SQL query. For example, a selection operation could request the selection of product ID equal to 1. In the example shown in FIG. 1, this selection would select the first row of FIG. 1A with product ID xe2x80x9c1xe2x80x9d.
Another operation in the relational database model is called the join. A join operation is a way of combining data from two tables which is typically based on the relationships between the data in those tables. For example, the result in FIG. 1D is an example of join operations between the xe2x80x9cproductxe2x80x9d table of FIG. 1A, the xe2x80x9csalesxe2x80x9d table of FIG. 1B, and the xe2x80x9ctimexe2x80x9d table of FIG. 1C. The product table of FIG. 1A identifies the product type by product ID, while the sales table identifies amount in dollars related to the product ID and a time key. The product table and the sales table are joined based on the product ID column. The sales table also associates a time key with each row, while the time table identifies a day with each time key. Accordingly, the sales table and the time table are joined based on the values of the time key. Note that the result in FIG. 1D is only a subset of possible result columns. Only an example of columns which have been selected is shown in FIG. 1D.
FIG. 2 shows relationships between a fact table and its dimension tables. Fact tables and dimension tables are a subset of base tables. Base tables are database tables, while fact tables are base tables which record events, such as a sales event. A sales event may include information related to a time, a store, and a product. The tables from which this related information is stored are the fact table""s dimension tables. For example, a time table, a store table, and a product table are the sales table""s dimension tables.
In the example shown in FIG. 2, the fact table is a sales table and the dimension tables are a time table, a product table, and a store table. For instance, the sales fact table may include the total sales at a particular store, on a particular date, within a particular time frame, of a particular product. Placing repeated information in a separate table is a type of normalization.
Normalization eliminates redundancy. This class table in FIG. 2 is a result of normalization, and eliminates redundancy in the product table.
Another concept in relational database models is functional dependency. A functional dependency is a many-to-one relationship between columns of values in database tables. A functional dependency from column x to column y is a constraint that requires two rows to have the same value for the y column if they have the same value for the x column. A functional dependency may be explicitly declared by a user, such as the database administrator.
Further, relational database models provide for an aggregation query, which is a query that requires the summarization or consolidation of rows in database tables, typically using a set function, such as SUM or COUNT, and an optional GROUP BY clause. An aggregate table is typically a table that summarizes or consolidates detail level records from other database tables.
SQL also provides the ability to define a virtual table and save that definition in a database as metadata, with a user-defined name. The object formed by this operation is called a view. A view is an example of a virtual table. It is generally not physically materialized anywhere until it is needed, for instance when a SQL statement references it. The metadata about the view can include information such as the name of the view, the names and data types of each column and the SQL text describing the data the view produces. The metadata about the view is typically stored in the database""s metadata, but the actual data that the user will see in the view are typically not physically stored. Rather the data are typically stored in base tables from which the view""s rows are derived. In the case of a materialized view, data is typically stored in an associated table. In general, operations can be performed against views just as they can be performed against base tables.
By using a sublanguage, such as SQL, a user can request information such as how many units of cereal X were sold on Jan. 1, 1998. The result to that query can be derived from the product table shown in FIG. 1A, the sales table shown in FIG. 1B, and the time table shown in FIG. 1C. However, deriving the answers to each question from various tables can be highly inefficient and time consuming. In order to increase efficiency, a database administrator can predict questions which are most likely to be asked and precompute a table which includes the answers to these likely questions prior to these questions actually being asked. For example, a store""s database administrator may determine that a frequently asked question is the total sum of sales of a given product over a period of one day. Accordingly, the database administrator can create a xe2x80x9cmaterialized viewxe2x80x9d which is associated with a precomputed table, such as the result table shown in FIG. 1D.
However, a database administrator typically can not be expected to anticipate all common questions likely to be asked and even if all queries were known it is not feasible to precompute/materialize all results. When a query is asked and the answer is not directly available from a precomputed table, then conventionally, the answer to such a query typically needs to be derived from a combination of the base tables. The derivation from base tables can be highly time consuming and can require a substantial amount of processing resources.
What is needed is a system and method for quickly and efficiently deriving an answer to a relational database query using materialized views where the materialized view definition may or may not exactly match the query. The present invention addressed such a need.
For further background information regarding relational databases and SQL, see Understanding the New SQL: A Complete Guide, Jim Melton and Alan R. Simon, Morgan Kaufmann Publishers, San Francisco, Calif., 1993.
A system and method for answering a relational database query is presented. According to an embodiment of the present invention, a database query is received. It is then determined whether that query can be rewritten in such a manner as to be able to utilize a materialized view, herein referred to as a precomputed view, such that an answer to the rewritten query is equivalent to an answer to the original query. If the query can be rewritten, it is determined whether the rewritten query can more efficiently derive the answer than the original query. The query which has the most efficiently derived answer is the query which is utilized to derive that answer.
According to an embodiment of the present invention, in rewriting the original query, predetermined hierarchies of data can be utilized. These predetermined hierarchies are treated as functional dependencies. These predetermined hierarchies allow the deliverance of multi-dimensional database (MDDB) functionality, such as rollups from aggregates on points in a dimension to other points of coarser granularity in the same dimension.
A method according to an embodiment of the present invention for answering a database query is presented. The method comprising steps of providing a database query; determining if the query should be rewritten; and rewriting the database query by utilizing a predetermined hierarchy of data, the rewriting being performed if it is determined that the query should be rewritten.
In another aspect of the invention, a system according to an embodiment of the present invention for answering a database query is also presented. The system comprises a means for providing a database query; a means for determining if the query should be rewritten; and a means for rewriting the database query by utilizing a predetermined hierarchy of data. The rewriting of the database query is performed if it is determined that the query should be rewritten.