This application relates to processing (including managing) and characterizing precomputed views in response to user queries on a database.
A database is a collection of data, usually pertaining to some reasonably well defined purpose. The data typically has a known format which is defined by metadata The data in a database is typically stored, retrieved, and modified by a database management system. A relational database management system stores information in tables, in rows and columns of data, and conducts searches. In a relational database, the rows of a table typically represent records (collections of information about separate items) and the columns typically represent fields (particular attributes of a record). A relational database management system may respond to user queries on the database by matching information from a field in one table with information in a corresponding field of another table, and producing a third table that combines data from both tables. For example, if one table contains the fields EMPLOYEE-ID, LAST-NAME, FIRST-NAME, DEPT-ID, SALARY and HIRE-DATE, and another table contains the fields DEPT-ID, DEPT-NAME, and LOCATION, a relational database management system may match the DEPT-ID fields in the two tables to find the names of all employees working in a department in a specified location.
Users typically query, update and manage a relational data base using a data sublanguage (e.g., SQL). A data sublanguage is one that may be used in association with another computer language for the specialized purpose of accessing data. There are many relational data sublanguages, including QUEL from Relational Technology, Inc. (ASK/Ingres), and RDML from Digital Equipment Corporation. SQL has been formally standardized for access to relational databases and is widely implemented and used, although there are many variations of SQL (e.g., RISQL(trademark) from Informix Software, Inc. of Menlo Park, Calif.).
In the relational database model, the basic unit of data is the relation. In SQL, the relation is represented by a table. A relation is typically made up of one or more attributes (represented as columns in SQL), each of which is associated with a data type (e.g., a character string, an integer, or a floating point number). Data typically is stored in a table in tuples (rows in SQL).
Referring to FIGS. 1A-1D, the relational database tables Product, Sales, Time and Result contain columns of attributes and rows of data related to those attributes. For example, the Product table of FIG. 1A, includes prod ID, product type, and bar code. Specific operations can be performed on these tables. One such operation is selection, which identifies a specific row or rows in a table. Selection is typically done by specifying one or more predicates that are used to filter a table to identify rows for which the predicate is true. Predicates are typically found in the WHERE clause of an SQL query. For example, a selection operation could request the selection of prod ID of 1, which would select the first row of the Product table. Another operation in the relational database model is called the join. A join operation is a way of combining data from two tables which is typically based on the relationships between the data in those tables. The Product table identifies the product type by prod ID, and the Sales table identifies the amount of dollars associated with each prod ID/time (time key) combination. The Product table and the Sales table may be joined through their prod ID columns. The Sales table also associates a time key with each row, and the Time table relates a day with each time key. Accordingly, the Sales table and the Time table may be joined through the time key values.
FIG. 2 shows relationships between a fact table and its dimension tables. Fact tables and dimension tables are a subset of detail tables. Fact tables are detail tables which record events (e.g., a sales event). The tables in which information related to the sales event (e.g., a time, a store, and a product) is stored are the dimension tables of the associated fact table. For example, Time table, Store table, and Product table are the dimension tables associated with the Sales table. The Class outboard (dimension of a dimension) table eliminates redundancies by repeated information (the relationships between products and classes) in a separate table. This feature is referred to as normalization.
Another concept in relational database models is functional dependency. A functional dependency is a many-to-one relationship between columns of values in database tables. A functional dependency from column x to column y is a constraint that requires two rows to have the same value for the y column if they have the same value for the x column. A functional dependency may be explicitly declared by a user, such as the database administrator.
Further, relational database models provide for an aggregation query, which is a query that requires the summarization or consolidation of rows in database tables, typically using a set function, such as SUM or COUNT, and an optional GROUP BY clause. An aggregate table is typically a table that summarizes or consolidates detail level records from other database tables.
SQL enables users to define a virtual table (a xe2x80x9cviewxe2x80x9d) and to save that definition in a database as metadata. A view usually is not physically materialized (or xe2x80x9cprecomputedxe2x80x9d) until it is needed (e.g., when a SQL statement references the view). The metadata about the view can include information such as the name of the view, the names and data types of each column and the SQL text describing the data the view produces. The metadata about the view is typically stored in the database""s metadata, but the actual data that the user will see in the view are typically not physically stored. Rather the data typically are stored in detail tables from which the view""s rows are derived. In the case of a precomputed view, data typically is stored in an associated precomputed table. In general, operations can be performed against views just as they can be performed against detail tables.
A user may request information such as how many units of cereal X were sold on Jan. 1, 1999. The result of that query may be derived from the Product table (FIG. 1A), the Sales table (FIG. 1B), and the Time table (FIG. 1C). However, deriving the answers to each question from various tables can be highly inefficient and time consuming. In order to increase efficiency, a database administrator may predict questions which are most likely to be asked and precompute a table which includes the answers to these likely questions prior to these questions actually being asked. For example, a store""s database administrator may determine that a frequently asked question is the total sum of sales of a given product over a period of one day (sum_dollars). Accordingly, the database administrator may create the Precomputed table (FIG. 1D) and a precomputed view associated with that table.
The database administrator typically can not be expected to anticipate all common questions likely to be asked and even if all queries were known it is not feasible to precompute/materialize all results. When a query is asked and the answer is not directly available from a precomputed table, the answer to such a query typically is derived from one or more detail tables, a process which may be highly time consuming and computationally intensive.
The invention features methods and apparatus for processing precomputed views for answering user queries on a database.
In accordance with one aspect, a user database query on precomputation strategy effectiveness is responded to by defining, based upon user database query history, an analysis space consisting of a subset of all possible views for the database, and by characterizing the views in the analysis space.
In accordance with another aspect, a structure is imposed on an analysis space consisting of a subset of all possible views for the database based upon the capabilities of a query rewriting facility, and the views in the analysis space are characterized.
In accordance with another aspect, an analysis space consisting of a subset of all possible views for the database is defined, and a cost formula is applied to the analysis space based upon a user-defined subset of the data contained in the database.
In accordance with another aspect, an analysis space that includes a candidate view composed of a combination of two or more constituent views is defined, and the views in the analysis space are characterized.
Embodiments may include one or more of the following features.
An analysis space may be defined, at least in part, by generating a log record for received user queries. The log record may be generated from query blocks parsed from user queries. The process of generating a log record for a user query may include a determination of whether the query was rewritten to use a precomputed view. The process of generating a log record may include a determination of whether rewritten user queries include aggregation. The process of generating a log record may include an identification of a candidate view that could be used to rewrite user queries. The log record generated for a query may include information about a precomputed view used to rewrite the query. The log record generated for a user query may include information about a candidate view that could be used to rewrite the query.
The analysis space may be defined based upon capabilities of a query rewriting facility. The definition of the analysis space may reflect whether or not user queries have been rewritten. The definition of the analysis space may reflect whether or not rewritten user queries include aggregation. The query rewrite facility may be configured to impose a structure on the analysis space. The imposed structure may include a graph representative of hierarchical relationships between views in the analysis space.
The analysis space may be defined based upon meta data. The analysis space may be defined based upon hierarchical relationships among views. The defined analysis space may include precomputed views. The defined analysis space may include candidate views generated based upon user query history.
The views in the analysis space may be characterized, at least in part, by generating a measure of precomputed view utilization. The views in the analysis space may be characterized, at least in part, by generating a measure of the benefit of a precomputed view relative to other views in the analysis space. The views in the analysis space may be characterized, at least in part, by generating a report that reflects current precomputation strategy effectiveness. The views in the analysis space may be characterized, at least in part, by generating a report that includes information about candidate views that could be created. The views in the analysis space may be characterized, at least in part, by generating a measure of the benefit of a candidate view relative to other views in the analysis space. The views in the analysis space may be characterized, at least in part, by applying a cost formula to the analysis space.
Structure may be imposed on the analysis space based upon meta data. The structure may be imposed on the analysis space based upon hierarchical relationships between views in the analysis space. At least one of the constituent views of the candidate view may be a candidate view. At least one of the constituent views of the candidate view may be a precomputed view. The defined analysis space may include the constituent views of the candidate view. The defined analysis space may not include the constituent views of the candidate view.
Among the advantages of the invention are the following.
The invention helps the database administrator with creating and evaluating the optimal set of precomputed aggregates to satisfy a system""s unique performance and space requirements. The invention provides an analysis of precomputation strategy effectiveness based upon user query history. The invention may perform the analysis on a user-defined subset of the detail data, reducing analysis time. The invention may refine the space of candidate views used in the analysis to consider views that subsume one or more candidate views in the subspace. The invention may define a structure on the analysis space that allows a cost-benefit analysis of a precomputed or candidate view to be determined in relation to other views in the space.
End-users and applications may continue to query the database as they always have and the query rewriting facility transforms the queries to utilize the existing aggregates. The invention enables a database administrator to tune the database""s aggregate performance without affecting the way queries are submitted. In addition, all aggregate-related metadata is integrated into the database system""s catalog, including intra-dimensional hierarchy relationships.
Other features and advantages will become apparent from the following description, including the drawings and the claims.