The present invention relates to generation of statistics in the management and execution of relational database queries.
Relational Database Management System (RDBMS) software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American Nationals Standard Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software, all data is externally structured into relations, each relation dealing with one or more attributes and comprising one or more tuples of data, each tuple associating attribute values with each other. A relation can be visualized as a table, having rows and columns (indeed, the relations in a particular database are often referred to as the xe2x80x9ctablesxe2x80x9d of the database). When a relation is visualized as a table, the columns of the table represent attributes of the relation, and the rows of the table represent individual tuples or records that are stored using those attributes. To aid in visualizing relations in this manner, In the following, the relations in an RDBMS system will frequently be referred to as that system""s xe2x80x9ctablesxe2x80x9d.
An RDBMS system may be used to store and manipulate a wide variety of data. For example, consider a simple RDBMS system for storing and manipulating name and address information. In such a system, a xe2x80x9cName/Addressxe2x80x9d table storing name and address attributes might have a first column xe2x80x9cLast Namexe2x80x9d for the last name, a second column xe2x80x9cFirst Namexe2x80x9d for the first name, a third column xe2x80x9cM.I.xe2x80x9d for middle initial, and further columns for addresses. Each row in the table would include the name and address data for a particular individual.
Often, columns in different tables are related. Thus, in the above example, the xe2x80x9cName/Addressxe2x80x9d table might have a column for a street address, and a column for a ZIP code or other postal code (i.e., a unique identifying index number or character string that identifies a general location of a street address). In this example, each row of the xe2x80x9cName/Addressxe2x80x9d table identifies a postal code for the individual described by that row, but does not identify a city. A second xe2x80x9cCityxe2x80x9d table might store city names; in two columns, a first column for the city name and a second column for the postal code of the city. Note that there may be multiple persons in each postal code and so the xe2x80x9cName/Addressxe2x80x9d table is likely to have multiple entries identifying the same postal code. Furthermore, there are likely to be multiple postal codes in the same city, so the xe2x80x9cCityxe2x80x9d table is likely to have multiple rows for the same city.
It should be noted that the order of tuples in a relation is not considered a feature of the relation; that is, two relations having the same tuples, but in a different order, are entirely equivalent to the RDBMS system. Tuple order is determined by user commands, if any, that specify a manner in which to sort the tuples. Absent a specified sort order, the order of the tuples in a relation is not defined, and tuples from the relation may be reported in any order.
An overall database organization is typically referred to as a schema for the database. A database schema is often compactly expressed using table names and names of columns in the table. Thus the simple schema including xe2x80x9cName/Addressxe2x80x9d and xe2x80x9cCityxe2x80x9d tables described in the above example, could be expressed as:
Name/Address(LastName,FirstName,M.I.,PostalCode, . . .)
City(City,PostalCode)
Database schema often take the form of a xe2x80x9cstarxe2x80x9d, where there is one very large xe2x80x9cmotherxe2x80x9d table and many small detail tables. Queries typically involve selections based on the attributes of the detail tables, e.g. the City table storing city names, followed by retrieval of information for the selected tuples from the mother table, e.g. the Name/Address table storing names and addresses for persons. From the foregoing description, it can be seen that to find the persons in a particular city, the rows in the xe2x80x9cCityxe2x80x9d table would be scanned to find those rows having the desired city name, and then the postal codes in those rows would be retrieved. Then, the xe2x80x9cName/Addressxe2x80x9d table would be scanned to locate all rows from the table having the postal code in the postal code column. The names and addresses for the selected rows are the individuals residing in the desired city.
A typical way of looking up information in tables uses indexes. For example, there may be an index into the Name/Address table identifying all of the rows that have a particular value for a postal code. This index is stored separately from the table and must be updated each time the table itself is updated. Accordingly, indexes introduce a substantial increase in storage requirements. However, if no index is available, a query into a table can only be satisfied by scanning each row of the table, which requires substantially longer processing time. In an environment such as decision support, where selections may be on an arbitrary number of detail tables, maximum speed requires indices on most or all columns of some tables, making such applications space-intensive. Typically, in other environments, performance is compromised to reduce storage requirements, by not providing a column index for at least some columns of a table.
One type of index is a bitmap index, which indicates whether a specific value exists for each row in a particular column. One bit represents each row. Thus, in the bitmap index for the value xe2x80x9c45246xe2x80x9d in the column xe2x80x9cPostal Code,xe2x80x9d the nth bit equals 1 if the nth row of the Name/Address table contains a postal code of xe2x80x9c45246xe2x80x9d, or 0 if that row holds a value other than xe2x80x9c45246xe2x80x9d. Typically there are multiple bitmap indexes for each column, one for each of several values that may appear in the column (e.g., one index for the value xe2x80x9c45246xe2x80x9d, a second index for the value xe2x80x9c45202xe2x80x9d, and so on). Another type of index is an encoded vector index. (EVI), disclosed U.S. Pat. No. 5,706,495, issued Jan. 6, 1998 to Chadha et al., entitled ENCODED-VECTOR INDICES FOR DECISION SUPPORT AND WAREHOUSING, which is incorporated by reference. An EVI serves a similar purpose as a bitmap index, but only one index is necessary to account for all the values occurring in the column (whether they be xe2x80x9c45246,xe2x80x9d xe2x80x9c45202,xe2x80x9d or any other). In an EVI on the xe2x80x9cPostal Codexe2x80x9d column, the nth position of the EVI contains a bit code, that can be decoded using a lookup table to produce the value xe2x80x9c45246xe2x80x9d, which is the postal code in the nth row of the table. Thus, whereas a separate bitmap index is required to map each particular key value in a database field, only one EVI is required to represent the same information. Thus, an EVI saves computer memory by including all possible key values for a given field in one database index. Notably, however, both a bitmap index and EVI index indexes only information relating to a single column of the table. These indexes do not reflect the relations between values in multiple columns.
Turning now to a description of SQL, the main RDBMS operation described in the preceding examples is known as a JOIN, in the illustrated case, between the xe2x80x9cCityxe2x80x9d table and xe2x80x9cName/Addressxe2x80x9d table. This is one example of the many operators that are provided by an SQL interface to RDBMS software. The SQL interface allows users to formulate relational operations on tables either interactively, in batch files, or embedded in host languages such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data; each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures such as the simple procedure in the above example, with a single statement.
The operators provided by SQL are derived from an original set of eight operators:
RESTRICT Extracts specified tuples from a specified relation (i.e., retrieves specified rows from a table) using a specified condition;
PROJECT Extracts specified attributes from a specified relation (i.e., retrieves specified columns from a table);
PRODUCT Builds a relation from two specified relations containing all possible combinations of tuples, one from each of the two relations (i.e., builds a table with rows created by combining all rows of two tables);
UNION Builds a relation consisting of all tuples appearing in either or both of two relations (i.e., builds a table having all rows appearing in either table);
INTERSECT Builds a relation consisting of all tuples appearing in both of two specified relations (i.e., builds a table having all rows appearing in both tables);
DIFFERENCE Builds a relation (table) consisting of all tuples (rows) appearing in the first and not the second of two specified relations (tables);
JOIN Builds a relation (table) from two specified relations (table) consisting of all possible combinations of tuples (rows), one from each of the two relations (tables), such that the two tuples (rows) contributing to any given combination satisfy a specified condition;
DIVIDE Builds a relation (table) from a first relation (table) having first and second attributes (columns) and a second relation (table) having one attribute (column), by selecting from the first table, values of the first attribute (column) of tuples (rows) whose second attribute (column) matches a value in the second table.
With respect to these basic operations, and the more detailed operations derived from them that form the entirety of SQL, it is notable that the result of each operation is a relation that is equivalent in structure to the relations that are used by the operation. This is an important feature in that it permits SQL operations to be written as nested relational expressions, where the output of one operation is used as the input to the next operation. The power of SQL arises in no small part from the ability to string SQL operations together in any desired order to perform a desired function.
As an example of this ability, we can expand the above example to illustrate a three-way table join. For this, assume a Population table is available, which identifies city populations, e.g., the Population table has a first column for a city name and a second column for the population of the city. Given the Name/Address, City and Population tables, it is possible to identify those persons living in cities with populations greater than a given threshold, by JOINing the Population and City tables with a population threshold criterion, and then JOINing the result with the Name/Address table.
At this point, it should be noted that an RDBMS system implementing an SQL query need not materialize the result of every individual SQL operation in its entirety. For example, the foregoing example seeking every address in the database that is in a particular city, formally involves performing a RESTRICT operation (restricting to the desired city) upon the results of a JOIN operation acting on Name/Address and City tables. It would be inefficient to compute the results of the JOIN operation, which would be a table of every address in the Name/Address table expanded to include city names, and then select only those addresses in the desired city. The more efficient approach, as described in the preceding example, would be to apply the restriction to each tuple of the JOIN operation as it is generated, keeping only those which meet the restriction and discarding the remainder. This way, an equivalent result is produced without materializing the result of the JOIN.
To achieve optimizations of this sort, an RDBMS system often includes substantial logic to avoid materializing intermediate results. For such optimizations, an RDBMS system utilizes statistics describing the data upon which SQL operations operate. Two such statistics pertinent to the described examples are a key range estimate and a join fanout estimate. A key range estimate estimates the number of rows in a given table that match a given condition. A join fanout estimate estimates the number of records likely to be produced by a JOIN operation. In the last described example, such statistics would indicate whether there are a large number of postal codes in the database for cities meeting the population criterion, and whether the Name/Address records are grouped in a few cities or spread across a large number of cities. These statistics would then be useful in determining whether it would be more efficient to (a.) join the Name/Address table with the City table before joining the Population table, by building an intermediate result with the name, address and city for every individual, and then selecting from this intermediate result, those individuals in cities meeting the population criterion, or (b.) join the City and Population tables before joining the Name/Address table, by building an intermediate result with the postal codes in the cities meeting the population criterion, and then selecting those individuals in the Name/Address table having the postal codes in the intermediate result. Typically, the most efficient approach is the approach that produces the smallest intermediate result.
Statistics of this form are frequently generated using available indeces for the tables upon which SQL operations will be performed. For example, a key range estimate can be formed from a bit mapped index by totaling the number of xe2x80x9c1xe2x80x9d bits in the index under the identified key value(s). Unfortunately, because many columns in modem RDBMS systems are not indexed, key range and join fanout statistics are often unavailable or inaccurate, resulting in substantial inefficiencies because queries are performed in other than the optimal manner due to the missing or inaccurate statistics.
As will be appreciated from the foregoing, modem RDBMS systems suffer from a number of drawbacks that render their operation less than optimal; key among these are the space requirements associated with generating indexes and, due to the consequent lack of complete indexing, the lack of accurate statistics that can be useful in optimizing RDBMS operations. What is needed is a alternative representation of data in a RDBMS system that facilitates the generation of SQL results and statistics regarding such results without consumption of inordinate storage space.
In accordance with principles of the present invention, these needs are met by a novel representation of an n-attribute relation as an order n relational tensor, thereby facilitating the generation of query results.
Specifically, in accordance with principles of the present invention, a relational database system stores a plurality of tuples formed over a plurality of attributes, in a multi-order relational tensor. Orders of the relational tensor respectively correspond to each of the attributes, and each coordinate along an order relates to a key value of the corresponding attribute. Numeric values are stored in the relational tensor, each numeric value representing a count of tuples having the attribute key values that correspond to the coordinate of the numeric value along the orders of the relational tensor. This storage representation is useful in a variety of contexts for enhancing the performance of a RDBMS system.
Specifically, in a first aspect of the invention, a data-representing relational tensor is processed to generate a statistic for a user query specifying a relational operation on at least one attribute over which the data-representing relational tensor was formed. In this embodiment, a relational database system determines, from among two candidate approaches, an approach to use in processing the user query based on the statistic.
In a second aspect of the present invention, a data-representing relational tensor is used to produce results for a restrict operation such as the SQL operations DISTINCT, PROJECTION, EQUALS, LESS THAN, LESS THAN OR EQUAL, GREATER THAN, GREATER THAN OR EQUAL and LIKE. In accordance with this aspect, a relational database system processes a data-representing relational tensor to produce a result relational tensor identifying the results of the restrict operation.
These and other features and advantages, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawing, and to the accompanying descriptive matter, in which there is described exemplary embodiments of the invention.