The present invention relates to relational database management systems (RDBMS), and more particularly to computerized systems for storing and accessing large amounts of data.
In a non-limiting manner, the invention is applicable to xe2x80x9cdata warehousesxe2x80x9d. On-line transaction processing (OLTP) systems, such as for bank teller transactions and airline reservations, are optimized for finding a record associated with a specific key, e.g. finding the information about employee 123124. By contrast, data warehouses are optimized for finding sets of records very quickly. The reason is that typical queries are of the form: xe2x80x9cfind all sales by region and quarterxe2x80x9d or xe2x80x9cfind stores that sell the greatest volume of sportswear per monthxe2x80x9d or xe2x80x9cselect the top 5 stores for each product category for the last yearxe2x80x9d. Such queries must typically access large sets of rows in data tables. The query processing challenge is to process these queries without doing a linear scan of all or most of the database.
Five main approaches have been proposed to attack this problem: (i) multidimensional arrays; (ii) special indexes; (iii) table caching; (iv) optimized foreign key joins; and (v) approximation.
(i) Multidimensional Arrays (i.e. Matrices)
This strategy consists of implementing the data warehouse as a multidimensional array or matrix. Examples may be found in U.S. Pat. Nos. 5,359,724 and 5,864,857. Each dimension corresponds to an attribute of the data. For example, a sales table can be viewed as a matrix with coordinates: store location, product type, customer id, and so on. A particular sale can be identified by specifying all of these attributes. The strategy works well for small databases or very dense ones. By dense, we mean that the Cartesian product of possible values should all be meaningful, e.g., every customer is likely to buy every product from every store. Since this is rarely true, this scheme must be modified to deal with sparse values. This can be done by defining a notion of sparse attributes and dense ones. So, for example, it might be that every store carries every product (a dense relationship that can be stored in a matrix), but only some of these combinations are valid for any given customer. So, a conventional index would be used whenever customer sales are involved, but a dense one for queries involving store-wide or product-wide sales.
(ii) Special Indexes
Bitmap indexes are an index structure tailored to data warehouses (see, e.g. U.S. Pat. No. 5,903,888). These indexes have already been used in some commercial products to speed up query processing. In its simplest form, a bitmap index on an attribute consists of one vector of bits (i.e. bitmap) per attribute value, where the size of each bitmap is equal to the number of records in the indexed relation. For example, if the attribute is day-of-week, then there would be seven bitmap vectors for that attribute, one for each day. The bitmap vector corresponding to Monday would have a 1 at position i if record i contains xe2x80x9cMondayxe2x80x9d in the day-of-week attribute. This single value-based approach is called a Value-List index. Other techniques (e.g. U.S. Pat. No. 5,761,652) associate bit vectors with ranges of values, so there could, for a salary attribute, be a vector for the range 0 to 20,000 Euros, 20,000.01 to 35,000 Euros, and so on. Still others associate each bit vector with a bit value (a 1 or a 0) in a given position. So, if the attribute holds n bit numbers, then there would be 2n bit vectors (position 1, bit value 1; position 1, bit value 0; position 2 bit value 1; . . . ).
The benefit of bit vectors is that it is easy to use multiple bit vectors to answer a single query. Consider a query on several predicates, each of which is indexed. Most conventional database management systems would use just one of the indexes (the one that is most xe2x80x9cselectivexe2x80x9d so returns the fewest rows), though some systems might attempt to intersect the record identifiers of multiple indexes.
Bitmaps work better, because they are more compact and intersecting several bitmaps is much faster than intersecting several collections of record identifiers. In the best case, the improvement is proportional to the word size of the machine. For example, suppose the word size is 32 bits. Then two bit vectors can be intersected 32 bits at a time. Each set of 32 bits corresponds to 32 record identifiers being intersected. That best case occurs when each predicate is unselective (i.e. many records match each predicate value), but all the predicates together are quite selective. Consider for example the query: xe2x80x9cFind people who have brown hair, glasses, ages between 30 and 40, blue eyes, work in the computer industry, live in California, . . . xe2x80x9d.
So, matrices are best when sets of predicates are dense (all, or nearly all, values in the Cartesian product are possible), bitmaps are best when predicates are neither dense nor individually selective. An intermediate approach (when there is insufficient density for matrices but many values in the Cartesian product are present) is to use multidimensional indexes. Multidimensional indexes such as quadtrees, R-trees and their successors are implemented as variable sized grids on a multidimensional space. The grids are of variable sizes because the population of points differs in different places in a hyperspace. For intuition, consider a map of equi-population rectangles of France. The rectangles would be far more dense in Paris than in the alps. Indexes like this work well for spatial data (where they are used to find the points contained in latitude-longitude quadrants). This alternative is little explored in the commercial arena except for geographical queries, however, because these schemes do not scale well with increasing dimensionality and commercial systems typically have far more than three dimensions.
(iii) Table Caching
If one doesn""t have the luxury to design new indexes on top of a database system (because one is not the implementer of that system) one can pre-compute a large number of anticipated aggregate queries and put them in tables. For example, if a large retailer frequently asks queries that sum the total sales across multiple stores or multiple products, one may store such information in special tables. The main cost of such a strategy is maintaining these tables in the face of updates. (Disk space is no longer a major factor.) In the example, every sale of item I at store S would have to update the total product sales table for I and the total store sales table for S. So, this strategy is worthwhile if there are few updates between queries. The strategy is not worthwhile if there are many.
(iv) Optimized Foreign Key Joins
Most queries in multidimensional tables entail joins between a central xe2x80x9cfact tablexe2x80x9d (e.g. sales detail) and a set of dimension tables (e.g. store description, product description, customer description). These are known as xe2x80x9cforeign key joinsxe2x80x9d since the customer identifier in the sales table, for example, is a key of the customer description table. (A key is a value belonging to an attribute such that only one record has that value in the attribute.) One way to accelerate these joins is to create a linkage between fact table records and dimension records. This can be done in three basic ways
(a) create an index that holds fact table record identifiers and dimension table record identifiers;
(b) create bidirectional pointers between fact table records and dimension table rowsxe2x80x94this is what xe2x80x9cobject-orientedxe2x80x9d databases do;
(c) replace the customer record identifiers in the fact table by offsets into the dimension tables.
Choice (a) is the most independent of changes in the physical organization of the tables and therefore is best for heavily updated systems, because changes to the dimension table can be reflected in the index to that table alone. Choice (b) is the least flexible to physical reorganization, because reorganizing a dimension table would entail updating the fact table. Choice (c) is a compromise of the two in that certain physical reorganizations can be done to the dimension tables (e.g. changing its position on disk) without changing the fact table. Examples of join optimization may be found in U.S. Pat. Nos. 5,548,754, 5,671,403, 5,724,568, 5,752,017, 5,761,657 and 5,822,747.
(v) Approximating the Result
Since most people use data warehouses to get strategic aggregate information, many would be happy with a fast approximation as long as it has error bounds. Typical work in this area is illustrated by U.S. Pat. No. 5,870,752, which shows how to estimate aggregate results in data warehouses while giving error bounds. The basic problem is that sampling all tables and then doing aggregates does not work in general. For example, if one wants to join R and S on their keys, then taking a {fraction (1/10)} sample of each will give a size that is {fraction (1/100)} of the size of the real join if the samples are random. So, one must be more clever. The idea is to take an initial set of tables R, S, T, . . . that are linked by foreign key joins. Suppose for example that R is the fact table and the others are dimension tables. Take a sample of R and then perform all these foreign key joins based on the sample giving Rxe2x80x2, Sxe2x80x2, Txe2x80x2, . . . . Now, if a query involves R, S, T and includes the foreign key links among these, then the query can be done with great accuracy on Rxe2x80x2, Sxe2x80x2, Txe2x80x2. The error can be estimated by considering the result obtained by several partitions of Rxe2x80x2 and looking at their variance.
An object of the present invention is to propose an alternative method of organizing a database management system, which enables an efficient query processing.
The invention proposes a method of organizing information in a database system. A plurality of row identifiers are defined to designate respective rows of a reference table having columns respectively associated with data attributes. These rows contain groups of related words assigned to the attributes in a collection of data. The method comprises storing at least one macroword thesaurus associated with one of the attributes and with a prefix length shorter than a word length of said attribute.
The macroword thesaurus has a respective entry for each prefix value having said prefix length and matching a corresponding prefix of at least one word assigned to said data attribute in the collection of data.
Data representing identifier lists respectively associated with the macroword thesaurus entries are also stored. The identifier list associated with an entry, relating to a prefix value, of a macroword thesaurus associated with an attribute includes any row identifier designating a row of the reference table having a word whose corresponding prefix matches said prefix value in the column associated with said attribute.
Another aspect of the invention relates to a method of processing queries in a database system in which the information is organized as indicated hereabove. A plurality of thesauruses each associated with a respective attribute and data representing reference table row identifier lists respectively associated with entries of these thesauruses are stored. Each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to that attribute and has a respective entry for each subset including at least one word assigned to the attribute in the collection of data. The reference table row identifier list associated with such thesaurus entry includes any identifier allocated to a row of the reference table having a word of the subset assigned to the attribute.
The thesaurus includes at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of that attribute. This macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having the corresponding prefix length. The query processing method comprises the steps of:
analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;
determining a matching reference table row identifier list based on such combination and on the stored data representing the reference table row identifier lists associated with the relevant thesaurus entries; and
processing matching row identifier list to output a response.
The invention further proposes a database system implementing methods as outlined above, and computer program products having instructions for carrying out such method.