The present invention relates to relational database management systems (RDBMS), and more particularly to computerized systems for storing and accessing large amounts of data.
In a non-limiting manner, the invention is applicable to "data warehouses". On-line transaction processing (OLTP) systems, such as for bank teller transactions and airline reservations, are optimized for finding a record associated with a specific key, e.g. finding the information about employee 123124. By contrast, data warehouses are optimized for finding sets of records very quickly. The reason is that typical queries are of the form: "find all sales by region and quarter" or "find stores that sell the greatest volume of sportswear per month" or "select the top 5 stores for each product category for the last year". Such queries must typically access large sets of rows in data tables. The query processing challenge is to process these queries without doing a linear scan of all or most of the database.
Five main approaches have been proposed to attack this problem: (i) multidimensional arrays; (ii) special indexes; (iii) table caching; (iv) optimized foreign key joins; and (v) approximation.
(i) Multidimensional Arrays (i.e. Matrices).
This strategy consists of implementing the data warehouse as a multidimensional array or matrix. Examples may be found in U.S. Pat. No. 5,359,724 and U.S. Pat. No. 5,864,857. Each dimension corresponds to an attribute of the data. For example, a sales table can be viewed as a matrix with coordinates: store location, product type, customer id, and so on. A particular sale can be identified by specifying all of these attributes. The strategy works well for small databases or very dense ones. By dense, we mean that all, or nearly all, combinations in the Cartesian product of possible attribute values are meaningful, e.g., every customer is likely to buy every product from every store. Since this is rarely true, this scheme must be modified to deal with sparse values. This can be done by distinguishing sparse attributes from dense ones. So, for example, it might be that every store carries every product (a dense relationship that can be stored in a matrix), but only some of these combinations are valid for any given customer. So, a conventional index would be used whenever customer sales are involved, but a dense one for queries involving store-wide or product-wide sales.
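The split between dense and sparse attributes can be illustrated with a minimal Python sketch (the store, product and customer names are hypothetical): store-by-product totals live in a dense matrix, while customer-level detail goes into a conventional sparse index.

```python
# Dense/sparse split: a matrix for the dense (store, product) pair,
# a dictionary index for the sparse customer dimension.
stores = ["S1", "S2", "S3"]
products = ["P1", "P2"]
store_idx = {s: i for i, s in enumerate(stores)}
product_idx = {p: j for j, p in enumerate(products)}

# Dense part: every (store, product) cell is meaningful.
totals = [[0.0] * len(products) for _ in stores]

# Sparse part: only some (customer, store, product) combinations occur.
customer_sales = {}

def record_sale(customer, store, product, amount):
    totals[store_idx[store]][product_idx[product]] += amount
    key = (customer, store, product)
    customer_sales[key] = customer_sales.get(key, 0.0) + amount

record_sale("C7", "S1", "P2", 100.0)
record_sale("C7", "S2", "P2", 50.0)

# Store-wide or product-wide queries read the dense matrix directly.
sales_of_P2 = sum(row[product_idx["P2"]] for row in totals)  # 150.0
```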
(ii) Special Indexes.
Bitmap indexes are an index structure tailored to data warehouses (see, e.g. U.S. Pat. No. 5,903,888). These indexes have already been used in some commercial products to speed up query processing. In its simplest form, a bitmap index on an attribute consists of one vector of bits (i.e. bitmap) per attribute value, where the size of each bitmap is equal to the number of records in the indexed relation. For example, if the attribute is day-of-week, then there would be seven bitmap vectors for that attribute, one for each day. The bitmap vector corresponding to Monday would have a 1 at position i if record i contains "Monday" in the day-of-week attribute. This single value-based approach is called a Value-List index. Other techniques (e.g. U.S. Pat. No. 5,761,652) associate bit vectors with ranges of values, so there could, for a salary attribute, be a vector for the range 0 to 20,000 Euros, 20,000.01 to 35,000 Euros, and so on. Still others associate each bit vector with a bit value (a 1 or a 0) in a given position. So, if the attribute holds n-bit numbers, then there would be 2n bit vectors (position 1, bit value 1; position 1, bit value 0; position 2, bit value 1; . . . ).
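A Value-List bitmap index can be sketched in a few lines of Python (the day-of-week data is hypothetical): one bitmap per distinct value, with one bit per record of the indexed relation.

```python
# Value-List bitmap index on a day-of-week attribute.
records = ["Mon", "Tue", "Mon", "Fri", "Tue", "Mon"]

# One bitmap (list of 0/1 flags) per distinct attribute value;
# each bitmap has one bit per record in the indexed relation.
bitmaps = {}
for i, day in enumerate(records):
    bitmaps.setdefault(day, [0] * len(records))
    bitmaps[day][i] = 1   # record i contains this value

# bitmaps["Mon"] now flags records 0, 2 and 5.
```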
The benefit of bit vectors is that it is easy to use multiple bit vectors to answer a single query. Consider a query on several predicates, each of which is indexed. Most conventional database management systems would use just one of the indexes (the one that is most "selective", so returns the fewest rows), though some systems might attempt to intersect the record identifiers of multiple indexes.
Bitmaps work better, because they are more compact and intersecting several bitmaps is much faster than intersecting several collections of record identifiers. In the best case, the improvement is proportional to the word size of the machine. For example, suppose the word size is 32 bits. Then two bit vectors can be intersected 32 bits at a time. Each set of 32 bits corresponds to 32 record identifiers being intersected. That best case occurs when each predicate is unselective (i.e. many records match each predicate value), but all the predicates together are quite selective. Consider for example the query: "Find people who have brown hair, glasses, ages between 30 and 40, blue eyes, work in the computer industry, live in California, . . . ".
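The word-at-a-time intersection can be sketched as follows (the predicates and bit patterns are hypothetical). Python integers serve here as arbitrary-length bit vectors, so a single AND combines all bits at once, much as hardware combines one machine word per instruction.

```python
# Combining several unselective predicates by ANDing their bitmaps.
# Bit i set means record i matches the predicate.
brown_hair = 0b101101
glasses    = 0b100110
blue_eyes  = 0b100101

# One AND per word intersects many record identifiers at once.
matching = brown_hair & glasses & blue_eyes   # 0b100100

# Recover the record identifiers from the combined bitmap.
ids = [i for i in range(matching.bit_length()) if (matching >> i) & 1]
```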
So, matrices are best when sets of predicates are dense (all, or nearly all, values in the Cartesian product are possible), while bitmaps are best when predicates are neither dense nor individually selective. An intermediate approach (when there is insufficient density for matrices but many values in the Cartesian product are present) is to use multidimensional indexes. Multidimensional indexes such as quadtrees, R-trees and their successors are implemented as variable-sized grids on a multidimensional space. The grids are of variable sizes because the population of points differs in different places in a hyperspace. For intuition, consider a map of equi-population rectangles of France: the rectangles would be far denser in Paris than in the Alps. Indexes like this work well for spatial data (where they are used to find the points contained in latitude-longitude quadrants). This alternative is little explored in the commercial arena except for geographical queries, however, because these schemes do not scale well with increasing dimensionality and commercial systems typically have far more than three dimensions.
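The grid idea behind such indexes can be sketched with a fixed-size grid over two-dimensional points (the cell size and coordinates are hypothetical); quadtrees and R-trees refine this by letting cell sizes adapt to the local density of points.

```python
# Grid index over 2-D points: each point is bucketed into a cell,
# so a rectangle query inspects only the overlapping cells instead
# of scanning every point.
from collections import defaultdict

CELL = 10.0   # fixed cell size; adaptive structures vary this

grid = defaultdict(list)

def insert(x, y):
    grid[(int(x // CELL), int(y // CELL))].append((x, y))

def query_rect(x0, y0, x1, y1):
    hits = []
    # Visit only the cells that overlap the query rectangle.
    for cx in range(int(x0 // CELL), int(x1 // CELL) + 1):
        for cy in range(int(y0 // CELL), int(y1 // CELL) + 1):
            for (x, y) in grid[(cx, cy)]:
                if x0 <= x <= x1 and y0 <= y <= y1:
                    hits.append((x, y))
    return hits

insert(3.0, 4.0)
insert(25.0, 7.0)
found = query_rect(0, 0, 10, 10)   # only the first point qualifies
```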
(iii) Table Caching.
If one doesn't have the luxury of designing new indexes on top of a database system (because one is not the implementer of that system), one can pre-compute a large number of anticipated aggregate queries and put them in tables. For example, if a large retailer frequently asks queries that sum the total sales across multiple stores or multiple products, one may store such information in special tables. The main cost of such a strategy is maintaining these tables in the face of updates. (Disk space is no longer a major factor.) In the example, every sale of item I at store S would have to update the total product sales table for I and the total store sales table for S. So, this strategy is worthwhile if there are few updates between queries. The strategy is not worthwhile if there are many.
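The trade-off can be made concrete with a small sketch (hypothetical store and product identifiers): each insert into the detail table must also touch every cached aggregate table, which is exactly the maintenance cost described above.

```python
# Table caching: pre-computed per-store and per-product totals
# that must be maintained on every update.
sales = []                # detail table
total_by_store = {}       # cached aggregate table
total_by_product = {}     # cached aggregate table

def insert_sale(store, product, amount):
    sales.append((store, product, amount))
    # Maintenance cost: each insert updates every cached table.
    total_by_store[store] = total_by_store.get(store, 0) + amount
    total_by_product[product] = total_by_product.get(product, 0) + amount

insert_sale("S1", "P1", 10)
insert_sale("S1", "P2", 5)

# A frequent aggregate query now reads the cache instead of
# scanning the detail table.
s1_total = total_by_store["S1"]   # 15
```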
(iv) Optimized Foreign Key Joins.
Most queries in multidimensional tables entail joins between a central "fact table" (e.g. sales detail) and a set of dimension tables (e.g. store description, product description, customer description). These are known as "foreign key joins" since the customer identifier in the sales table, for example, is a key of the customer description table. (A key is a value belonging to an attribute such that only one record has that value in the attribute.) One way to accelerate these joins is to create a linkage between fact table records and dimension records. This can be done in three basic ways:
(a) create an index that holds fact table record identifiers and dimension table record identifiers;
(b) create bidirectional pointers between fact table records and dimension table rows; this is what "object-oriented" databases do;
(c) replace the customer record identifiers in the fact table by offsets into the dimension tables.
Choice (a) is the most independent of changes in the physical organization of the tables and therefore is best for heavily updated systems, because changes to the dimension table can be reflected in the index to that table alone. Choice (b) is the least flexible to physical reorganization, because reorganizing a dimension table would entail updating the fact table. Choice (c) is a compromise of the two in that certain physical reorganizations can be done to the dimension tables (e.g. changing its position on disk) without changing the fact table. Examples of join optimization may be found in U.S. Pat. No. 5,548,754, U.S. Pat. No. 5,671,403, U.S. Pat. No. 5,724,568, U.S. Pat. No. 5,752,017, U.S. Pat. No. 5,761,657 and U.S. Pat. No. 5,822,747.
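Choice (c) can be sketched as follows (the customer records and amounts are hypothetical): the fact table stores array offsets into the dimension table, so the foreign key join becomes a direct positional lookup.

```python
# Choice (c): the fact table holds offsets (array positions) into
# the dimension table rather than key values or disk pointers.
customers = [             # dimension table, positionally addressed
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": "Lyon"},
]
sales = [                 # fact table: (customer_offset, amount)
    (0, 100.0),
    (1, 40.0),
    (0, 25.0),
]

# The "join" is a direct array lookup, not an index search.
paris_total = sum(amount for off, amount in sales
                  if customers[off]["city"] == "Paris")   # 125.0
```

The dimension table can still be moved on disk without touching the fact table, because only its internal record order matters, not its location.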
(v) Approximating the Result.
Since most people use data warehouses to get strategic aggregate information, many would be happy with a fast approximation as long as it has error bounds. Typical work in this area is illustrated by U.S. Pat. No. 5,870,752, which shows how to estimate aggregate results in data warehouses while giving error bounds. The basic problem is that sampling all tables and then doing aggregates does not work in general. For example, if one wants to join R and S on their keys, then taking a 1/10 sample of each will give a size that is 1/100 of the size of the real join if the samples are random. So, one must be more clever. The idea is to take an initial set of tables R, S, T, . . . that are linked by foreign key joins. Suppose for example that R is the fact table and the others are dimension tables. Take a sample of R and then perform all these foreign key joins based on the sample giving R′, S′, T′, . . . . Now, if a query involves R, S, T and includes the foreign key links among these, then the query can be done with great accuracy on R′, S′, T′. The error can be estimated by considering the result obtained by several partitions of R′ and looking at their variance.
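A minimal sketch of this scheme follows (the tables, the 1-in-10 sampling rate, and the five-way partitioning are all illustrative assumptions): sample the fact table R, follow its foreign keys into a dimension table S, scale the aggregate up by the sampling rate, and estimate the error from the variance of per-partition estimates.

```python
# Approximate aggregation by sampling the fact table and joining
# the sample through its foreign keys.
import random
import statistics

random.seed(1)

# Dimension table S, keyed by s_key; fact table R references it.
S = {k: {"region": "north" if k % 2 else "south"} for k in range(100)}
R = [{"s_key": random.randrange(100), "amount": random.uniform(1, 10)}
     for _ in range(10_000)]

rate = 0.1
R_sample = [r for r in R if random.random() < rate]

# Estimate total "north" sales, scaled up by 1/rate.
estimate = sum(r["amount"] for r in R_sample
               if S[r["s_key"]]["region"] == "north") / rate

# Error estimate: split the sample into partitions and look at the
# spread of the per-partition estimates.
parts = [R_sample[i::5] for i in range(5)]
part_estimates = [5 * sum(r["amount"] for r in p
                          if S[r["s_key"]]["region"] == "north") / rate
                  for p in parts]
spread = statistics.stdev(part_estimates)
```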
An object of the present invention is to propose an alternative method of organizing a database management system, which enables efficient query processing.
The invention proposes a method of organizing information in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes. The group of attributes is divided into a plurality of sub-groups each associated with a respective data table, each data table having a column for each attribute of the associated sub-group and rows for containing data table records comprising at least one word assigned to an attribute of the associated sub-group. Links are defined between the data table records, each link having a target table and a corresponding source table having a link column containing link values each designating a record of said target table. Each of said link values represents a link between the record of the source table including said link value and the record of the target table designated by said link value. The method comprises the steps of:
allocating respective identifiers to data graphs, wherein each data graph represents related attribute values respectively assigned to the attributes of said group, wherein each attribute value of a data graph is either a default value or a word of said collection of data, and wherein the words of each data graph are from linked data table records;
storing a plurality of word thesauruses respectively associated with attributes of said group, wherein for each word assigned at least once to an attribute in the collection of data, the word thesaurus associated with said attribute has a respective entry containing said word; and
storing data representing data graph identifier lists respectively associated with the word thesaurus entries, wherein the data graph identifier list associated with a thesaurus entry relating to a word assigned to an attribute includes any identifier allocated to a data graph having said word assigned to said attribute.
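The storage organization of these steps can be illustrated with a small sketch; the attributes (city, product), the data graphs and their identifiers are all hypothetical, and real data graphs would span linked records of several tables. One thesaurus per attribute maps each word to the list of identifiers of the data graphs in which that word is assigned to that attribute.

```python
# One thesaurus per attribute, each entry holding a data graph
# identifier list for one word of that attribute.
data_graphs = {    # identifier -> (city, product) attribute values
    0: ("Paris", "shoes"),
    1: ("Lyon", "shoes"),
    2: ("Paris", "hats"),
}

thesauruses = {"city": {}, "product": {}}
for gid, (city, product) in data_graphs.items():
    # Each word's entry accumulates the identifiers of every data
    # graph in which that word is assigned to the attribute.
    thesauruses["city"].setdefault(city, []).append(gid)
    thesauruses["product"].setdefault(product, []).append(gid)

# thesauruses["city"]["Paris"] lists the graphs containing "Paris".
```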
The invention further proposes a method of processing queries in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes, the group of attributes being divided into a plurality of sub-groups respectively associated with a plurality of data tables having independent numbers of records, with links between respective records from the data tables. Identifiers are respectively allocated to data graphs, each data graph representing related attribute values respectively assigned to the attributes of said group, each attribute value of a data graph being either a default value or a word of said collection of data. A plurality of thesauruses each associated with a respective attribute of said group and data representing data graph identifier lists respectively associated with entries of said thesauruses are stored. Each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said one attribute and has a respective entry for each subset including at least one word assigned to said one attribute in the collection of data, the data graph identifier list associated with said thesaurus entry including any identifier allocated to a data graph having a word of said subset assigned to said one attribute. The method comprises the steps of:
analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;
determining a matching data graph identifier list based on said combination and on the stored data representing the data graph identifier lists associated with said relevant thesaurus entries;
processing said matching data graph identifier list to output a response.
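These three steps can be sketched as follows (the thesaurus contents are hypothetical, and only conjunctive criteria are shown; the claimed method covers more general combinations of thesaurus entries): the query criteria select thesaurus entries, their identifier lists are combined, and the matching list drives the response.

```python
# Query processing over per-attribute thesauruses whose entries
# carry data graph identifier lists (sets here, for easy combining).
thesauruses = {
    "city":    {"Paris": {0, 2}, "Lyon": {1}},
    "product": {"shoes": {0, 1}, "hats": {2}},
}

def query(criteria):
    """criteria: {attribute: word}; AND semantics across attributes."""
    # Step 1: map each criterion to the relevant thesaurus entry.
    lists = [thesauruses[attr].get(word, set())
             for attr, word in criteria.items()]
    # Step 2: combine the identifier lists (intersection for AND).
    matching = set.intersection(*lists) if lists else set()
    # Step 3: process the matching list into a response.
    return sorted(matching)

result = query({"city": "Paris", "product": "shoes"})   # [0]
```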
The invention further proposes a database system implementing a method as outlined above, and computer program products having instructions for carrying out such method.