The present invention relates generally to the area of database management systems and, more particularly, to optimized methods and system for querying information from those systems.
Computers are a powerful tool for the acquisition and processing of information. Of particular interest to the present invention are database management systems (DBMSs), which can be regarded as a kind of electronic filing cabinet or repository for collecting computerized data files. As such, these systems serve to maintain information in database files or tables and make that information available on demand.
A database management system serves as a software cushion or layer between the physical database itself (i.e., the data actually stored on a storage device) and the users of the system, thus shielding the database users from underlying hardware-level details. In particular, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
Users of a database management system are perhaps most interested in querying a database, that is, retrieving information according to user-supplied selection criteria. Query operations can be performed to find, change, add, or remove information. More particularly, the user can insert new records into a table, delete records from a table, change values in a field, and find records in a table. A user may ask several different types of questions about the information stored in the DBMS. For example, a user can query a single table, or join two or more tables for a single query. In addition, the user can employ queries to define groups and sets of information within his or her tables and to perform calculations and comparisons on those sets. All told, there is no limit to the complexity with which a query may be constructed.
Perhaps the most popular method for querying a database is Query-by-example (QBE), which allows a user to ask questions and perform query operations by constructing simple examples. Instead of spending time figuring out how to perform the query, the user simply gives the system an example of the result he or she is looking for. In turn, the DBMS automatically determines how to get the answer or perform the operation. Since QBE is interactive and iterative, its queries are highly flexible. If a query does not quite obtain the desired result, for instance, the user can easily fine tune the query and run it again. By constructing queries which build on each other, the user can play "what if" scenarios with his or her data.
Whether performed through a query interface or manually (i.e., command-line or script driven), the basic operation of a query centers around the definition of search criteria or conditions. Search criteria tell the DBMS which records the user wants selected (e.g., for including in a resulting view or answer table). Search criteria are like filter or tests that select records based on entries in one or more fields. For instance, a simple selection condition could include state equal to "CA" (e.g., query expression of STATE="CA"); this condition instructs the DBMS to include only those records with the state field entry equal to "CA".
Many types of selection conditions may be specified. The user can create, for example, selection conditions for alphanumeric, numeric, dollar, and date fields. In addition to specifying exact-match conditions (e.g., AGE=59), users may select records that have entries which fall within a certain range. For example, a user may wish to locate all records meeting a particular range (e.g., AGE&gt;59). This is particularly helpful when one is selecting records based on entries in numeric, dollar, or date fields.
In addition to specifying a single criterion, a user may specify a plurality of criteria. Multiple criteria require a matching record to meet either of two or more tests (logical "OR", such as LASTNAME="JONES" .OR. FIRSTNAME="KEVIN"), or to simultaneously meet all of two or more conditions (logical "AND", such as LASTNAME="JONES" .AND. FIRSTNAME="KEVIN"). Also, criteria may be specified having various combinations of AND and OR conditions, including nesting of conditions, such as LASTNAME="JONES" .AND. (FIRSTNAME="KEVIN" .OR. FIRSTNAME="JOHN").
It is known in the art to employ one or more index files when searching for information stored in a database file. A typical index file stores for each record in the database file a key expression--data derived from one or more fields of the record (e.g., "JONES" derived from a last name field)--together with a record pointer (e.g., record number) for the record. Index files are usually organized into a format which is efficient for searching (e.g., B-tree), ordered according to the key expression, such as alphabetically. Because information about records may be determined without examining the underlying records themselves, the index may greatly speed up searching (querying) for and sorting of information.
The use of indexing for satisfying searches may be further enhanced by simple use of bitmaps, as has been employed by various DBMS vendors. Atypical implementation is described in SYSTEM AND METHODS FOR INFORMATION RETRIEVAL, International Application No. PCT/US91/07260, International Publication No. WO 92/06440, Apr. 16, 1992, which is hereby incorporated by reference. General operation of such systems will be briefly reviewed.
Consider a query expression of LASTNAME="FOX" .AND. AGE=80, with an existing index on LASTNAME and another index on AGE. The query may be "optimized" by recognizing that existing indices are available for satisfying each logical clause, and that query matches or hits from these indices may be conveniently mapped into one or more solution bitmaps. In operation, the first index may be scanned to determine a first subset of records satisfying the first half (first logical clause) of the query (i.e., LASTNAME="FOX"). The second index may be scanned to determine a second subset of records satisfying the second half (second logical clause) of the query (i.e., AGE=80).
The solution bitmaps or "bitmasks" are constructed as follows. A first temporary solution bitmask is constructed during scanning of the first index. Specifically, if the search criteria is satisfied by a particular record's key value stored in the index file, the corresponding bit (i.e., nth bit for nth record) in the temporary solution bitmask is set. In a like manner, as the second index is scanned a second temporary solution bitmask is constructed, where the corresponding bit in the second temporary solution bitmask is set if the search criteria is satisfied by a particular key value in the index file. By combining the two bitmasks (bitwise operation of "ANDing" the two bit sets), a final solution is ascertainable, all without having accessed the underlying physical records.
Although simple bitmap optimization techniques, such as the foregoing, may improve searching efficiency under appropriate circumstances, the techniques have pronounced shortcomings. In particular, real-world queries (i.e., ones most likely to be useful in everyday, normal user operation) often demand searches on conditions which are "non-optimizable" under these systems. Consider, for example, the query LASTNAME="FULTON" .AND. STATE="CA" specified by a user at runtime. To apply optimization, prior art techniques require at least one pre-existing index which directly supports the query. For the foregoing query, such systems would require at least one index on LASTNAME or on STATE. And to "fully optimize" the search (i.e., apply bitmap optimization), such indices must directly support its corresponding logical clause. That is, for the clause LASTNAME="FULTON" the index must be only on LASTNAME--INDEX ON LASTNAME, but not INDEX ON LASTNAME+FIRSTNAME. Such an approach ignores the realities of everyday use situations where users specify ad hoc query conditions for solving real-world data processing problems.
Most often in a given system, indices will not be available which directly support the query conditions that a user has specified. In the above example, for instance, an index for alphabetizing a customer list would more likely be a composite index on all name fields (e.g., INDEX ON LASTNAME+FIRSTNAME+MI). An index on last name by itself would probably not be as useful to maintain active at all times, unless the database programmer knew before what the ad hoc queries were likely to be. But even if such information were known beforehand, it is still not very practical to keep an active index for every field which is likely to be queried--updating a multitude of indices for a large database table incurs a substantial performance penalty. All told, the design limitations of prior art systems restrict those systems to optimizing a small subset of the queries which would actually be useful to end-users.
What is needed are system and methods which intelligently process all available indices for satisfying a query, including ones which may not directly support the query expression under examination. The present invention fulfills this and other needs.