1. Field of the Invention
The present invention relates to indexing data and, more particularly, to a method and apparatus wherein bit vector indexing is used to index data such as record data in a database.
2. Description of the Related Art
A DBMS (Database Management System) is used to manage data and is comprised of computer-executable code that may be used to define structure of data and to access data within the defined structure. One example of a DBMS is a relational DBMS, or RDBMS. An RDBMS manages tables that make up a relational database as well as the data contained in the tables. In an RDBMS, data is organized in rows (or records) and columns (or fields) of the tables, and two or more tables may be related based on like data values. The intersection of a row and column in a table is referred to as a cell and contains the data value for a particular field of a particular record.
A DML (data manipulation language) such as SQL (Structured Query Language) is typically used to store, retrieve and modify data in a table. A schema defines the structure of a database, i.e., each table and the fields within a record of a table. A schema is itself considered data that is stored in one or more tables. Therefore, like other data in a database, a DML may be used to store, retrieve and modify the data in the database as well as the structure of a database.
There are performance issues with respect to data access in a DBMS particularly when the database is very large. A typical RDBMS is optimized for certain types of query access. However, performance degrades when the database is very large, when a query returns a large set of records, when row selection criteria apply across multiple fields and tables, or when interactively browsing large sets of query results. Value limiting on a lookup table reduces the set of lookup values by eliminating from the set of all possible lookup values those values that do not correspond to any records in the primary table. Another problem is that value limiting cannot efficiently and quickly be done using the standard mechanisms of an RDBMS.
An RDBMS uses indexes to assist in performing queries and quickly locate records in the database. Indexes store one or more field (or column) values from each record as a unique key for the record. Indexes do an adequate job of speeding up the query process even on large databases when the row selection criteria include constraints on only a single field and when the query results do not need to be browsed interactively. In particular, an RDBMS can quickly search for and retrieve an individual record from among even millions of records based on the value of an indexed field.
Unfortunately, however, there are a number of shortcomings to searching using conventional forms of indexing.
For example, if the number of records in the database is very large, the index itself can become large as well, so that as a practical matter it will be stored on disk rather than in memory, moderately increasing the time necessary to search for records.
Further, if the row selection criteria includes constraints for multiple fields, the speedup only applies to each field individually. The problem of then reconciling the multiple sets of query results, one for each constraint, into a single set of query results for all the constraints, requires complex algorithms and heuristics that can dramatically increase the time necessary to execute the query. In fact, the time required grows geometrically with the number of records in the individual result sets, which means that query constraints that return very large sets of records reduce system performance.
A further disadvantage of conventional indexing relates to situations in which a user wishes to interactively browse query results which results in performance degradation. Most relational database management systems support interactive browsing using cursors and temporary files, which require that the entire set of query results first be written to disk before browsing, a very slow operation compared to memory access. Moreover, the set of query results must be accessed and written to disk in its entirety even if only a small subset of the records will ever be brought into view. Again, if the set of query results is very large, the operation of writing them to disk can take a long time.
Also, if a user chooses to build up a query interactively and iteratively by adding one constraint at a time and viewing intermediate query results along the way, the entire query process needs to be repeated from scratch as each additional constraint is added to the query, each time incurring all of the overhead of each of the steps of reading the index, applying multiple row selection criteria, reconciling query results, and writing them to disk.
A final problem arises when attempting to perform value limiting. Value limiting allows the system to present the user with lists of values for search selections that always correspond to records in the primary table, preventing the user from making search selections that lead to no records found. Unfortunately, a typical RDBMS can only accomplish this process through complex, multi-table joins (i.e., a join combines information from two tables by performing a lookup on every record of the primary table) that cannot usually be done quickly enough to provide an acceptable response time in an interactive environment. A lookup uses a pair of matching columns from two tables, taking the value of the column for a single record in the first primary table to “look up” additional information in a single corresponding record in the second lookup table. As a result, value limiting is impractical when performing interactive and iterative searches because the value limiting would have to be done across multiple lookup tables again and again.
Thus, it would be beneficial to have a mechanism to more efficiently index data in data records such as those stored in a database.