In database systems, it is often desirable to provide efficient, high-speed access and searching capabilities for the stored data. A typical database system provides an index mechanism for accessing records of data in the database without having to search through each element of data stored in each data record. Many database indexing, accessing and searching techniques are in widespread use.
As an example, a database containing records of corporate names and addresses may be indexed by company name. Each record in the database may contain address and other information for one specific company. Records may be sequentially stored in one large file, for example, on a computer readable disk. An ordered index of the company names may also be stored in a table on the disk. Generally, one corporate name entry exists in the index for each record in the database. Because the index is ordered, for example alphabetically, corporate names may be searched based on their position within the index. Each index entry also contains a reference to its corresponding database record.
To search the index, a search name is provided by a user or program. The search name is then compared with company name entries in the index. The search may start, for example, at an index entry for a corporate name beginning with the same first letter as the search name. Index entry names are then compared with the search name. When a match is found, the reference, such as a memory address or record number, provided by the index entry for the matching company name, is followed to obtain the entire company record from the database. Indexing provides a way to access data without having to search every record when looking for specific information in the database.
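The ordered-index lookup described above can be sketched as follows. This is a minimal illustration, not the method of any particular database product: the sample records, the `lookup` function, and the use of a binary search (rather than a scan starting at the first letter) are assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical records, stored sequentially; the list position is the record number.
records = [
    {"name": "Zenith Tools", "address": "9 Mill Rd"},
    {"name": "Acme Corp", "address": "1 Main St"},
    {"name": "Baker Supply", "address": "42 Oak Ave"},
]

# Ordered index: one (company name, record number) entry per record.
index = sorted((rec["name"], num) for num, rec in enumerate(records))
index_names = [name for name, _ in index]


def lookup(search_name):
    """Return the record for search_name, or None if no index entry matches."""
    # Find the position where search_name would sit in the ordered index.
    pos = bisect_left(index_names, search_name)
    if pos < len(index) and index_names[pos] == search_name:
        # Follow the reference stored in the index entry to the full record.
        _, record_number = index[pos]
        return records[record_number]
    return None


print(lookup("Acme Corp"))
```

Only the index is searched; the record file itself is touched once, when the matching reference is followed.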
Tree-based indexes are another form of indexing and searching mechanism. In tree-based index database systems, a common data field in each database record is used as a keyword to create the index. The index is organized as a tree data structure, having a head node where searches begin, and one or more branch nodes referenced from the head node. All other nodes below the head node may also contain one or more branches referring to other nodes. Each index node contains one or more pointers, such as record numbers, to that node's respective data record within the database.
To search the tree index, a search value is provided by a user or program. The search value is then compared with node values beginning with the head node. At each node in the tree, if the search value occurs, for example, alphabetically before the current node's value, one branch may be followed to the next node, but if the search value occurs alphabetically after the current node's value, another branch to a different node may be taken. If the search value and node value are equal, a matching node has been found. The matching node's corresponding database record reference is used to retrieve the matching search data from the database.
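The tree search described above may be illustrated with a simple binary search tree. The sketch assumes two branches per node and one record reference per node; the `Node`, `insert`, and `search` names and the sample company names are hypothetical.

```python
class Node:
    """One index node: a key value plus a reference to its database record."""

    def __init__(self, key, record_number):
        self.key = key
        self.record_number = record_number
        self.left = None   # branch for keys ordered before this node's key
        self.right = None  # branch for keys ordered after this node's key


def insert(node, key, record_number):
    """Insert a key into the subtree rooted at node and return the subtree root."""
    if node is None:
        return Node(key, record_number)
    if key < node.key:
        node.left = insert(node.left, key, record_number)
    elif key > node.key:
        node.right = insert(node.right, key, record_number)
    return node


def search(node, key):
    """Walk down from the head node; return the record number or None."""
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node.record_number
    return None


head = None
for number, name in enumerate(["Mason Ltd", "Baker Supply", "Zenith Tools", "Acme Corp"]):
    head = insert(head, name, number)

print(search(head, "Acme Corp"))
```

Each comparison discards one branch of the tree, so a reasonably balanced tree is searched in a number of steps that grows with the depth of the tree rather than with the number of records.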
The aforementioned database indexing and searching techniques are, however, generally slow and cumbersome when applied to very large databases. These types of indexes often have a one-to-one index entry to record relationship. Search times for a single search value in a one-to-one index, for example, may be on the order of log₂ N, where N is the number of index entries. A search for one value may be relatively quick, but a search for many matching records of many search values quickly becomes time consuming. In databases containing millions of record entries, searching multiple values may require long search times to find many record matches. Long search times are unacceptable for certain types of real-time applications.
Another type of index is the inverted index. In inverted indexes, a single index entry may reference many database records. Inverted indexes have been used successfully to produce large numbers of look-ups in a database. Consider an inverted index used to index a large collection of documents through the words contained in those documents. Each document may be viewed as a database record. In a simple index created from a small dictionary of the English language, about one million different words provide a fairly robust vocabulary, and the inverted index for this vocabulary contains an index entry for each of the one million words. Each word entry in the index references every document in which that word exists. Searching for multiple matches per index entry is generally faster when using inverted indexes, since each index entry may reference many database records.
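The document-indexing example above can be sketched in a few lines. The sample documents and the `build_inverted_index` and `search_all` helpers are hypothetical, and splitting on whitespace is a simplifying assumption about how words are extracted.

```python
from collections import defaultdict

# Hypothetical document collection; each document plays the role of a record.
documents = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick thinking saves the dog",
}


def build_inverted_index(docs):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """Return the ids of documents that contain every word in words."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return set()
    return set.intersection(*postings)


inverted = build_inverted_index(documents)
print(sorted(search_all(inverted, ["quick", "the"])))
```

A single index entry yields every matching document at once, and multi-word queries reduce to set intersections over those entries.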
The inverted index has been used successfully in Marketplace™ software, sold by iMarket, Inc., to search one large database. That software includes an inverted index to selected fields of the large database. Search criteria obtained from a user are applied to the index to locate matching records in the database. To minimize the storage space required by the large inverted index, the records containing each specific term within the index are identified by a bit map for that term. The storage required by the bit map is itself reduced by the use of a range definition for multiple adjacent records containing the same term. Bit map indexes may be used with limiting fields, such as a ZIP code, to select only those records within a specific range of record numbers identified by the bit map. For more information on inverted indexes, the reader may consult "Managing Gigabytes: Compressing and Indexing Documents and Images" by I. Witten, A. Moffat and T. Bell, 1994, Van Nostrand Reinhold, Inc., which is incorporated herein by reference.
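The idea of a per-term bit map reduced to ranges of adjacent records may be sketched as below. This is not the storage format of the software named above, whose details are not given here. The `to_bitmap`, `to_ranges`, and `limit` helpers are hypothetical, and the sketch assumes that a limiting field corresponds to a contiguous range of record numbers.

```python
def to_bitmap(record_numbers):
    """Build a bit map (a Python int) with one bit set per matching record."""
    bitmap = 0
    for n in record_numbers:
        bitmap |= 1 << n
    return bitmap


def to_ranges(bitmap):
    """Collapse runs of adjacent set bits into inclusive (first, last) pairs."""
    ranges = []
    n = 0
    start = None
    while bitmap >> n:
        if (bitmap >> n) & 1:
            if start is None:
                start = n  # a new run of adjacent records begins here
        elif start is not None:
            ranges.append((start, n - 1))
            start = None
        n += 1
    if start is not None:
        ranges.append((start, n - 1))  # close the run that reaches the top bit
    return ranges


def limit(ranges, low, high):
    """Keep only the parts of each range that fall within [low, high]."""
    clipped = []
    for first, last in ranges:
        first, last = max(first, low), min(last, high)
        if first <= last:
            clipped.append((first, last))
    return clipped


term_bitmap = to_bitmap([2, 3, 4, 9])
term_ranges = to_ranges(term_bitmap)
print(term_ranges)
print(limit(term_ranges, 3, 9))
```

When a term occurs in long runs of adjacent records, a handful of range pairs replaces a large number of individual bits.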