1. Field of the Invention
The present invention relates generally to data storage and retrieval techniques, and more particularly to a system and method for rapidly identifying the existence and location of an item in a file.
2. Description of Background Art
In many computer-related applications, it is useful to rapidly identify whether or not a particular item exists in a stored file, database, or table. For example, one such application involves an implementation of a content directory of World Wide Web sites, including listings of Uniform Resource Locators (URLs) identifying on-line documents. It may be useful for a user or automated software application to identify whether or not a particular URL is listed in a particular content directory.
Mechanisms for searching multiple pieces of text-based information in a document space such as the World Wide Web generally fall into one of two types. The first type of search mechanism involves providing a text string to a search engine, which then retrieves a descriptor or identifier for any document containing the specified text string. Various combinations of text strings and Boolean operators may be provided to implement more complex searches. However, the literal nature of such text-based searches often results in retrieval of documents that are unrelated to the intended meaning or context of the search terms. For example, a search for information on lions using the word “lion” as a search term may result in retrieval of documents describing the motion picture “The Lion King”, community service clubs such as the “Lions Club”, and other documents unrelated to the intended object of the search.
The second type of search mechanism is a category search, which employs a category directory describing a hierarchy of information categories. The search is performed by traversing the hierarchy to successively narrower categories until the desired set of documents is reached. Therefore, a search for information on lions might begin with a broad category of “science”, then proceed down the hierarchy to “biology”, “zoology”, “mammals”, and so forth. This approach tends to lessen or eliminate the above-described problem endemic to literal text-based searches. However, if one desires to search for information on lions within the subcategory “science/biology/zoology/mammals”, and if no explicit “lions” subcategory exists, one must manually search through all document titles under that subcategory looking for documents related to lions.
What is needed is a mechanism for rapidly determining, for each result of a text-based search, whether the indicated result is listed in a particular category directory representing a desired subject area.
Alternatively, there may be other applications in which it is useful to rapidly determine whether or not an item exists in a stored file. In some situations, the existence of the item is known, but the location may be unknown. In other situations, it may be unknown whether or not the item exists.
Conventional data storage techniques such as sorting, binary tree searching, or traversing may be successful in performing the desired identifying and locating operations, but are often too slow for effective use in real-time environments.
One known technique for reducing search time is hashing, as described in D. Knuth, The Art of Computer Programming, vol. 3, Addison-Wesley: 1973. A hash table is constructed for storing pointers to a master file containing the stored items. The hash table can be of arbitrary size, and contains some number of buckets, each bucket containing some number of entries. For example, 2^16, or 65,536, buckets may be included, each bucket containing up to 32 entries. Each entry is typically a fixed-length (for example, 32-bit) pointer to a specific location in a master file containing stored items.
When an item is added to the master file, a pointer to the item is added to the hash table, as follows. A hash function is applied to the item to obtain a hash key. The hash function may be any operation that may be performed on the item, and that preferably results in a relatively even distribution of items among all buckets in the hash table. For example, one such hash function is to perform successive exclusive OR operations on the characters forming the character string of the item. A particular bucket is identified by the obtained hash key, and an entry containing a pointer to the item in the master file is added to the identified bucket.
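The insertion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `xor_hash`, `add_item`, `master_file`, and `hash_table` are hypothetical, the master file is modeled as an in-memory list, and the table is limited to 256 buckets because a plain byte-wise exclusive OR can only produce values from 0 to 255.

```python
NUM_BUCKETS = 256  # assumption: byte-wise XOR yields only 0..255

# Stand-ins for the master file and the hash table (hypothetical names).
master_file = []                                # stored items, by location
hash_table = [[] for _ in range(NUM_BUCKETS)]   # each bucket holds pointers


def xor_hash(item):
    """Successively XOR the bytes of the item's character string."""
    key = 0
    for byte in item.encode("utf-8"):
        key ^= byte
    return key % NUM_BUCKETS


def add_item(item):
    """Append the item to the master file and record a pointer to it
    in the bucket identified by the hash key."""
    location = len(master_file)        # pointer into the master file
    master_file.append(item)
    hash_table[xor_hash(item)].append(location)
    return location
```

Because exclusive OR is commutative, strings containing the same characters in a different order land in the same bucket, which is one reason a practical hash function would usually be chosen for a more even distribution.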
In order to determine whether a particular item exists in the file, the hash function is applied to the search term in order to identify a bucket. The identified bucket is then traversed. For each entry in the bucket, the referenced location in the master file is consulted and the stored item is compared to the search term. If a match is found, the traversal ends and a positive result is returned. If the location of the item is desired, it may also be returned. If all entries in the bucket are checked without finding a match, a negative result is returned.
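The conventional lookup just described might look like the sketch below. It is illustrative only: `build`, `find`, and `xor_hash` are hypothetical names, the master file is an in-memory list, and a 256-bucket table is assumed so that a byte-wise exclusive OR can address every bucket.

```python
def xor_hash(item, num_buckets=256):
    """Successively XOR the bytes of the item's character string."""
    key = 0
    for byte in item.encode("utf-8"):
        key ^= byte
    return key % num_buckets


def build(items, num_buckets=256):
    """Return (master, table): the stored items and a table of buckets,
    each bucket holding pointers (list indices) into the master list."""
    master = list(items)
    table = [[] for _ in range(num_buckets)]
    for location, item in enumerate(master):
        table[xor_hash(item, num_buckets)].append(location)
    return master, table


def find(term, master, table):
    """Return the location of term in the master list, or None."""
    bucket = table[xor_hash(term, len(table))]   # one hash-table read
    for location in bucket:
        if master[location] == term:             # one master-file read each
            return location                      # positive result
    return None                                  # negative result


master, table = build(["ab", "ba", "cd"])
print(find("ba", master, table))   # location of "ba"
print(find("dc", master, table))   # same bucket as "cd", but no match
```

Note that every entry in the bucket costs a comparison against the master file, whether or not the item is ultimately found.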
The above-described conventional technique for identifying the existence and location of an item in a file is relatively slow because it requires many reads from the hash table and from the master file. On average, a positive result requires approximately 1+N/2 reads, where N is the average number of entries in each bucket (one read from the hash table, and a number of reads from the master file equal to roughly half the size of the identified bucket before the match is found). A negative result requires that all the entries in the bucket be consulted, and therefore requires an average of 1+N reads. This large number of reads severely impacts the performance of a system that uses a conventional hash table to determine the existence of an item.
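The read counts above can be checked with a short calculation. The sketch below assumes that a matching entry is equally likely to sit at any position in a bucket of N entries; the function names are hypothetical. Under that assumption the exact average for a positive result is 1+(N+1)/2, for which the figure of 1+N/2 is a close approximation when N is large.

```python
def average_hit_reads(n):
    """Average reads for a positive result in a bucket of n entries:
    one hash-table read plus k master-file reads when the match is
    the k-th entry, averaged over k = 1..n."""
    return sum(1 + k for k in range(1, n + 1)) / n


def miss_reads(n):
    """Reads for a negative result: one hash-table read plus one
    master-file read for every entry in the bucket."""
    return 1 + n


# Example with the 32-entry bucket size mentioned earlier.
print(average_hit_reads(32))   # 17.5, close to 1 + 32/2 = 17
print(miss_reads(32))          # 33
```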
In addition, the above-described technique does not allow for optimized or improved performance in certain special cases, such as where the existence of an item is known but not its location. Whether or not such existence is known, the same traversal operations must be performed as described above.
What is needed is a system and method for determining the existence of an item in a file in a rapid and efficient manner. In addition, what is needed is a system and method for determining the location of an item in a file in a rapid and efficient manner, and which is capable of being optimized for improved performance in special cases.