1. Field of the Invention
The present invention relates generally to data storage and retrieval techniques, and more particularly to a system and method for rapidly identifying the existence and location of an item in a file.
2. Description of Background Art
In many computer-related applications, it is useful to rapidly identify whether or not a particular item exists in a stored file, database, or table. For example, one such application involves an implementation of a content directory of World Wide Web sites, including listings of Uniform Resource Locators (URLs) identifying on-line documents. It may be useful for a user or automated software application to identify whether or not a particular URL is listed in a particular content directory. Mechanisms for searching multiple pieces of text-based information in a document space such as the World Wide Web often take one of two types. The first type of search mechanism involves providing a text string to a search engine, which then retrieves a descriptor or identifier for any document containing the specified text string. Various combinations of text-strings and Boolean operators may be provided to implement more complex searches. However, the literal nature of such text-based searches often results in retrieval of documents that are unrelated to the intended meaning or context of the search terms. For example, a search for information on lions using the word "lion" as a search term may result in retrieval of documents describing the motion picture "The Lion King", community service clubs such as "Lion's Club", and other documents unrelated to the intended object of the search.
The second type of search mechanism is a category search, which employs a category directory describing a hierarchy of information categories. The search is performed by traversing the hierarchy to successively narrower categories until the desired set of documents is reached. Therefore, a search for information on lions might begin with a broad category of "science", then proceed down the hierarchy to "biology", "zoology", "mammals", and so forth. This approach tends to lessen or eliminate the above-described problem endemic to literal text-based searches. However, if one desires to search for information on lions within the subcategory "science/biology/zoology/mammals", and if no explicit "lions" subcategory exists, one must manually search through all document titles under that subcategory looking for documents related to lions.
What is needed is a mechanism for rapidly determining, for each result of a text-based search, whether the indicated result is listed in a particular category directory representing a desired subject area.
Alternatively, there may be other applications in which it is useful to rapidly determine whether or not an item exists in a stored file. In some situations, the existence of the item is known, but the location may be unknown. In other situations, it may be unknown whether or not the item exists.
Several search techniques exist in the prior art for determining whether a particular record is stored in a master file, and obtaining the address of location where the record is stored. For example, the master file may be traversed in its entirety, or it may be sorted, or a binary tree search may be performed. Such techniques are time-consuming, and may involve excessive overhead in maintaining master file.
One known technique for reducing search time is hashing, as described in D. Knuth, The Art of Computer Programming, vol. 3, Addison-Wesley: 1973. Referring now to FIG. 2, there is shown a block diagram of a hash table architecture according to the prior art. Master file 205, which is typically stored in a data storage device such as a hard drive or other long-term storage, contains a number of data records 223, 224, 225, 226, 227, 228. Records 223, 224, 225, 226, 227, 228 contain any type of information that may be retrieved for use by a user or by a computer system. Each record 223, 224, 225, 226, 227, 228 is stored at a particular location having a specific address, so that a record may be retrieved from master file 205 in a conventional manner by reference to the address of the record. Any number of records 223, 224, 225, 226, 227, 228 may be included in master file 205.
Hash table 204 is constructed and stored, for example in data storage such as a hard drive or other storage device. Hash table 204 can be of arbitrary size, and contains some number of hash buckets 211, 212, 213, each bucket containing some number of entries. Each entry contains a fixed-length, for example 32-bit, pointer 217, 218, 219, 220, 221, 222 to an address indicating a particular location in master file 205. In the example of FIG. 2, pointer 219 points to the address of a location in master file 205 containing record 225, while pointer 220 points to the address of a location in master file 205 containing record 227.
Any number of hash buckets 211, 212, 213 may be provided in hash table 204, and any number of entries, or pointers 217, 218, 219, 220, 221, 222 can be provided in each hash bucket 211, 212, 213. For example, 65,536 buckets 211, 212, 213 may be included, each bucket containing up to 32 entries.
Each hash bucket 211, 212, 213 is associated with a hash key 214, 215, 216 that can be obtained by applying hash function 202 to an item to be stored or retrieved. Hash function 202 may be any operation that can be performed on the item, and preferably is an operation that results in a relatively even distribution of items among buckets 211, 212, 213 in hash table 204. For example, one such hash function 202 involves performing successive exclusive-OR operations on the characters forming the character string of the item. This results in an 16-bit hash key that is capable of uniquely identifying 2.sup.16, or 65,536 different hash buckets 211, 212, 213.
When a new record containing an item is added to master file 205, a pointer to the record is added to hash table 204. The pointer is added to the appropriate hash bucket, determined by applying hash function 202 to the value of the new item. The new pointer in the hash bucket contains an address indicating the location in master file 205 of the new item.
In order to determine whether a particular item exists in master file 205, a search term 201 is supplied containing a text string or other identifier for the desired record. In the example of FIG. 2, search term 201 indicates the data represented by record 227. Hash function 202 is applied to search term 201 in order to obtain hash key 203. Hash bucket 212 containing the identical key 215 to the obtained hash key 203 is identified.
Bucket 212 is then traversed. For each item in bucket 212, the referenced location in master file 205 is consulted and the stored item is compared to search term 201. If a match is found, the traversal ends and a positive result is returned. If the location of the item is desired, it may also be returned. If all items in bucket 212 are checked without finding a match, a negative result is returned.
Thus, in the example of FIG. 2, pointer 219 is dereferenced and the corresponding record 225 in master file 205 is consulted. Record 225 is compared with search term 201, and no match is found. Pointer 220 is then dereferenced and the corresponding record 227 in master file 205 is consulted. Record 227 is compared with search term 201, and a match is found. A positive result is returned, along with the address of record 227 or the data contained therein, as desired.
The prior art technique of FIG. 2 for identifying the existence and location of an item in a file is relatively slow because it requires a relatively large number of reads from hash table 204 and from master file 205. For a worst-case positive result, all pointers in the identified bucket must be dereferenced and compared with search term 201 before a match is found. For an average positive result, half of the pointers in the identified bucket must be dereferenced and compared. Therefore, on average, a positive result requires one read from hash table 204, plus N/2 reads from master file 205, where N is the average number of entries in each hash bucket of hash table 204. For a negative result, all pointers in the bucket must be consulted in order to rule out a match, so that a negative result requires an average of 1+N reads. The large number of reads required to implement a conventional hash table for determining the existence of an item severely impacts the performance of a system employing this technique.
In addition, the above-described technique does not allow for optimized or improved performance in certain special cases, such as where the existence of an item is known but not its location. Whether or not such existence is known, the same traversal operations must be performed as described above.
What is needed is a system and method for determining the existence of an item in a file in a rapid and efficient manner. In addition, what is needed is a system and method for determining the location of an item in a file in a rapid and efficient manner, and which is capable of being optimized for improved performance in special cases.