1. Field of the Invention
The present invention relates generally to data storage and retrieval techniques, and more particularly to a system and method for rapidly identifying the existence and location of an item in a file.
2. Description of Background Art
In many computer-related applications, it is useful to rapidly identify whether or not a particular item exists in a stored file, database, or table. For example, one such application involves an implementation of a content directory of World Wide Web sites, including listings of Uniform Resource Locators (URLs) identifying on-line documents. It may be useful for a user or automated software application to identify whether or not a particular URL is listed in a particular content directory. Mechanisms for searching multiple pieces of text-based information in a document space such as the World Wide Web often take one of two types. The first type of search mechanism involves providing a text string to a search engine, which then retrieves a descriptor or identifier for any document containing the specified text string. Various combinations of text-strings and Boolean operators may be provided to implement more complex searches. However, the literal nature of such text-based searches often results in retrieval of documents that are unrelated to the intended meaning or context of the search terms. For example, a search for information on lions using the word “lion” as a search term may result in retrieval of documents describing the motion picture “The Lion King”, community service clubs such as “Lion's Club”, and other documents unrelated to the intended object of the search.
The second type of search mechanism is a category search, which employs a category directory describing a hierarchy of information categories. The search is performed by traversing the hierarchy to successively narrower categories until the desired set of documents is reached. Therefore, a search for information on lions might begin with a broad category of “science”, then proceed down the hierarchy to “biology”, “zoology”, “mammals”, and so forth. This approach tends to lessen or eliminate the above-described problem endemic to literal text-based searches. However, if one desires to search for information on lions within the subcategory “science/biology/zoology/mammals”, and if no explicit “lions” subcategory exists, one must manually search through all document titles under that subcategory looking for documents related to lions.
What is needed is a mechanism for rapidly determining, for each result of a text-based search, whether the indicated result is listed in a particular category directory representing a desired subject area.
Alternatively, there may be other applications in which it is useful to rapidly determine whether or not an item exists in a stored file. In some situations, the existence of the item is known, but the location may be unknown. In other situations, it may be unknown whether or not the item exists.
Several search techniques exist in the prior art for determining whether a particular record is stored in a master file, and for obtaining the address of the location where the record is stored. For example, the master file may be traversed in its entirety, or it may be sorted, or a binary tree search may be performed. Such techniques are time-consuming, and may involve excessive overhead in maintaining the master file.
One known technique for reducing search time is hashing, as described in D. Knuth, The Art of Computer Programming vol. 3, Addison-Wesley: 1973. Referring now to FIG. 2, there is shown a block diagram of a hash table architecture according to the prior art. Master file 205, which is typically stored in a data storage device such as a hard drive or other long-term storage, contains a number of data records 223, 224, 225, 226, 227, 228. Records 223, 224, 225, 226, 227, 228 contain any type of information that may be retrieved for use by a user or by a computer system. Each record 223, 224, 225, 226, 227, 228 is stored at a particular location having a specific address, so that a record may be retrieved from master file 205 in a conventional manner by reference to the address of the record. Any number of records 223, 224, 225, 226, 227, 228 may be included in master file 205.
Hash table 204 is constructed and stored, for example in data storage such as a hard drive or other storage device. Hash table 204 can be of arbitrary size, and contains some number of hash buckets 211, 212, 213, each bucket containing some number of entries. Each entry contains a fixed-length, for example 32-bit, pointer 217, 218, 219, 220, 221, 222 to an address indicating a particular location in master file 205. In the example of FIG. 2, pointer 219 points to the address of a location in master file 205 containing record 225, while pointer 220 points to the address of a location in master file 205 containing record 227.
Any number of hash buckets 211, 212, 213 may be provided in hash table 204, and any number of entries, or pointers 217, 218, 219, 220, 221, 222 can be provided in each hash bucket 211, 212, 213. For example, 65,536 buckets 211, 212, 213 may be included, each bucket containing up to 32 entries.
Each hash bucket 211, 212, 213 is associated with a hash key 214, 215, 216 that can be obtained by applying hash function 202 to an item to be stored or retrieved. Hash function 202 may be any operation that can be performed on the item, and preferably is an operation that results in a relatively even distribution of items among buckets 211, 212, 213 in hash table 204. For example, one such hash function 202 involves performing successive exclusive-OR operations on the characters forming the character string of the item. This results in a 16-bit hash key that is capable of uniquely identifying 2^16, or 65,536 different hash buckets 211, 212, 213.
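One possible form of such an exclusive-OR hash function is sketched below for illustration. The text does not specify how characters are combined into a 16-bit key; this sketch assumes the item is encoded as UTF-8 and that bytes are alternately XORed into the high and low halves of the key. The name xor_fold_hash16 is hypothetical.

```python
def xor_fold_hash16(item: str) -> int:
    """Fold the UTF-8 bytes of `item` into a 16-bit key by successive XOR.

    Assumption: even-indexed bytes are XORed into the high 8 bits and
    odd-indexed bytes into the low 8 bits. The source text only says that
    successive exclusive-OR operations are performed on the characters.
    """
    key = 0
    for index, byte in enumerate(item.encode("utf-8")):
        shift = 8 if index % 2 == 0 else 0
        key ^= byte << shift
    return key & 0xFFFF  # always in the range 0..65535


# The key selects one of 2**16 buckets.
NUM_BUCKETS = 1 << 16
bucket_index = xor_fold_hash16("http://example.com/") % NUM_BUCKETS
```

A plain XOR fold of this kind is simple but may distribute similar strings unevenly; it is shown only because the text names it as an example.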
When a new record containing an item is added to master file 205, a pointer to the record is added to hash table 204. The pointer is added to the appropriate hash bucket, determined by applying hash function 202 to the value of the new item. The new pointer in the hash bucket contains an address indicating the location in master file 205 of the new item.
In order to determine whether a particular item exists in master file 205, a search term 201 is supplied containing a text string or other identifier for the desired record. In the example of FIG. 2, search term 201 indicates the data represented by record 227. Hash function 202 is applied to search term 201 in order to obtain hash key 203. Hash bucket 212 containing the identical key 215 to the obtained hash key 203 is identified.
Bucket 212 is then traversed. For each item in bucket 212, the referenced location in master file 205 is consulted and the stored item is compared to search term 201. If a match is found, the traversal ends and a positive result is returned. If the location of the item is desired, it may also be returned. If all items in bucket 212 are checked without finding a match, a negative result is returned.
Thus, in the example of FIG. 2, pointer 219 is dereferenced and the corresponding record 225 in master file 205 is consulted. Record 225 is compared with search term 201, and no match is found. Pointer 220 is then dereferenced and the corresponding record 227 in master file 205 is consulted. Record 227 is compared with search term 201, and a match is found. A positive result is returned, along with the address of record 227 or the data contained therein, as desired.
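The prior art lookup just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of FIG. 2: the master file is modeled as an in-memory list whose indices serve as addresses, and the class names, the reads counter, and the byte_sum hash are all hypothetical.

```python
from typing import Callable, List, Optional


class MasterFile:
    """Stand-in for master file 205; an address is simply a list index here."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.reads = 0  # counts simulated reads from the master file

    def append(self, item: str) -> int:
        self.records.append(item)
        return len(self.records) - 1

    def read(self, address: int) -> str:
        self.reads += 1
        return self.records[address]


class PriorArtHashTable:
    """Buckets hold bare pointers, so each candidate costs a master-file read."""

    def __init__(self, master: MasterFile, num_buckets: int,
                 hash_fn: Callable[[str], int]) -> None:
        self.master = master
        self.hash_fn = hash_fn
        self.buckets: List[List[int]] = [[] for _ in range(num_buckets)]

    def _bucket(self, item: str) -> List[int]:
        return self.buckets[self.hash_fn(item) % len(self.buckets)]

    def add(self, item: str) -> int:
        address = self.master.append(item)
        self._bucket(item).append(address)
        return address

    def find(self, term: str) -> Optional[int]:
        # Dereference each pointer in turn and compare the stored record.
        for address in self._bucket(term):
            if self.master.read(address) == term:
                return address
        return None  # whole bucket checked without a match


def byte_sum(item: str) -> int:
    """Hypothetical placeholder hash; any function of the item would do."""
    return sum(item.encode("utf-8"))
```

With a single bucket holding three records, finding the second record costs two master-file reads, and a search for an absent item costs three, one per entry in the bucket.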
The prior art technique of FIG. 2 for identifying the existence and location of an item in a file is relatively slow because it requires a relatively large number of reads from hash table 204 and from master file 205. For a worst-case positive result, all pointers in the identified bucket must be dereferenced and compared with search term 201 before a match is found. For an average positive result, half of the pointers in the identified bucket must be dereferenced and compared. Therefore, on average, a positive result requires one read from hash table 204, plus N/2 reads from master file 205, where N is the average number of entries in each hash bucket of hash table 204. For a negative result, all pointers in the bucket must be consulted in order to rule out a match, so that a negative result requires an average of 1+N reads. The large number of reads required to implement a conventional hash table for determining the existence of an item severely impacts the performance of a system employing this technique.
In addition, the above-described technique does not allow for optimized or improved performance in certain special cases, such as where the existence of an item is known but not its location. Whether or not such existence is known, the same traversal operations must be performed as described above.
What is needed is a system and method for determining the existence of an item in a file in a rapid and efficient manner. In addition, what is needed is a system and method for determining the location of an item in a file in a rapid and efficient manner, and which is capable of being optimized for improved performance in special cases.
The present invention provides a system and method of identifying the existence and location of an item in a file in a rapid and efficient manner. The present invention minimizes the number of reads that are performed when identifying such information. In addition, the present invention is capable of being optimized for improved performance in special cases, such as when the existence of an item is known and its location is sought.
A hash table of arbitrary size is constructed, containing some number of buckets, each bucket containing some number of entries. Each entry contains two portions: a first portion containing a pointer to a specific location in a master file containing stored items, and a second portion containing a value of a secondary hash function, as will be described below. This secondary hash function is employed to rapidly determine whether an item exists in the file and to identify the location of the item without requiring an undue number of reads from the master file.
When an item is added to the master file, a pointer to the item is added to the hash table, as follows. A primary hash function is applied to the item to obtain a primary hash key. A particular bucket is identified by the obtained primary hash key, and an entry containing a pointer to the item in the master file is added to the identified bucket.
A secondary hash function is applied to the item to obtain a secondary hash key. The secondary hash function is preferably independent of the primary hash function. The secondary hash key is stored in the hash table as a second portion of the hash table entry.
In order to determine whether a particular item exists in a master file, the primary hash function is applied to the search term to identify a bucket. The secondary hash function is applied to the search term and the determined secondary hash key is compared with the secondary hash keys for the entries stored in the identified bucket. If no match is found, a negative result is obtained. If one or more matches are found, the master file is consulted for each of the matches and the stored item is compared to the search term. The master file need not be consulted for nonmatching entries, since it is known that such entries do not match the search term. Since the number of matches is generally relatively small compared to the size of the entire bucket, the number of reads from the master file is significantly reduced as compared to the prior art scheme described previously.
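The insertion and lookup steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the master file is an in-memory list, and zlib.crc32 and zlib.adler32 stand in for the primary and secondary hash functions only because both are available in the Python standard library. The text does not name specific functions, and whether these two are sufficiently independent is not established here. All class and function names are hypothetical.

```python
import zlib
from typing import List, Optional, Tuple

Entry = Tuple[int, int]  # (pointer into master file, secondary hash key)


def primary_hash(item: str) -> int:
    """Assumed primary hash function (selects the bucket)."""
    return zlib.crc32(item.encode("utf-8"))


def secondary_hash(item: str) -> int:
    """Assumed secondary hash function, truncated to 16 bits."""
    return zlib.adler32(item.encode("utf-8")) & 0xFFFF


class TwoLevelHashTable:
    def __init__(self, num_buckets: int) -> None:
        self.master: List[str] = []  # stand-in for the master file
        self.master_reads = 0        # counts simulated master-file reads
        self.buckets: List[List[Entry]] = [[] for _ in range(num_buckets)]

    def _bucket(self, item: str) -> List[Entry]:
        return self.buckets[primary_hash(item) % len(self.buckets)]

    def add(self, item: str) -> int:
        self.master.append(item)
        address = len(self.master) - 1
        # Store the pointer together with the item's secondary hash key.
        self._bucket(item).append((address, secondary_hash(item)))
        return address

    def find(self, term: str) -> Optional[int]:
        wanted = secondary_hash(term)
        for address, key in self._bucket(term):
            if key != wanted:
                continue  # cannot match, so the master file is not read
            self.master_reads += 1
            if self.master[address] == term:
                return address
        return None
```

In this sketch, a search for an absent term whose secondary key matches no entry in the bucket returns a negative result with zero master-file reads, while a positive result costs one read per entry whose secondary key matches until the true match is reached.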
Furthermore, in certain special cases the system and method of the present invention may return a location of an item without consulting the master file at all. Specifically, if an item is known to exist in the master file, and its location is sought, and if comparison of the secondary hash key results in a single match in the identified bucket, the single match is known to contain the desired location, and the location may be returned without consulting the master file.
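The special case can be illustrated with the following sketch, which assumes the bucket has already been selected by the primary hash key and that each entry is a pair of pointer and secondary hash key. The function name and the matches_master callback (which stands in for reading the master file and comparing the stored item) are hypothetical.

```python
from typing import Callable, List, Tuple

Entry = Tuple[int, int]  # (pointer into master file, secondary hash key)


def locate_known_item(bucket: List[Entry], secondary_key: int,
                      matches_master: Callable[[int], bool]) -> int:
    """Return the location of an item that is known to exist.

    `matches_master(pointer)` is assumed to read the master file at
    `pointer` and compare the stored item with the search term.
    """
    candidates = [ptr for ptr, key in bucket if key == secondary_key]
    if len(candidates) == 1:
        # Existence is known and only one entry can match: no read needed.
        return candidates[0]
    for ptr in candidates:
        if matches_master(ptr):
            return ptr
    raise LookupError("item was assumed to exist, but no entry matched")
```

When several entries share the same secondary hash key, the sketch falls back to consulting the master file for each candidate, as in the general case.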
Therefore, the system and method of the present invention substantially reduce the number of reads that are performed in identifying the existence and/or location of an item in a file, and thereby improve efficiency and speed of operations using such identifications.
The system and method of the present invention are capable of application to many different types of operations. One such application is to perform a contextual text search, such as for example the identification of URLs falling within the intersection of a full-text search and a category of a content directory. A full-text search may be performed on a search term, and each result can be checked against a category of a content directory using the hashing techniques of the present invention. In this manner, a rapid determination can be made as to the existence and location of URLs falling within the intersection of the full-text search and the specified category of the content directory.