The problem of searching has been studied extensively, and the available literature on searching is vast. There are many variations of the search problem, depending on the domain under concern, the type of queries, the application environment, etc. Even though a number of general techniques are available for searching, these may not yield optimal performance for specific applications. It is possible to develop application-specific techniques whose performance exceeds that of the generic techniques. In this document we consider the problem of searching where the domain under concern is the set of (possibly very large) integers.[1] Integer searching is useful in a wide variety of environments including, without limitation, searches for credit card numbers (e.g., for validity checking during a purchase transaction), employee identifiers (e.g., for payroll or other corporate databases), customer identifiers, date searches (e.g., for calendars, computer file management systems), parts number searches (e.g., for inventory control), etc.

[1] As used throughout this patent, integer searching should be understood to include not only direct integer searching (i.e., where the data to be searched is originally represented in integer form) but also indirect integer searching (i.e., where the data to be searched can be converted into integer form, for example and without limitation, by assigning an integer equivalent to one or more alphabetical characters, etc.). It should also be understood that the term integer is not limited to base-10 systems, but is applicable to any other base. For example, a base-10 system would be described using the digits 0–9, while a base-16 system would be described using the digits 0–9 and A–F.
In general, let X represent the data set (or, equivalently, database) to be searched, and x the value to be searched for (e.g., a target value). The problem is to check if x is in X. One consideration of searching techniques is the amount of time required to complete the search.
If X is represented as an array a[1:n] (where n is the size of X), one known (simple) technique scans through every element of X to see whether x matches an element of X. Such a technique takes linear time, i.e., time proportional to n.
Alternatively, if the array is represented in sorted order, one could employ known binary search techniques (see, e.g., E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, W. H. Freeman Press, 1998) to perform the search in a manner requiring logarithmic time—an improvement over linear time.
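The two approaches described so far can be contrasted in a minimal Python sketch. The function names and the sample values are illustrative assumptions, not part of the cited references; the binary search relies on the standard-library `bisect_left` function.

```python
from bisect import bisect_left


def linear_search(a, x):
    """Scan every element of a; up to n comparisons (linear time)."""
    for element in a:
        if element == x:
            return True
    return False


def binary_search(sorted_a, x):
    """Search a list that is already sorted; O(log n) comparisons."""
    i = bisect_left(sorted_a, x)  # leftmost position where x could be placed
    return i < len(sorted_a) and sorted_a[i] == x


# Hypothetical data set X of large integers (e.g., 15- or 16-digit identifiers).
X = [4111111111111111, 5500000000000004, 340000000000009, 6011000000000004]
sorted_X = sorted(X)

print(linear_search(X, 5500000000000004))
print(binary_search(sorted_X, 340000000000009))
print(binary_search(sorted_X, 42))
```

Note that the binary search is only correct when its input is sorted, which is why the sorted copy `sorted_X` is built first.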
Also consider a case where the elements of X can change dynamically (i.e., insertions and/or deletions can happen in X). In such a case, one must also consider the time required for the insert/delete operations. The insert/delete operations include both searching for a location at which the insert/delete is to be performed, as well as the actual insert/delete operation. As stated above, if X is represented as a sorted array, one can achieve logarithmic time performance for the search itself. What about the insert/delete operations? Assuming that the array representing X is of size n, each insert/delete operation will take linear time (i.e., O(n), where O represents an asymptotic upper bound)—worse than logarithmic time. This scheme will be satisfactory when the number of insert/delete operations performed is small, so that the linear time required therefor does not overwhelm the logarithmic time required for search. If this is not the case, one can employ balanced data structures such as a red-black tree, a 2-3 tree, etc. It is known that such data structures accommodate operations, including insert, delete, search, find-min, and find-max, such that each operation takes only O(log n) time to perform (see, e.g., Horowitz et al., supra, or T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1991). Thus, the overall time requirement remains logarithmic.
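The cost asymmetry of the sorted-array scheme can be seen in the following sketch. The class name `SortedArraySet` is a hypothetical illustration: locating the position takes logarithmic time, but the subsequent shift of array elements is what makes each insert/delete linear. A balanced tree (red-black, 2-3, etc.) would avoid that shift; it is not shown here because the Python standard library does not provide one.

```python
from bisect import bisect_left, insort


class SortedArraySet:
    """Dynamic set kept as a sorted array (illustrative sketch only)."""

    def __init__(self):
        self._a = []

    def search(self, x):
        # O(log n): binary search for the position of x.
        i = bisect_left(self._a, x)
        return i < len(self._a) and self._a[i] == x

    def insert(self, x):
        # Finding the slot is O(log n), but insort must shift up to n
        # elements to make room, so the operation as a whole is O(n).
        if not self.search(x):
            insort(self._a, x)

    def delete(self, x):
        # Same pattern: O(log n) to locate, O(n) to close the gap.
        i = bisect_left(self._a, x)
        if i < len(self._a) and self._a[i] == x:
            del self._a[i]
            return True
        return False


s = SortedArraySet()
for value in (30, 10, 20):
    s.insert(value)
```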
If the domain under concern is the set of integers, it is also known that one can employ hashing techniques to solve the search problem. In particular, assume that the elements are integers in the range [1, N].[2] A trivial form of hashing (in which the hashing function is simply the identity function) works as follows. One stores the elements of X as an array a[1:N] of lists. In the list a[1] one stores all the elements of X that have a value of 1, in a[2] one stores all the elements of X that have a value of 2, etc. If X has no repetition of elements, then each list of a will have either no elements or only one element. Thus, to check if x is in X, one simply has to inspect the corresponding list a[x], which takes only O(1) time. Insertions and/or deletions also take O(1) time. Thus, both searching and insert/delete operation times here represent a significant improvement over both linear and logarithmic times.

[2] Or, equivalently, [0, N−1] if one begins counting from zero. Whether to start counting from 0 or 1 is a matter of preference in an actual implementation.
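A minimal sketch of this trivial (identity-function) hashing follows. The class name `DirectAddressTable` is a hypothetical illustration, and index 0 of the array is simply left unused so that the indices match the a[1:N] notation of the text.

```python
class DirectAddressTable:
    """Identity-function hashing over the range [1, N] (illustrative sketch)."""

    def __init__(self, N):
        self.N = N
        # One list per possible value; a[0] is unused so that a[1..N]
        # matches the notation in the text. Memory is proportional to N.
        self.a = [[] for _ in range(N + 1)]

    def search(self, x):
        # O(1): look only at the list a[x].
        return 1 <= x <= self.N and bool(self.a[x])

    def insert(self, x):
        if not 1 <= x <= self.N:
            raise ValueError("x must lie in the range [1, N]")
        if not self.a[x]:  # X is assumed to have no repeated elements
            self.a[x].append(x)

    def delete(self, x):
        if self.search(x):
            self.a[x].clear()


table = DirectAddressTable(100)
table.insert(42)
```

Every operation touches a single list, but the table allocates a slot for each of the N possible values regardless of how many elements are actually stored.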
Besides time requirements, computer implementations of searching techniques also face computer memory limitations. That is, the elements, lists and arrays must be accommodated within the available memory of the computer system. In the trivial hashing scheme mentioned above, the memory required is known to be no more than O(n+N). Where N (and thus n+N) is small, this scheme will work fine. But where N is large, there may not be enough memory in the computer. For example, if the elements of X are 16-digit base-10 numbers, then N = 10^16, and one will need space for 10^16 pointers (among other things). This implies (assuming that each pointer is 4 bytes) a memory requirement of at least 4×10^16 bytes, i.e., 4×10^7 GB, which exceeds the available memory on many currently available computer systems.
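The memory estimate above can be checked with a short worked calculation. It assumes, as the text does, 4-byte pointers, and additionally assumes the decimal convention 1 GB = 10^9 bytes.

```latex
\begin{align*}
N &= 10^{16} \quad \text{(number of distinct 16-digit base-10 values)} \\
\text{memory} &\ge 4~\text{bytes} \times 10^{16} = 4 \times 10^{16}~\text{bytes} \\
&= \frac{4 \times 10^{16}}{10^{9}}~\text{GB} = 4 \times 10^{7}~\text{GB}
\end{align*}
```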
In cases where N (and thus n+N) is prohibitively large, one can use the following, more common, form of hashing. Assume that there is space for an array of L lists, i.e., the memory available is at least Ω(L+n) (where Ω represents an asymptotic lower bound). Then we choose a function (called the “hash function”) h: Σ→[1:L], where Σ represents the domain, and [1:L] represents the range, of the function h. If y is an element of X, it will be stored in the list a[h(y)]. The lists now can have any number of elements from zero to n (n would represent the extreme case where one list contains all the elements, and all other lists are empty), where n is the size of X. To search for a given value x, one scans (perhaps sequentially) through the list a[h(x)] of all elements of X that hash to the same value, h(x), as x. Assuming one has used sequential searching, the search time will be proportional to the number of elements in this list, and the search time in the worst case can be Ω(n).[3] The same is true for insert/delete operations. If there are two elements in the same list of a, these two elements are said to collide. In general there could be any number of collisions. On the other hand, the expected number of elements in any list can be seen to be n/L (under uniformity assumptions in the input space). Thus the expected time needed to perform any of the operations under concern is only O(1), assuming that L ≥ n.

[3] Other search techniques (e.g., 2-3 trees, etc.) could yield better (e.g., logarithmic) search times.
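The common form of hashing described above can be sketched as follows. The class name `ChainedHashTable` and the default hash function h(y) = y mod L are assumptions chosen for illustration (the text does not fix a particular h), and the lists are indexed from 0 to L−1 rather than from 1 to L.

```python
class ChainedHashTable:
    """Array of L lists; element y lives in list a[h(y)] (illustrative sketch)."""

    def __init__(self, L, h=None):
        self.L = L
        # Assumed default hash function: remainder modulo L, range [0, L-1].
        self.h = h if h is not None else (lambda y: y % L)
        self.a = [[] for _ in range(L)]

    def search(self, x):
        # Sequential scan of the one list that x hashes to.
        return x in self.a[self.h(x)]

    def insert(self, x):
        bucket = self.a[self.h(x)]
        if x not in bucket:
            bucket.append(x)

    def delete(self, x):
        bucket = self.a[self.h(x)]
        if x in bucket:
            bucket.remove(x)
            return True
        return False

    def longest_list(self):
        # Worst-case number of elements scanned by a single search.
        return max(len(bucket) for bucket in self.a)


table = ChainedHashTable(L=8)
for value in (3, 11, 19, 4):
    table.insert(value)  # 3, 11 and 19 all hash to list 3, i.e., they collide
```

With L = 8, the values 3, 11 and 19 collide in the same list, so a search for any of them scans up to three elements, while a search for 4 scans one.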
The performance of the hashing method described above depends heavily on the characteristics of the chosen hash function h. For example, if the hash function is overly complex, the computer system might take a very long time to compute h(x). Conversely, if the hash function spreads the elements of X relatively evenly across the lists of a, each list stays short, which reduces the searching time.
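The effect of the hash function on list lengths can be illustrated with a small experiment. Both hash functions below, the helper name `list_lengths`, and the key set are hypothetical choices made only to show the two extremes; neither is the technique disclosed in this document.

```python
def list_lengths(keys, L, h):
    """Return how many of the keys fall into each of the L lists under h."""
    lengths = [0] * L
    for y in keys:
        lengths[h(y)] += 1
    return lengths


L = 100
keys = range(1000)  # n = 1000 hypothetical keys

# Degenerate hash function: every key lands in list 0.
constant_lengths = list_lengths(keys, L, lambda y: 0)

# Remainder modulo L: these consecutive keys spread evenly over all lists.
modulo_lengths = list_lengths(keys, L, lambda y: y % L)

print(max(constant_lengths))  # longest list under the degenerate function
print(max(modulo_lengths))    # longest list under the modulo function
```

Under the degenerate function a sequential search may scan all n elements, whereas under the evenly spreading function each list holds n/L elements.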
We disclose herein various embodiments and aspects of an improved search scheme, based on a new hashing technique, that can be tailored to strike a desired balance between the computer memory available and the search time required.