The present invention relates to search techniques and more particularly to techniques that enable searches to be performed in an efficient manner while minimizing the memory resources required to perform the searches.
Searching is an important and extensively used operation in computer applications. For example, a list of files on a file server may be searched to determine if the list includes a file with a specific filename, a list of uniform resource identifiers (URIs) may be searched to determine if a user-specified URI is in the list, a list of available resources may be searched by an access control application to locate a resource and to determine access rights associated with the resource, and a file's contents may be searched to determine if a particular keyword is included in the contents. Searching is used in several other applications as well.
One sector that has seen a heightened demand for efficient search techniques is the area of electronic commerce activities. Merchants and other entities who provide online commercial services need to use fast and efficient search techniques to be able to respond to customer requests in a timely manner. In order to maximize their profits, online merchants also prefer to use search techniques that require minimal amounts of memory and computing resources so as to minimize the costs associated with the searches. For example, online banking institutions and credit card companies who authorize payments for online commerce activities need to use efficient search techniques to process consumer requests in a timely manner while minimizing costs associated with the searches. Accordingly, there is an increasing demand for search techniques which perform searches in a timely manner while using minimal memory and computing resources.
There are a number of different approaches to searching. According to one approach, searching can be modeled as follows: given a set S comprising “n” elements “k1, k2, . . . , kn” (i.e., S={k1, k2, . . . , kn}) from some domain Σ, and a target or query element k from domain Σ (i.e., k ∈ Σ), searching is a process that determines if target element k is included in set S (i.e., if k ∈ S). The searching process might also include processing to determine the location of the target element in S. Domain Σ can be any arbitrary domain, e.g., the set of integers, the set of real numbers, a set of strings of characters, etc. The set S might manifest itself in various forms, for example, set S might be a collection of files forming a file system, a list of URIs, a list of resources, etc. Each element ki of set S may comprise one or more characters from a character set of domain Σ. Search techniques typically attempt to minimize the time and processing resources needed to determine if k ∈ S.
One method of measuring the efficiency of a search technique is to determine the number of comparisons needed by the search technique to determine if a query element k is included in set S. Since each comparison requires a specific unit of time to be performed, search techniques strive to reduce the number of comparisons required to determine if a query element k is included in set S. In general, the term “comparison” may refer to comparing any two values. A value may correspond to an element of domain Σ comprising one or more characters, a character of an element of domain Σ, and the like. Accordingly, a comparison that compares an element of domain Σ with another element of domain Σ is referred to as an “element comparison.” A comparison that compares a character of an element of domain Σ with a character of another element is referred to as a “character comparison.” An element comparison may involve one or more character comparisons. For example, when a first element is compared with a second element, the comparison may compare individual characters of the first element with characters of the second element. Since each element of Σ can be of arbitrary length (i.e., have a variable number of characters), an element comparison may require more than one “unit of time” to perform.
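The distinction between an element comparison and the character comparisons it involves can be illustrated with a minimal Python sketch. The function name `element_comparison` and the choice to stop at the first mismatching character are assumptions made for illustration only; they are not taken from any particular search technique.

```python
def element_comparison(first, second):
    """Compare two elements character by character.

    Returns (equal, char_comparisons), where char_comparisons is the
    number of character comparisons performed for this one element
    comparison. Hypothetical helper for illustration only.
    """
    char_comparisons = 0
    for c1, c2 in zip(first, second):
        char_comparisons += 1
        if c1 != c2:
            # Stop at the first mismatching character.
            return False, char_comparisons
    # All aligned characters matched; the elements are equal only if
    # they also have the same length.
    return len(first) == len(second), char_comparisons


if __name__ == "__main__":
    print(element_comparison("report.txt", "report.pdf"))
    print(element_comparison("abc", "abc"))
```

In this sketch a single element comparison of two long elements that share a long common prefix costs many character comparisons, which is why an element comparison cannot always be treated as a single unit of time.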
Several conventional search techniques have been developed to solve the search problem. According to one brute-force search technique, the query element k is compared with every element in set S. This technique may require up to “n” element comparisons to perform the search, where n is the number of elements in set S. Accordingly, if n is very large (which is quite often the case), the runtime performance of such a search technique is poor.
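The brute-force technique described above can be sketched in Python as follows. The function name `brute_force_search` and the returned comparison counter are illustrative assumptions, added only to make the bound of n element comparisons visible.

```python
def brute_force_search(S, k):
    """Compare query element k with each element of S in turn.

    Returns (index, element_comparisons); index is None when k is not
    in S. Hypothetical helper for illustration only.
    """
    element_comparisons = 0
    for index, element in enumerate(S):
        element_comparisons += 1
        if element == k:
            return index, element_comparisons
    return None, element_comparisons


if __name__ == "__main__":
    files = ["a.txt", "b.txt", "c.txt"]
    print(brute_force_search(files, "c.txt"))
    print(brute_force_search(files, "z.txt"))
```

An unsuccessful search always performs n element comparisons in this sketch, since every element of S must be examined before the query element can be reported as absent.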
Several other conventional search techniques require that the set S be in sorted order. For example, a binary search technique may be used to determine if k ∈ S provided that the elements of S are in sorted order. The binary search technique requires Θ(log n) element comparisons in the worst case to complete the search (where n is the number of elements in set S). However, the application of such search techniques is quite limited because of the prerequisite that the set of elements to be searched be in sorted order. The costs involved in keeping a data set in sorted order add to the overall cost of the search and render the use of such search techniques impractical in many applications (especially in applications where the data set to be searched is large and there are frequent additions and deletions of elements from the data set, e.g., applications in an electronic commerce environment). As a result, the use of such search techniques is limited.
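A binary search over a sorted list, together with the extra work needed to keep the list sorted as elements are added, can be sketched with the `bisect` module of the Python standard library. The function names `binary_search` and `insert_sorted` are assumptions made for illustration.

```python
import bisect


def binary_search(sorted_S, k):
    """Return True if k is in sorted_S, using O(log n) element comparisons."""
    i = bisect.bisect_left(sorted_S, k)
    return i < len(sorted_S) and sorted_S[i] == k


def insert_sorted(sorted_S, k):
    """Insert k while preserving sorted order.

    Locating the position takes O(log n) comparisons, but shifting the
    following list entries takes O(n) time; this is the maintenance
    cost incurred on every addition.
    """
    bisect.insort(sorted_S, k)


if __name__ == "__main__":
    S = []
    for value in (5, 1, 3):
        insert_sorted(S, value)
    print(S, binary_search(S, 3), binary_search(S, 4))
```

The search itself is fast in this sketch, but each insertion pays for maintaining the order, which illustrates why frequent additions and deletions make sorted-order techniques costly.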
Other search techniques are based upon the assumption that Σ is appropriately restricted, or that set S has a certain distribution on Σ, etc. For example, if Σ={1, 2, . . . , N}, the search technique disclosed in “P. van Emde Boas, R. Kaas, and E. Zijlstra, Design and Implementation of an Efficient Priority Queue, Mathematical Systems Theory 10, 1977, pp. 99-127” can perform the search in O(log log N) time using O(n) total memory. The van Emde Boas et al. technique uses a dictionary (i.e., a data structure that supports insert, delete, and search operations) where each operation takes O(log log N) time. For example, if Σ is the domain of all character strings of length at most 150 (and assuming there are 50 characters in the character set for Σ), then N is approximately 50^150 and the number of comparisons required to perform the search will be about log log 50^150 ≈ 10 (with logarithms taken to base 2), i.e., the search time will be about 10 comparisons. If it is assumed that set S is uniformly distributed in (0, 1), a technique known as “interpolation search” can search in an expected O(log log n) time. However, a disadvantage of these search techniques is that they cannot be applied to any arbitrary domain Σ. Further, these techniques require substantial memory resources to perform the search, and as a result are not very cost effective when the data set to be searched is large.
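The figure of roughly 10 comparisons for a character set of 50 characters and strings of length at most 150 can be checked with a short Python calculation. The sketch assumes base-2 logarithms and approximates the domain size N by 50 raised to the power 150; the function name `log_log_domain_size` is hypothetical.

```python
import math


def log_log_domain_size(alphabet_size, max_length):
    """Return log2(log2(N)) for N = alphabet_size ** max_length.

    Uses log2(N) = max_length * log2(alphabet_size), so the very large
    number N is never constructed. Base-2 logarithms are an assumption.
    """
    log_n = max_length * math.log2(alphabet_size)
    return math.log2(log_n)


if __name__ == "__main__":
    # Roughly 9.7, which rounds to the value of 10 used in the text.
    print(round(log_log_domain_size(50, 150), 2))
```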
Dictionaries may also be defined such that only one comparison is required to determine if k ∈ S. For example, suppose Σ={1, 2, . . . , N} (i.e., |Σ|=N; domain Σ comprises N elements). If a memory of size Ω(N) is available, then a dictionary may be implemented as follows. Label the elements of Σ as 1, 2, . . . , N. An array A[1:N] (i.e., an array “A” comprising N elements) may be configured such that A[b] corresponds to element b in domain Σ. Initially, all the array locations are initialized to zero. Then, for every element ki in set S, A[ki] is set to 1 for 1 ≤ i ≤ n (where n is the number of elements in set S). A determination if k ∈ S may then be performed by determining if A[k]=1 (which indicates presence of the element). While this type of dictionary can accomplish the search in O(1) time, the memory resources required for this technique can be very large, especially if N is large. For example, if Σ is the domain of all character strings of length at most 150, then N will be approximately 50^150, assuming that there are 50 different characters. Accordingly, while the runtime performance of such a technique is optimal, the vast amounts of memory resources required by this technique make it impracticable for most applications.
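The array-based dictionary described above can be sketched in Python for a small integer domain. The function names `build_table` and `contains` are assumptions made for illustration, and the sketch allocates one unused slot at index 0 so that array positions match the labels 1 through N.

```python
def build_table(S, N):
    """Build array A so that A[b] == 1 exactly when b is in S.

    Elements are assumed to be integers labeled 1..N. Index 0 is left
    unused so that positions match the labels.
    """
    A = [0] * (N + 1)  # memory grows with the domain size N, not with n
    for k_i in S:
        A[k_i] = 1
    return A


def contains(A, k):
    """Determine whether k is in S with a single comparison."""
    return A[k] == 1


if __name__ == "__main__":
    A = build_table([2, 5, 7], N=10)
    print(contains(A, 5), contains(A, 6), len(A))
```

The lookup is a single array access and comparison, but the array length depends on N alone. For a domain of about 50^150 elements such an array could not be allocated, which is the memory cost noted above.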
In light of the above, there is a need for search techniques which can perform searches in an efficient manner while minimizing the memory resources required to perform the searches.