The invention relates to a method of searching among present values in a list with a monotonous ordering for a plurality of entered values, chunks of the list being labeled by respective chunk labels being stored in a sparse index and being representative of respective ranges of the present values contained in the respective corresponding chunks, the method comprising the steps of:
(1) searching the sparse index for a particular hit chunk label representative of a range of values to which a particular entered value belongs;
(2) making a particular hit chunk corresponding to the particular hit chunk label accessible;
(3) searching the particular hit chunk for the particular entered value.
The invention further relates to a system comprising:
(1) a background memory for storing a list of present values with a monotonous ordering;
(2) a foreground memory for storing a sparse index to the list and for providing chunk buffers for keeping chunks of the list;
(3) a query control section for receiving a plurality of entered values;
(4) transfer means for transferring hit chunks from the background memory to the foreground memory;
(5) search means for searching in the foreground memory for particular present values matching the entered values.
any present value in the list needs to be inspected at most once, because a next entered value to be searched for will occur at or after that position, as it occurs later in the ordering;
the hit chunks are searched in the order in which they are present in the list. When the chunks are stored in a storage medium comprising mechanical positioning means, such as a hard disk, the necessary transfer of the hit chunks to the main memory for the actual searching requires minimal accumulated positioning distance and therefore minimal accumulated positioning time;
The method as defined in the preamble is used in several systems. One example is a spelling checker, which checks whether entered values (the words of a text) are contained in a list of present values (a dictionary) held in the system. In this example, the list contains no information other than the present values themselves, which is sufficient for the function performed by the spelling checker. The monotonous ordering of the list means that its constituent present values are arranged such that each successive present value can be said to be, in some sense, larger (or smaller) than its predecessor. For strings, alphabetical ordering is an example of such an ordering.
In other systems in which the method is used, further information about a present value is stored either along with the present value in the list itself, thus forming a complete data item, or in one or more different locations elsewhere in a storage hierarchy, the present value in the list then being accompanied by one or more pointers or other reference means to these particular locations. Combinations of these two options are also possible.
An example of such a system is a patent retrieval system. In such a system one can enter keywords, whereupon the serial numbers of the relevant patent specifications, if any, are output, together with abstracts thereof or even the patent specifications themselves. Referring to the preamble, in this example the present values of the list are the keywords available in the system, and the entered values are the entered keywords. The sets formed by the present values and the entered values can overlap fully, partly, or not at all. Each present value is accompanied by one or more pointers to storage locations where relevant patent specifications are stored, or by a small list of serial numbers of relevant patent specifications. Numerous other data structures can be envisaged.
In any such system, the list provides a kind of main index to the data items forming a data set. Therefore, when dealing with such systems we will use the term "main index" instead of the more general term "list". The main index can be constructed by extracting the present values from the data items automatically when the latter are entered in the system, or by entering them separately by hand. The main index allows the system to determine, in response to an entered value, the presence of relevant data items by searching the main index rather than parsing the data items themselves. This is particularly advantageous when the data set is large.
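As an illustration of the automatic construction described above, the following is a minimal sketch, assuming each data item is identified by a serial number and already carries a list of keywords. The function name `build_main_index` and the serial numbers are hypothetical placeholders, not taken from any actual system.

```python
from collections import defaultdict


def build_main_index(data_items):
    """Build a sorted main index from data items.

    data_items maps a serial number to the keywords (present values)
    extracted from that item. The result is a list of
    (keyword, [serial numbers]) pairs in alphabetical keyword order,
    i.e. a list with a monotonous ordering.
    """
    postings = defaultdict(list)
    for serial, keywords in data_items.items():
        for keyword in keywords:
            postings[keyword].append(serial)
    return sorted((kw, sorted(serials)) for kw, serials in postings.items())


# Made-up data items, for illustration only.
data_items = {
    "SN-0001": ["index", "search"],
    "SN-0002": ["disk", "index"],
    "SN-0003": ["search"],
}
main_index = build_main_index(data_items)
```

Here each present value is accompanied by a small list of serial numbers; pointers to storage locations could take the place of the serial numbers without changing the structure.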
In both the spelling checker and the patent retrieval system, the process of searching the list is substantially accelerated by using the sparse index. Each chunk label in the sparse index represents a chunk of the list. Moreover, from each chunk label it can be deduced what range of present values is contained in the corresponding chunk. The sparse index, which is preferably small enough to be contained in primary memory, is used to determine in which chunk of the list the entered value must be located if it is present at all. Subsequently, only that "hit chunk" has to be searched. This is especially advantageous when the list is too large to be contained in primary memory and resides in a slower secondary memory, such as a disk. Then only the hit chunk has to be transferred from the secondary to the primary memory. Of course, this two-level data structure based on a sparse index, hereinafter called a "lexicon" for ease of reference, could be extended to a tree of an arbitrary number of levels. Also, the lexicon can be mapped onto the storage hierarchy in various alternative ways, in which, advantageously but not necessarily, higher levels of the lexicon are kept on faster storage media than lower levels.
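The two-level lexicon can be sketched as follows. This is a minimal sketch under stated assumptions: each chunk label is taken to be the first present value of its chunk (the text only requires that the label be representative of the chunk's range), the chunks are held in an ordinary Python list standing in for secondary memory, and the sparse index is searched with a binary search rather than a linear one. The names `build_lexicon` and `lookup` are hypothetical.

```python
from bisect import bisect_right


def build_lexicon(sorted_values, chunk_size):
    """Split a sorted list into chunks and label each chunk.

    Assumption: the chunk label is the first value of the chunk.
    """
    chunks = [sorted_values[i:i + chunk_size]
              for i in range(0, len(sorted_values), chunk_size)]
    sparse_index = [chunk[0] for chunk in chunks]
    return sparse_index, chunks


def lookup(sparse_index, chunks, entered):
    """Return True if the entered value is a present value of the list."""
    # Step 1: find the hit chunk label, i.e. the last label <= entered.
    pos = bisect_right(sparse_index, entered) - 1
    if pos < 0:
        return False  # entered value precedes every present value
    # Step 2: make the hit chunk accessible (stands in for a disk transfer).
    hit_chunk = chunks[pos]
    # Step 3: search only the hit chunk (a linear scan).
    return entered in hit_chunk


words = ["apple", "banana", "cherry", "date", "fig", "grape", "kiwi", "lemon"]
sparse_index, chunks = build_lexicon(words, chunk_size=3)
found_fig = lookup(sparse_index, chunks, "fig")
found_elderberry = lookup(sparse_index, chunks, "elderberry")
```

With these eight words and a chunk size of three, only the second chunk is inspected for both "fig" and "elderberry"; the other chunks are never touched.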
Looking up a single entered value in the lexicon involves a number of time-consuming operations. Loading (parts of) the list into primary memory is one delay factor, comprising the setting up of the transfer (e.g. head positioning and rotational latency of a hard disk drive) and the actual transfer itself. Another delay factor is the search process itself. When a linear search algorithm is applied (the most straightforward solution for variable-length present values), on average half the sparse index and half the hit chunk have to be searched.
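The "on average half" figure for linear search can be checked with a small count. The sketch below assumes that every present value is equally likely to be looked up and counts one comparison per inspected value; the function name `linear_search_cost` is hypothetical.

```python
def linear_search_cost(values, target):
    """Return the number of values inspected by a linear search."""
    for count, value in enumerate(values, start=1):
        if value == target:
            return count
    return len(values)  # target absent: every value was inspected


values = list(range(100))
# Average cost over all successful lookups, each equally likely.
average_cost = sum(linear_search_cost(values, v) for v in values) / len(values)
```

For a list of n values the average is (n + 1) / 2 inspections, which is about half the list; the same reasoning applies to the sparse index and to the hit chunk separately.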
Looking up a plurality of entered values is done by looking up each entered value individually. Consequently, the total search time is proportional to the number of entered values.
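The proportionality can be made visible by counting chunk transfers. The following is a minimal sketch, assuming the chunk label is the first value of its chunk and that nothing is cached between lookups; the class `Lexicon`, its `transfers` counter, and the function `lookup_each` are hypothetical names introduced for illustration.

```python
from bisect import bisect_right


class Lexicon:
    """Two-level lexicon that counts simulated chunk transfers."""

    def __init__(self, sorted_values, chunk_size):
        self.chunks = [sorted_values[i:i + chunk_size]
                       for i in range(0, len(sorted_values), chunk_size)]
        # Assumption: each chunk is labeled by its first value.
        self.sparse_index = [chunk[0] for chunk in self.chunks]
        self.transfers = 0

    def _load_chunk(self, pos):
        # Stands in for a transfer from secondary to primary memory.
        self.transfers += 1
        return self.chunks[pos]

    def contains(self, entered):
        pos = bisect_right(self.sparse_index, entered) - 1
        if pos < 0:
            return False
        return entered in self._load_chunk(pos)


def lookup_each(lexicon, entered_values):
    """Look up every entered value individually, in the order given."""
    return [lexicon.contains(value) for value in entered_values]


words = ["apple", "banana", "cherry", "date", "fig", "grape", "kiwi", "lemon"]
lexicon = Lexicon(words, chunk_size=3)
results = lookup_each(lexicon, ["fig", "date", "grape", "apple"])
```

Four entered values cause four chunk transfers here, even though three of them fall in the same chunk, because each lookup is handled in isolation.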