In an article entitled Search Day-Closer Closer Closer dated May 22, 2001, Chris Sherman, Associate Editor of www.searchenginewatch.com, characterized a key dilemma of persons using computer-based search engines to find meaning:
Both phrase searching and the NEAR operator offer simple but effective ways to take more control over your search. They only work in rather restricted circumstances. But they can often mean the difference between finding what you're looking for and wishing your results were closer, closer, closer.
Unfortunately, most search engines either never offered the “near” option or have since abandoned it. According to Chris Sherman, AltaVista appears alone among the major search engines in offering this capability. As of February 2004, experimenting with the AltaVista advanced search option shows that the searcher is offered no control over what is meant by the operator “NEAR.”
Occasional references can be found to proximity search systems that allow the user to specify the number of words within which the specified words must appear of each other. See, for example, http://www.wcsu.ctstateu.edu/library/odlis.html#proximity:
The search software of some bibliographic databases allows a proximity operator to be used in search statements to specify that a record will be retrieved only if the keywords typed as search terms appear within a designated number of words of each other, or within the same sentence or paragraph. The proximity operator is not standardized (in some databases it is ‘adj’ for adjacent to, in others it is ‘w’ for with).
Example: publication adj1 date or publication w1 date
In the example given above, the query will retrieve records in which the word ‘publication’ appears within one word of ‘date,’ for example, records containing the phrase date of publication or publication date (or both) and also date for publication, publication and date, publication to date, etc.
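The behavior of such a proximity operator can be illustrated with a minimal sketch. It assumes, based only on the examples quoted above, that “adj1” means at most one intervening word in either order. The function names `tokenize` and `within_n_words` are hypothetical and are not taken from any bibliographic database.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_n_words(text, first, second, n):
    """Return True if `first` and `second` occur with at most `n`
    intervening words between them, in either order.

    Assumption: this reading of 'adjN' is inferred from the quoted
    examples; real databases may define the operator differently.
    """
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= n
        for i in first_positions
        for j in second_positions
    )

# Phrases from the quoted example, plus one that is too far apart.
for phrase in ["publication date", "date of publication",
               "publication and date", "the date set for the publication"]:
    print(phrase, "->", within_n_words(phrase, "publication", "date", 1))
```

Under this reading, the first three phrases satisfy the operator and the last does not, because three words separate “date” from “publication”.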
These bibliographic search systems function effectively only across relatively small databases. They also tend to be table driven. They offer no solution for search across large sets of text data.
These examples are representative of the dilemma faced by persons searching in large quantities of text. Four problems face the user of conventional search engines for large text data sets.
(a) FALSE NEGATIVES: The only proximity control commonly available is to specify “exact phrase”; this results in relevant material that is missed because it contains intervening words or the words are not in the precise order specified. For example, a search for the exact phrase “that they may be one” would miss inexact phrases such as “that they may all be one” or “may be that they should become one”.
(b) GOOD FINDINGS BURIED AMONG AN OVERWHELMING NUMBER OF FALSE POSITIVES: A request for “all the words” results in an avalanche of unhelpful results in which the specified search terms are widely spread out and do not relate to the meaning desired by the searcher.
(c) LACK OF CONTROL OVER ADJACENCY: In search engines in which a “near” option is offered, the searcher may have limited control over the extent of adjacency, and presentation of the result of the search may be poorly ranked or (worse) inaccurate.
(d) DELAY: There is an inherent problem in proximity search, not solved prior to the instant invention: delay in the reporting of results caused by the requirement in older methods for a multiplicity of disk seeks in order to verify the actual distance between words in each and every record. These disk seeks also multiply the computing resources required to satisfy search requests.
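Problems (a) and (b) can be made concrete with a short sketch that compares an exact-phrase test against a measure of how tightly the search words cluster. The helper names `contains_exact_phrase` and `smallest_window` are hypothetical, and the sliding-window approach is one common way to measure closeness, offered here as an assumption rather than as the method of any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def contains_exact_phrase(text, phrase):
    """True only if the phrase's words appear consecutively and in order."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))

def smallest_window(text, terms):
    """Length in words of the shortest run of text containing every term,
    in any order, or None if some term is missing."""
    tokens = tokenize(text)
    needed = {t.lower() for t in terms}
    if not needed:
        return None
    counts = {}   # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers every term.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_tok = tokens[left]
            if left_tok in counts:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    del counts[left_tok]
            left += 1
    return best

query = "that they may be one"
for text in ["that they may all be one",
             "may be that they should become one"]:
    print(contains_exact_phrase(text, query),
          smallest_window(text, tokenize(query)))
```

Both inexact phrases from the example fail the exact-phrase test, yet each contains all five query words within a window of six or seven words. A record in which the same words are scattered across a whole page would yield a much larger window, which is the distinction that an “all the words” search cannot draw.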
The normal tradeoff in search across large quantities of text is to sacrifice accuracy for speed. Current search engines generally are very fast at finding records, but are woefully deficient at checking how close together words are in a record. The older “proximity search” methods would fetch every record that held the desired words at any distance and then check proximity; this process used excessive computer resources. Hence, without proximity control, the user is left with the job of poring over all the records found. To save resources, and thereby to annoy people, is a poor solution to a problem that confronts literally millions of computer users every day.
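For contrast with the fetch-and-check approach described above, the following sketch shows one widely known alternative: an index that stores word positions, so that distances can be computed from the index alone without reading the records back. This is offered only as an illustration of the tradeoff and is not represented as the method of the instant invention. All names (`build_positional_index`, `min_gap`, `proximity_search`) and the sample records are hypothetical. Here the gap is the difference in word positions, so adjacent words have a gap of 1.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_positional_index(records):
    """Map each word to {record_id: [word positions in ascending order]}."""
    index = defaultdict(dict)
    for record_id, text in records.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok].setdefault(record_id, []).append(pos)
    return index

def min_gap(positions_a, positions_b):
    """Smallest position difference between two sorted position lists,
    found with a single two-pointer pass."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best

def proximity_search(index, word_a, word_b, max_gap):
    """Return (record_id, gap) pairs, closest first, using only the index."""
    postings_a = index.get(word_a.lower(), {})
    postings_b = index.get(word_b.lower(), {})
    hits = []
    for record_id in postings_a.keys() & postings_b.keys():
        gap = min_gap(postings_a[record_id], postings_b[record_id])
        if gap is not None and gap <= max_gap:
            hits.append((record_id, gap))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))

records = {
    1: "publication date",
    2: "date of publication",
    3: "the publication appeared long after the original date",
}
index = build_positional_index(records)
print(proximity_search(index, "publication", "date", 2))
```

In this sketch the records are read once, when the index is built. Each later query touches only the position lists, and the hits are ordered by closeness, so the record with adjacent words is listed first and the record with widely separated words is excluded.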
Without a solution to the technical problem underlying computation of adjacency, the problem will worsen. We are experiencing exponential growth in the world supply of searchable data.
None of compression, indexing, or electronic books (ebooks) is new. However, the combination of the three in a system to enable efficient proximity search across large and growing quantities of text is novel.
What is needed, and desirable, is a system and method of response to text search requests which can return nearly immediate responses and apply relevance ranking based on closeness of fit (the best surrogate for meaning) to large quantities of text. One hundred percent compression of the text (i.e., eliminating the text) combined with other techniques would convey additional benefits in size of data handled, responsiveness in terms of speed of search, portability of results, fully formatted browsing, and convenience.