The development of efficient and effective text retrieval techniques is critical to managing the increasing amount of textual information available in electronic form. Until recently, information retrieval involved relatively small collections of machine-readable text, in the range of 100 megabytes of data. Networks bring together collections of information in the gigabyte range, and the increased amount of data makes the retrieval process more difficult.
There are two main tasks for text retrieval systems: (1) identifying the terms in documents that should be included in the index; and (2) once the terms are indexed, determining whether a document matches a query. Conventional text retrieval techniques rely on indexing keywords in documents. Index terms can be single words, noun phrases, or subject identifiers derived from syntactic and semantic analysis.
Conventional text retrieval systems for the World Wide Web, such as Yahoo!™ from Yahoo! Inc. and AltaVista™ from Digital Equipment Corporation, use these and other keyword indexing techniques to index documents available on the web.
Yahoo!, AltaVista, and other conventional text retrieval systems for the web employ programs called “web crawlers” to traverse the web. Web crawlers follow links from page to page and extract terms from all the pages that they encounter. Each search engine then makes the resulting information accessible by providing lists of specific pages that match an input search request or query.
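The traversal described above can be illustrated with a minimal sketch. It is not the implementation used by any of the named search engines. The in-memory `PAGES` table, the example URLs, and the `crawl` function are all hypothetical stand-ins for real network access, and the breadth-first visiting order is only one possible choice.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("text retrieval on the web", ["a.example/b", "c.example/"]),
    "a.example/b": ("ranking query results", ["a.example/"]),
    "c.example/": ("web crawlers follow links", []),
}

def crawl(start, pages):
    """Follow links breadth-first from `start` and build term -> set of URLs."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        # Extract terms from the page and record where each one occurs.
        for term in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(term, set()).add(url)
        # Queue every known link that has not been visited yet.
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("a.example/", PAGES)
```

Looking up a term in `index` then yields the list of pages that contain it, which is the kind of result list a search engine returns for a one-word query.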
The most common methods for determining whether a document matches a query are the “boolean model” and the “statistical model.” According to the boolean model, a match occurs when a document's index terms satisfy the boolean expression given by the user. The statistical model, on the other hand, is based on the similarity between statistical properties of the document and the query.
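The difference between the two models can be shown with a small sketch. The three sample documents and the function names are hypothetical. The boolean matcher handles only a conjunction of terms, and cosine similarity over raw term counts is assumed here as one simple example of a statistical measure; the text above does not prescribe a specific one.

```python
import math
from collections import Counter

# Hypothetical sample documents.
docs = {
    "d1": "text retrieval systems index text",
    "d2": "web crawlers index pages",
    "d3": "ranking retrieval results",
}

def terms(text):
    return text.lower().split()

def boolean_and_match(doc_text, required):
    """Boolean model (AND only): every required term must appear."""
    return set(required) <= set(terms(doc_text))

def cosine_similarity(doc_text, query_text):
    """Statistical model: cosine of the raw term-frequency vectors."""
    d, q = Counter(terms(doc_text)), Counter(terms(query_text))
    dot = sum(d[t] * q[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Boolean: a document either matches or it does not.
boolean_hits = [name for name, text in docs.items()
                if boolean_and_match(text, ["index", "retrieval"])]

# Statistical: every document gets a score and can be ranked.
ranked = sorted(docs,
                key=lambda name: cosine_similarity(docs[name], "text retrieval"),
                reverse=True)
```

The boolean model produces an unordered yes/no set, while the statistical model produces a graded score for every document, which is what makes ranking possible.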
It is not unusual for conventional search engines using either approach to return a large number of matches for a simple query. When faced with a list of 20,000 hits in response to a query—not an uncommon experience when searching the web—a user cannot effectively review all the results. Whether the user accesses the matches serially or randomly, the review process takes an unwieldy amount of time to locate the documents of particular interest.
Typically, web search engines provide the user with the first 10 hits and continue to provide additional blocks of 10 until the user finds something acceptable or gives up. If the user has a simple information need and the answer shows up in the first 10 or 20 hits, then this is not unreasonable. However, if the user has a serious research interest in the results, then it may be important to see the information available in the remaining hits.
Consequently, the criteria by which these hits are ranked become very important. More and more systems support some type of ranking feature because users have demanded easy-to-use query languages and ranking to sort out the most important information. Most retrieval techniques provide ranked results with scoring methodologies that depend on statistics of the indexed document collections.
This means that the scores assigned to documents in different collections, even when using the same scoring methodology, are not commensurate and cannot be used as an adequate basis for combining the ranked results from two separate searches. This poses a problem for distributing the indexing and retrieval processing among multiple nodes.
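A small worked example may make the problem concrete. It assumes a plain tf-idf score, with idf = log(N / df), as one representative collection-dependent scoring method; the source does not name a specific formula. The two collections and the function names are hypothetical.

```python
import math

def idf(term, collection):
    """Inverse document frequency: log(N / df), or 0.0 if the term is absent."""
    df = sum(1 for doc in collection if term in doc.split())
    return math.log(len(collection) / df) if df else 0.0

def score(doc, query, collection):
    """Sum of tf * idf over the query terms, using this collection's statistics."""
    words = doc.split()
    return sum(words.count(t) * idf(t, collection) for t in query.split())

# The same document is stored in two hypothetical collections.
doc = "ranking web search results"
collection_a = [doc, "web pages", "web links", "web crawlers"]
collection_b = [doc, "text index", "query model", "boolean match"]

score_a = score(doc, "web", collection_a)  # "web" occurs in every document
score_b = score(doc, "web", collection_b)  # "web" occurs in one of four documents
```

Under these assumptions the identical document scores zero for the query in the first collection and a positive value in the second, purely because the surrounding collections differ. Merging two result lists by raw score would therefore favor whichever node happens to hold the collection in which the query term is rare.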
At the same time, the growing volume of material to be indexed has required search engine designers to focus on techniques for efficiency and volume processing, rather than on techniques for guaranteeing the best possible rankings. The conflict between these two objectives, accurate search results and indexing huge collections of information, poses a significant problem for the developers of the next generation of text retrieval systems.