1. Field of the Invention
The present invention relates to an apparatus and accompanying methods for a system that accepts a query and fulfills an information need expressed by the query based on a body of information such as a collection of documents. The invention has particular utility in connection with text indexing and retrieval systems, such as retrieval of information from the World Wide Web.
2. Description of the Related Art
With the rapid growth in the amount of information available in the form of documents stored in databases has come an increased need to efficiently extract information relevant to a specific need. Traditional searching methods search and retrieve documents according to the words in a given input query. Search engines allow users to find documents containing one or more words or phrases, often referred to as keywords, found in the input query and return a list of relevant documents for the input query. For instance, with traditional search and retrieval methods, the input query
Clinton congress
returns the most relevant documents containing the word Clinton or the word congress or both words. Search engines may also permit the formulation of Boolean queries, which allow words in the query to be combined using logical operations such as AND, OR, and NOT. For example, using a traditional Boolean search engine, the query
Clinton AND congress AND (NOT Hillary)
selects documents that contain the word Clinton and the word congress but that do not contain the word Hillary.
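The Boolean selection described above can be illustrated with a minimal sketch. The function names `tokenize` and `boolean_match` and the three sample documents are hypothetical and serve only to illustrate the behavior; they do not reproduce any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase a text and return the set of word tokens it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_match(docs, required=(), excluded=()):
    """Return ids of documents that contain every required word (AND)
    and none of the excluded words (NOT)."""
    hits = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        has_all = all(w.lower() in words for w in required)
        has_excluded = any(w.lower() in words for w in excluded)
        if has_all and not has_excluded:
            hits.append(doc_id)
    return hits

# Hypothetical sample collection.
docs = {
    "d1": "Clinton addressed congress on Tuesday.",
    "d2": "Hillary Clinton met members of congress.",
    "d3": "Congress adjourned for the summer.",
}

# Equivalent of: Clinton AND congress AND (NOT Hillary)
result = boolean_match(docs, required=["Clinton", "congress"], excluded=["Hillary"])
print(result)
```

In this sketch only document d1 satisfies the query: d2 contains the excluded word, and d3 lacks the word Clinton.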
Another feature often found in queries used with traditional search engines is the ability to trigger a search for phrases in documents. For example, the query
“Bill Clinton”
retrieves documents that contain the exact phrase Bill Clinton while rejecting documents that contain the word Bill and/or the word Clinton separately. Examples of search engines offering these capabilities are search engines used with the World Wide Web such as Alta Vista™, Lycos™, Inktomi™, InfoSeek™, NorthernLight™, HotBot™, MSN Search™, Google™ and Yahoo!™. Additional search engines include those used for searching documents found in databases, digital libraries or other information sources such as InfoSeek UltraSeek Server™.
The result of a search using search engines such as those mentioned above is a list of relevant documents, generally displayed in some order, for example, from the most relevant document to the least relevant document. To present documents in an order, search engines rank the documents according to some metric. Typically, the ranking will first show documents containing the highest number of keywords. For example, FIG. 1 shows the result of a search on HotBot™ (http://www.hotbot.com) for the query presidential candidates. FIG. 1 shows the first 10 documents ranked from the most relevant document to the least relevant document. Each of the results consists of a description of a document. Each such description includes the title of the document, a description of the document, and its Internet Uniform Resource Locator (URL).
Many conventional search engines, and World Wide Web search engines in particular, suffer from a drawback in that they only allow for a fully specified query and do not allow for queries where a portion can be wholly or partly unspecified. For example, when the phrase Alexander Bell is searched using traditional technologies, all documents containing the phrase Alexander Bell will be returned. However, documents that contain the phrase Alexander Graham Bell will not be returned because the phrase Alexander Graham Bell is different from the phrase Alexander Bell specified in the query. Similarly, if one employs the Boolean query Alexander AND Bell to search for documents, the result of the search will include documents that contain the words Alexander and Bell anywhere in the document, regardless of where they occur in relation to each other. Thus, the search will return many documents that do not contain the information sought by a user. For example, the search will return documents that contain Alexander Heard and Packard Bell since such documents contain the words Alexander and Bell. Such documents are unlikely to be relevant to a search for documents about Alexander Bell. With most conventional search engines, as described above, it is not possible to search for the phrase Alexander followed by any word and followed by Bell, e.g., Alexander_Bell. Such a query should match documents that contain the string Alexander Graham Bell.
To provide more flexibility in searching, some search and query systems offer proximity operators. These operators allow a user to request that the words in the query be within a certain distance of each other in a document. For example, a query using the NEAR operator (a feature of the Alta Vista™ search engine) retrieves documents that contain the search terms within 10 words of each other. Thus a user seeking to learn the middle name of Alexander Bell could enter a query such as
Alexander NEAR Bell
This query will retrieve documents containing any of the strings Alexander Bell, Alexander Graham Bell, Alexander Heard and Packard Bell, Alexander heard the bell, Alexander Frederick Edward Bell, among others. Note that these strings contain several types of words such as verbs, determiners, etc., in addition to names. Thus the query results still suffer from a lack of specificity. Certain search and query systems such as DIALOG® allow a further refinement in that the user may request that the search terms occur within a given distance of each other within a document. For example, the (3W) operator requests that the search terms occur within three words of each other. This option offers more control over the query results but still retrieves documents based on the positional relationship of the search terms within the document, regardless of the actual content of the intervening text. Although reducing the number of intervening terms may reduce the number of irrelevant documents retrieved, in the case of many queries it may also result in failure to retrieve relevant documents.
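The lack of specificity of a proximity operator can be seen in a minimal sketch. The function `near` below is hypothetical; it assumes that distance is measured as the difference between word positions and that matching ignores case, which may differ from the behavior of any actual NEAR or (nW) operator.

```python
import re

def near(text, first, second, max_distance=10):
    """Return True if `first` and `second` occur within `max_distance`
    word positions of each other in `text` (assumed distance definition)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

# Hypothetical sample passages.
samples = [
    "Alexander Graham Bell patented the telephone.",
    "Alexander heard the bell ring.",
    "Alexander Heard and Packard Bell were both mentioned.",
]

# Equivalent of: Alexander NEAR Bell
matches = [near(s, "Alexander", "Bell") for s in samples]
print(matches)
```

All three passages satisfy the proximity test, although only the first concerns Alexander Graham Bell. Tightening `max_distance` changes only how many intervening words are tolerated; it places no restriction on what those words are.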
Furthermore, the query structure available in traditional search and query systems is not conducive to searching for information that may lie outside the bounds of the search terms. For example, a user who has forgotten the last name of George Washington Carver (an agricultural chemist and holder of U.S. Pat. Nos. 1,522,176 and 1,541,478) could conduct a search using the specified terms George and Washington. While yielding hundreds of documents related to George Washington, such a search would be unlikely to provide the desired information about George Washington Carver. There is no way for a user both to include fully specified terms such as George and Washington in the query and to clearly indicate a desire to retrieve a particular unspecified element, such as a term that follows the specified terms.
Another area in which conventional searching systems are limited is in their ability to search effectively for phrases. In prior art systems, identifying documents that contain phrases requires either indexing the phrases as single terms or identifying documents that contain the individual terms, locating the terms, and determining their positional relationship to each other. Indexing phrases as single terms is generally limited to phrases of two or at most three terms, and it is not feasible to index all or even a majority of phrases. Also, in order to provide information about the terms adjacent to or surrounding a phrase of interest, prior art systems must access the document that contains the phrase.
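The second of the prior art approaches described above, in which the individual terms are located and their positional relationship is then determined, can be sketched with a simple positional index. The names `build_positional_index` and `phrase_search` and the sample documents are hypothetical, and the sketch is illustrative only.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents in which the terms of `phrase` are adjacent and in order."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: documents that contain every individual term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    # Step 2: check the positional relationship of the terms.
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits

# Hypothetical sample collection.
docs = {
    "d1": "Bill Clinton spoke today.",
    "d2": "The bill was vetoed by Clinton.",
    "d3": "Clinton signed the bill.",
}
index = build_positional_index(docs)
print(phrase_search(index, "Bill Clinton"))
```

The index in this sketch records only where each term occurs. It identifies the documents that contain the phrase, but the words adjacent to the phrase can be recovered only by returning to the document text.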
In addition to the limitations discussed above, conventional searching systems suffer from a major drawback in that they typically provide only names, titles, URLs, file names, or identifiers of documents in response to a query. For example, conventional Web search engines provide only the name of a document containing a search term, together with a uniform resource locator (URL) for that document. The user must then access and examine the document in order to locate the desired information. Especially in the case of a large database such as the World Wide Web, this can be a significant burden, particularly if the query structure does not permit the user to accurately indicate what information is needed. Even in the case of those search engines that return a passage of text that includes the query terms in addition to returning an identifier for the document in which the text appears, the user may still need to scan through many such passages to locate desired information. Since it is not possible to place any restrictions on the text between or around the search terms, the user is likely to encounter a great deal of material that is irrelevant to the actual information desired. Furthermore, results are returned on a document-by-document basis. The order of the documents is frequently based on the number of occurrences of the search terms within each document. Traditional search and query systems do not perform an assessment of the results of the query across multiple documents. Thus the results provided by traditional search and query systems do not consider the information content of the database as a whole.
There exists a significant need for a search and query system that overcomes the various limitations described above. In particular, there is a need for a system that accepts a partially unspecified query and returns the actual text that matches the query, in addition to or instead of the list of documents that contain a match for the specified portion of the query. For example, there is a need for a search system that accepts the partially unspecified query Alexander_Bell and returns the middle name Graham or the string Alexander Graham Bell instead of or in addition to the list of documents which match the partially unspecified query. As another example, there is a need for a search system that accepts the partially unspecified query Microsoft Windows_, and returns the strings Microsoft Windows 95, Microsoft Windows 98, Microsoft Windows NT, and Microsoft Windows 2000 in addition to or instead of the list of documents that match the query. Furthermore, there is a need for a search and query system that accepts a partially unspecified query and allows the user to accurately indicate a specific information need by placing restrictions on the unspecified portion of the query. In addition, there is a need for an innovative approach to the tasks of searching for phrases and providing information about terms adjacent to the phrases.
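The desired behavior for a partially unspecified query can be illustrated with a minimal sketch. It assumes that the underscore stands for exactly one unspecified word and that matching is case-sensitive, and it uses a plain regular-expression scan over hypothetical sample text. The function `match_partial_query` is an illustrative name and is not presented as the method of the invention.

```python
import re

def match_partial_query(query, docs):
    """Return the strings in `docs` that match `query`, where each '_'
    is assumed to stand for exactly one unspecified word."""
    parts = []
    for term in query.replace("_", " _ ").split():
        parts.append(r"\w+" if term == "_" else re.escape(term))
    pattern = re.compile(r"\b" + r"\s+".join(parts) + r"\b")
    found = []
    for text in docs:
        found.extend(pattern.findall(text))
    return found

# Hypothetical sample collection.
docs = [
    "Alexander Graham Bell patented the telephone.",
    "Alexander heard the bell ring.",
    "Microsoft Windows 95 was followed by Microsoft Windows 98.",
    "Servers often ran Microsoft Windows NT.",
]

print(match_partial_query("Alexander_Bell", docs))
print(match_partial_query("Microsoft Windows_", docs))
```

Unlike a conventional result list, the output of this sketch is the matching text itself, so the unspecified portion (for example, Graham) is available without opening any document.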
Finally, when there are multiple different strings that match the partially unspecified query, there is a need to present those strings in a relevant order. In particular, there is a need for a search and query system that ranks the results based not on the contents of the individual documents that contain the search terms, but rather on the content of multiple documents. In other words, there is a need for a search system that considers the query in relation to the information available within a large body of information such as a large collection of documents rather than in relation to single documents.
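One way to order matching strings by the content of multiple documents is sketched below. The ranking metric, namely the number of distinct documents in which each matching string appears, is an assumption made for illustration, as are the function name `rank_across_documents` and the sample data; other metrics could serve the same need.

```python
from collections import Counter

def rank_across_documents(matches_per_doc):
    """Rank matching strings by the number of documents in which each appears.

    `matches_per_doc` holds one list of matching strings per document.
    Ties are broken alphabetically so that the order is deterministic.
    """
    doc_freq = Counter()
    for matches in matches_per_doc:
        doc_freq.update(set(matches))  # count each string once per document
    return sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical matches for the query "Microsoft Windows_" in four documents.
matches_per_doc = [
    ["Microsoft Windows 95", "Microsoft Windows 95"],
    ["Microsoft Windows 98", "Microsoft Windows 95"],
    ["Microsoft Windows NT"],
    ["Microsoft Windows 98", "Microsoft Windows NT", "Microsoft Windows 95"],
]

ranking = rank_across_documents(matches_per_doc)
for text, count in ranking:
    print(count, text)
```

Because the count is taken over the whole collection, a string repeated many times within a single document does not outrank a string that appears in many different documents.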