1. Field of the Invention
This invention relates to methods of analyzing text for relevant content, and more particularly to statistical methods for performing text analysis.
2. Background of the Invention
One of the most prevalent uses of text analysis today is by search engines on the Internet. These search engines take key words and search for relevant articles, web sites and discussions that include those words. The articles, web sites and discussions will be referred to hereinafter as stories. The use of text analysis also occurs in other types of text searching, such as computerized reference collections, electronic libraries and document management systems.
One challenge unique to the Internet is that content is being added to the collection of stories almost continuously. The number of stories that might have relevance to an inquiry can overload the user and bog down the system used for searching. This is especially true in view of the current methods for text searching employed by search engines and other types of text searching tools.
A current technique for text searching is shown in prior art FIG. 1. The query term is obtained in some manner, such as from a user entry or a user profile. If the query term is a single word, the story is searched for occurrences of that term. If the query term is a phrase, the story is searched for occurrences of each word of the phrase.
For single word query terms, a match results in the story being added to the list of stories that are relevant to the query. If the query term is a phrase, stories with each word of the phrase are added to a preliminary list associated with that word. An intersection of the preliminary lists is taken and those stories that are at the intersection of the preliminary lists (i.e., stories that contain all the words in the query phrase) are added to the list of stories that are relevant. A major drawback to this approach is that relevant stories may not contain the exact term. This problem is exacerbated when the information retrieval is based upon very concise documents, such as user profiles, and when the information itself is in the form of brief summaries such as news summaries that can be found at an Internet portal site like yahoo.com.
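The prior-art matching described above can be illustrated with a minimal Python sketch. The function name `search_stories` and the sample stories are hypothetical and are introduced only for illustration; whitespace tokenization is an assumption.

```python
def search_stories(stories, query):
    """Return indices of stories that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # One preliminary list (here, a set of story indices) per query word.
    preliminary = [
        {i for i, story in enumerate(stories) if word in story.lower().split()}
        for word in words
    ]
    # Stories at the intersection of the preliminary lists contain all words.
    return sorted(set.intersection(*preliminary))


# Hypothetical sample collection.
stories = [
    "cellular phone sales rise",
    "new phone models announced",
    "mobile handset makers report record sales",
]

print(search_stories(stories, "phone"))
print(search_stories(stories, "cellular phone"))
```

In this sample, the third story concerns phones but contains neither query word, so exact-term matching does not retrieve it. This is the drawback noted above.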
One way to overcome this problem is to add new terms to the original query that are related to the original terms. This task can be performed manually, but requires considerable expertise, both in searching and in the area being queried, an expertise most users lack. Performance of this task automatically falls under the category of Automated Query Expansion (AQE). There are three main approaches to AQE in the current literature.
The first approach is to use an online (electronic) thesaurus or dictionary such as WordNet. WordNet is a large, manually built, general-purpose semantic network, which models the lexical knowledge of a native English speaker. It is organized around groupings of words called synsets. Each synset contains synonymous words and relationships among them. The relationships take the form of IS-A, A-KIND-OF, etc. For example, using the relationship "a snake is A-KIND-OF animal," a query using the word snake may expand to include the word animal.
Discussion of these types of approaches can be found in "TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish," by Smeaton, et al., published in Fourth Text Retrieval Conference (TREC-4), Gaithersburg, Md., Nov. 1-3, 1995, (Smeaton) and "Information Access and Retrieval with Semantic Background Knowledge," by A. Chakravarthy, Ph.D. Thesis, MIT, Boston, Mass., 1995 (Chakravarthy).
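Thesaurus-based expansion of this kind might be sketched as follows. The `A_KIND_OF` table is a hypothetical stand-in for a semantic network and does not reproduce actual WordNet data or any WordNet programming interface; the function name and the `max_hops` parameter are likewise assumptions.

```python
# Hypothetical A-KIND-OF links for illustration only; not real WordNet data.
A_KIND_OF = {
    "python": ["snake"],
    "snake": ["animal"],
}


def expand_with_hypernyms(term, max_hops=1):
    """Expand a query term by following A-KIND-OF links up to max_hops times."""
    expanded = [term]
    frontier = [term]
    for _ in range(max_hops):
        next_frontier = []
        for word in frontier:
            for parent in A_KIND_OF.get(word, []):
                if parent not in expanded:
                    expanded.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return expanded


print(expand_with_hypernyms("snake"))
print(expand_with_hypernyms("python", max_hops=2))
```

A term that is absent from the table is returned unexpanded, which mirrors the coverage problem that arises when a general-purpose thesaurus is applied to a specialized vocabulary.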
A second category of AQE involves a pairwise association measure between words. The given corpus is analyzed to determine pairwise word associations and a query term is expanded to include terms having association values greater than a certain threshold. In one example, a pairwise mutual information value is determined from the context vectors of frequent words in the corpus. This example is discussed in "Corpus Analysis for TREC 5 Query Expansion," by Gauch, et al., in Fifth Text Retrieval Conference (TREC-5), Gaithersburg, Md., 1996 (Gauch).
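Threshold-based expansion from pairwise associations might be sketched as follows. This sketch substitutes a simple document-level pointwise mutual information for the context-vector computation of Gauch, which is not reproduced here; the function names, the sample corpus, and the threshold values are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Document-level pointwise mutual information for each cooccurring word pair."""
    n = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (a, b): math.log(n_ab * n / (word_df[a] * word_df[b]))
        for (a, b), n_ab in pair_df.items()
    }


def expand(term, table, threshold):
    """Add every word whose association with term exceeds the threshold."""
    related = set()
    for (a, b), score in table.items():
        if score > threshold:
            if a == term:
                related.add(b)
            elif b == term:
                related.add(a)
    return [term] + sorted(related)


# Hypothetical sample corpus.
docs = [
    "cellular phone",
    "cellular phone network",
    "phone call",
    "river bank",
    "bank loan",
]
table = pmi_table(docs)
print(expand("cellular", table, threshold=0.0))
print(expand("bank", table, threshold=0.0))
```

Each pair is stored once, so the measure is symmetric: a score above the threshold expands either word of the pair to include the other.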
The third category of AQE uses blind relevance feedback in accordance with Rocchio's algorithm. Blind feedback refers to the fact that relevance is not judged by the user but is determined by the system automatically. An initial search for the original query is performed and the retrieved documents are sorted according to some measure. The top few documents are assumed to be relevant for the original query. The original query is then expanded by using the terms in these relevant documents.
Articles discussing this approach include Mitra, et al. "Improving Automatic Query Expansion," ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206-214, Melbourne, Australia, 1998 (Mitra); Buckley, et al. "Automatic Query Expansion Using SMART: TREC-3," Fourth Text Retrieval Conference, Gaithersburg, Md., 1994 (Buckley); and Abberley, et al. "Retrieval of Broadcast News Documents with the THISL System," ICASSP '98, Seattle, Wash., 1998 (Abberley).
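The blind feedback loop might be sketched as follows. This is a deliberate simplification: it ranks documents by raw query-term counts and adds the most frequent new terms from the top documents, and it does not implement the vector weighting of Rocchio's algorithm. The function name, parameters, and sample documents are assumptions.

```python
from collections import Counter


def blind_feedback_expand(query_terms, docs, top_docs=2, new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents."""
    tokenized = [doc.lower().split() for doc in docs]
    query = [t.lower() for t in query_terms]

    def score(i):
        # Initial search measure: total occurrences of the query terms.
        return sum(tokenized[i].count(t) for t in query)

    ranked = sorted(range(len(docs)), key=score, reverse=True)
    # The top few matching documents are assumed relevant without user judgment.
    assumed_relevant = [i for i in ranked[:top_docs] if score(i) > 0]

    counts = Counter()
    for i in assumed_relevant:
        counts.update(w for w in tokenized[i] if w not in query)
    # Most frequent first; ties are broken alphabetically for determinism.
    additions = sorted(counts, key=lambda w: (-counts[w], w))[:new_terms]
    return query + additions


# Hypothetical sample documents.
docs = [
    "cellular phone network coverage",
    "cellular network upgrade",
    "river bank flooding",
]
print(blind_feedback_expand(["cellular"], docs))
```

With only a few short documents, the term counts taken from the assumed-relevant set are small and many of them tie, so the choice of added terms becomes largely arbitrary.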
However, all of these techniques still have serious drawbacks. A general-purpose database like WordNet has large coverage gaps when used for domains with their own specific vocabularies and sublanguages. Technology business news, for example, or reports on advances in jargon-heavy technical fields will use terms that general-purpose databases do not recognize. It would be prohibitively time consuming to create a semantic network or to add terms to WordNet for each possible domain.
Another problem with the above approaches occurs with ambiguous terms. For example, the word "bank" may result in an expansion on rivers or an expansion on financial matters. Finally, methods using blind feedback are promising and are currently very popular. However, when the number of documents in the database is low, and/or the documents are short, blind feedback runs into problems.
Pairwise association techniques such as the one discussed above usually use a symmetric cooccurrence matrix of words. In a symmetric matrix, the occurrence of one word of the pair triggers an expansion to include the other word. This can be problematic when one word is fairly common. For example, the use of the word cellular in a query would result in the expansion to include the word phone. This is probably not inaccurate because cellular commonly refers to phones. However, the use of a symmetric matrix results in the addition of the word cellular whenever the word phone is used. Given that phone is a fairly common word, this could result in an unnecessary expansion and irrelevant results. An example of an approach using this type of matrix is shown in U.S. Pat. No. 5,675,819, issued Dec. 7, 1995.
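The difference between symmetric and asymmetric association can be illustrated with a short sketch. The asymmetric measure used here, the fraction of stories containing word a that also contain word b, is an assumption chosen for illustration and is not taken from the cited patent; the function names, sample stories, and threshold are likewise hypothetical.

```python
from collections import Counter
from itertools import permutations


def conditional_cooccurrence(docs):
    """Map each ordered pair (a, b) to the share of docs with a that also have b."""
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        word_df.update(words)
        # Ordered pairs, so (a, b) and (b, a) are stored separately.
        pair_df.update(permutations(words, 2))
    return {(a, b): n_ab / word_df[a] for (a, b), n_ab in pair_df.items()}


def expand(term, assoc, threshold=0.5):
    """Expand only in the direction term -> other word."""
    related = sorted(b for (a, b), s in assoc.items() if a == term and s > threshold)
    return [term] + related


# Hypothetical sample stories: "phone" is common, "cellular" is not.
docs = [
    "cellular phone plan",
    "phone call",
    "phone book",
    "phone line",
    "cellular phone tower",
]
assoc = conditional_cooccurrence(docs)
print(assoc[("cellular", "phone")], assoc[("phone", "cellular")])
print(expand("cellular", assoc))
print(expand("phone", assoc))
```

Because the two directions are scored separately, a query on cellular expands to include phone, while a query on the more common word phone is left unexpanded.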
Other types of query expansion techniques have also been patented. U.S. Pat. No. 5,926,811, issued Jul. 20, 1999, forms a statistical thesaurus. However, the techniques used are not as sophisticated or exacting as those using pairwise associations or matrices. Finally, a method of tagging text by identifying the part of speech of a given word is shown in U.S. Pat. No. 5,721,902, issued Feb. 24, 1998.
Therefore, a method for more accurate query expansion that takes into account such things as asymmetrical pairwise word associations and domain-specific words and phrases is needed.
One aspect of the invention is a method for retrieving relevant stories from a collection of stories. The method includes the steps of identifying at least one query term, and applying a cooccurrence matrix to the query term to provide a list of query terms. Then, it is determined whether a story in the collection contains any terms on the list of query terms. If the story does contain terms on the list of query terms, a relevance measure is increased. If the relevance measure is higher than a threshold, the story is added to a list of relevant stories.
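The steps of this aspect might be sketched as follows. The sketch is illustrative only: the nested-dictionary form of the cooccurrence matrix, the rule that each matched term increases the relevance measure by its association weight, the threshold values, and all names are assumptions that do not come from the description above.

```python
def retrieve_relevant_stories(query_term, stories, cooccurrence,
                              expand_above=0.5, relevance_threshold=0.5):
    """Expand a query term with a cooccurrence matrix, then score each story."""
    # Apply the matrix: the original term plus sufficiently associated terms.
    weights = {query_term: 1.0}
    for other, score in cooccurrence.get(query_term, {}).items():
        if score > expand_above:
            weights[other] = score

    relevant = []
    for story in stories:
        words = set(story.lower().split())
        relevance = 0.0
        for term, weight in weights.items():
            if term in words:
                # Assumed rule: a matched term adds its association weight.
                relevance += weight
        if relevance > relevance_threshold:
            relevant.append(story)
    return relevant


# Hypothetical asymmetric matrix: row term -> {associated term: weight}.
cooccurrence = {
    "cellular": {"phone": 0.9, "plan": 0.3},
    "phone": {"cellular": 0.2},
}
stories = [
    "cellular phone sales rise",
    "new phone models announced",
    "river bank flooding",
]
print(retrieve_relevant_stories("cellular", stories, cooccurrence))
```

In this sample, the second story does not contain the query term cellular but is still retrieved, because the expanded list of query terms includes phone.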