The present invention relates generally to the field of information retrieval. In particular, the present invention relates to a method and apparatus for selecting the optimal number of terms for retrieving documents using a vector space analysis technique.
Advances in electronic storage technology have resulted in the creation of vast databases of documents stored in electronic form. These databases can be accessed from remote locations around the world. As a result, vast amounts of information are available to a wide variety of individuals. Moreover, information is not only stored in electronic form but is also created in electronic form and disseminated throughout the world. Sources for the electronic creation of such information include news services and periodicals, as well as radio, television, and Internet services. All of this information is also made available to the world in real time through computer networks, such as the World Wide Web. The problem with this proliferation of electronic information, however, is how any one individual may access useful information in a timely manner.
When a user wants to search for information, she may provide a computer system with a query or description of her interest. For example, a user interested in sports may type the query “basketball from Olympics '96” (the query is a phrase) or may just type the terms “basketball” and “Olympics '96”. Using grammar rules and a lexicon, a search engine may extract the terms from the query and construct its internal representation of the query, called a profile. In the above examples, the profile will contain the terms “basketball” and “Olympics '96”.
Profile training is the process of improving the formulation of a profile using a set of documents that the user considers representative of her interest (training data). The search engine extracts new terms from the training data and adds them to the initial profile. For example, after entering the query “basketball from Olympics '96”, the user may point the system to an article that describes a basketball game from two days ago. From this article, the system extracts the terms “basketball”, “game”, “ball”, and “score”. The profile will then contain the terms “basketball”, “Olympics '96”, “game”, “ball”, and “score”. The user may even provide no initial description of her interest (in which case the initial profile is empty) and simply give the system some training data from which to extract terms. In the above example, without an initial description, the profile will contain only the terms extracted from the article: “basketball”, “game”, “ball”, and “score”.
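The merging step in the example above can be sketched as follows. This is a minimal illustration only: the function name train_profile and the choice to keep terms in first-seen order are assumptions made for the sketch, not details given in the description.

```python
def train_profile(initial_terms, extracted_terms):
    """Return a profile holding the initial terms plus any newly extracted ones.

    Hypothetical helper: terms already in the profile are not added twice,
    and first-seen order is preserved (an assumption of this sketch).
    """
    profile = list(initial_terms)
    seen = set(profile)
    for term in extracted_terms:
        if term not in seen:
            profile.append(term)
            seen.add(term)
    return profile


# Terms extracted from the training article in the example above.
article_terms = ["basketball", "game", "ball", "score"]

# With an initial description, and with an empty initial profile.
with_query = train_profile(["basketball", "Olympics '96"], article_terms)
without_query = train_profile([], article_terms)
```

With the initial query, the sketch yields the five-term profile listed above; with an empty initial profile, it yields only the four terms taken from the article.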
The main input components of profile training are the initial description, the training database, the reference database, and the terms extraction algorithm. The training database contains articles that match the user's interest (training data). The terms extraction algorithm extracts terms from the training data and adds them to the profile. The reference database contains information that helps the extraction algorithm decide whether or not to include in the profile a term from the training data. This is necessary because the training data may contain terms that are not related to the user's interest and that, if included in the profile, may cause non-relevant documents to be returned. In the above example, if the training article mentions that a basketball player likes the piano, then adding the term “piano” to the profile will make the search engine retrieve articles related to music, which do not correspond to the user's interest in basketball. The assumption in using a reference database is that it lets the terms extraction algorithm differentiate between the terms in the training data that are linked to the user's interest and the terms that are not.
Typically, the training documents contain a large number of terms. Selecting only the most representative terms from this set can improve efficiency and effectiveness of the retrieval process. To make use of training documents, a terms extraction algorithm creates a list of all the terms from the training data. To every term it attaches a weight based on the information in the reference database. The terms are then sorted in decreasing order of their weights, such that the term with the highest weight is the first. If the search engine wants to add to the profile n terms from the training data, then the first n terms from the sorted list of terms are added to the profile. Therefore, to train a profile we need two important elements: (1) a method to assign weights to the terms, and (2) a cut-off method to determine the number of terms to be added in a profile.
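The weight, sort, and cut-off sequence described above can be sketched as follows. The weighting formula here is purely an assumption chosen for illustration (training document frequency scaled by an inverse reference document frequency); the description does not prescribe it, and the names weight_terms, select_top_terms, reference_df, and reference_size are hypothetical.

```python
import math


def weight_terms(training_docs, reference_df, reference_size):
    """Attach a weight to every term found in the training documents.

    Assumed weighting (not taken from the description): the number of
    training documents containing the term, multiplied by the log of an
    inverse document frequency measured on the reference database.
    """
    training_df = {}
    for doc in training_docs:
        for term in set(doc):  # count each term at most once per document
            training_df[term] = training_df.get(term, 0) + 1

    weights = {}
    for term, df in training_df.items():
        ref_df = reference_df.get(term, 0)
        weights[term] = df * math.log((reference_size + 1) / (ref_df + 1))
    return weights


def select_top_terms(weights, n):
    """Sort terms by decreasing weight and keep the first n."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Illustrative data only; the counts below are invented for the sketch.
training_docs = [
    ["basketball", "game", "piano"],
    ["basketball", "score", "game"],
]
reference_df = {"basketball": 10, "game": 200, "piano": 50, "score": 100}
weights = weight_terms(training_docs, reference_df, reference_size=1000)
top_two = select_top_terms(weights, 2)
```

In this sketch the value of n is supplied by the caller; how to choose it is the cut-off question that element (2) raises.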
There have been many term selection methods proposed in the literature based on the vector space and probabilistic models. Regardless of the method, the number of terms in a profile is generally a value for which experiments show reasonable behavior (e.g., the first 30 or 50 terms), and it is constant across all profiles. There are also methods that associate a different number of terms with each profile. One example is to compute the number of terms with the formula 10+10 log(T), where T is the number of training documents per profile. However, the number of terms chosen according to such a formula is generally too large, and there are many cases in which more flexibility is needed. For example, there are document collections in which many profiles achieve their best average precision with just one term. Another method is to compute the sum of the weights of all the terms and add terms to a profile until a specified fraction of the sum is reached. This approach, again, may not detect situations in which profiles need very few terms.
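The two prior cut-off methods mentioned above can be sketched as follows. The function names are hypothetical, the base of the logarithm is not stated in the text (base 10 is assumed here and can be changed through the log parameter), and rounding to the nearest integer is likewise an assumption.

```python
import math


def term_count_from_training_size(num_training_docs, log=math.log10):
    """Number of terms given by 10 + 10 log(T).

    The logarithm base is an assumption (base 10 by default); the result
    is rounded to the nearest integer, also an assumption.
    """
    return round(10 + 10 * log(num_training_docs))


def term_count_from_weight_fraction(sorted_weights, fraction):
    """Count terms, in decreasing weight order, until the running sum of
    weights reaches the given fraction of the total weight.

    Assumes non-negative weights already sorted in decreasing order.
    """
    total = sum(sorted_weights)
    running = 0.0
    for count, weight in enumerate(sorted_weights, start=1):
        running += weight
        if running >= fraction * total:
            return count
    return len(sorted_weights)


# Illustrative values only.
count_one_doc = term_count_from_training_size(1)
count_hundred_docs = term_count_from_training_size(100)
count_half = term_count_from_weight_fraction([9.0, 3.0, 2.0, 1.0, 1.0], 0.5)
count_three_quarters = term_count_from_weight_fraction([9.0, 3.0, 2.0, 1.0, 1.0], 0.75)
```

Because log(T) is non-negative for any T of at least 1, the first formula never yields fewer than 10 terms, so it cannot produce the one-term profiles noted above. The second method can return a single term, but only when one weight dominates the total.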
It is an object of the present invention to provide a method and apparatus to effect improved information extraction from a variety of data sources.
It is a further object of the present invention to improve information extraction by selecting the appropriate number of terms in creating a profile.
It is a further object of the present invention to improve information extraction by minimizing the number of terms in a profile.