1. Field of the Invention
The present invention relates generally to information processing and data retrieval, and in particular to text processing systems.
2. Background
Text processing systems are often used to process the text of documents in a collection. As is known in the art, such processing can improve a user's ability to search for and retrieve the documents in the collection that are most conceptually relevant to the user's search query. U.S. Pat. No. 4,839,853 to Deerwester et al., for example, describes a text processing method in which words and phrases in a collection of documents are mapped into a Latent Semantic Indexing (LSI) space.
The concept of stop words naturally arises in text processing systems. Stop words are words that add little semantic value because they are so common in the collection of documents. In ordinary English text, stop words are the ubiquitous “the”, “and”, and similar words. In specialized domains, stop words can also include other words that are used so often that they add no value to the semantic content of the documents. For example, in a collection of Reuters news articles, each article will contain the word “Reuters.” In this context, the word can be treated as a stop word because it adds no meaning to the text of the article.
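The notion of commonness described above can be illustrated with a short Python sketch that measures the fraction of documents in which each word appears. The three sample documents and the function name `document_frequency` are made up for illustration, and the sketch only illustrates why a word such as “Reuters” carries no distinguishing information in such a collection; it is not the method of the present invention.

```python
import re
from collections import Counter

def document_frequency(documents):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for doc in documents:
        # Count each word at most once per document.
        counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    return {word: n / len(documents) for word, n in counts.items()}

# Hypothetical sample collection (assumption, not from the source).
docs = [
    "Reuters - Oil prices rose on Monday",
    "Reuters - The central bank held rates",
    "Reuters - Shares fell as the bank reported losses",
]

df = document_frequency(docs)

# Words that occur in every document of the collection.
ubiquitous = sorted(w for w, f in df.items() if f == 1.0)
print(ubiquitous)
```

In this small sample, the only word present in every document is the domain-specific one, which is why it behaves like a stop word in this collection even though it would not appear on a general English stop word list.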
A typical text processing system requires a user to define the set of stop words that the system is to ignore. However, this is not always the best way to determine the stop words in a collection of documents. For example, polysemous words can make the use of a list of stop words problematic. A polysemous word is a word that has multiple senses, such as the word “can.” One sense of the word “can” is associated with an ability to do something, as in “she can read.” Another sense of the word “can” is associated with a packaging device, as in “a can of beans.” For many user queries, including the word “can” in a list of stop words would be harmless. In a query of documents about shipping and packaging, however, the sense of the word “can” as a packaging device may be relevant to the user's query. Because a list of stop words removes a word regardless of the sense in which it is used, this potentially relevant sense of the polysemous word “can” may be eliminated from the user's query.
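The problem with a user-supplied list can be seen in a minimal Python sketch of conventional stop word filtering. The list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical and chosen only for illustration; they are not drawn from any particular system.

```python
import re

# Hypothetical user-supplied stop word list (assumption, not from the source).
STOP_WORDS = {"a", "the", "of", "and", "can", "she"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop listed stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

# "can" in the sense of ability: removing it loses little.
ability_sense = remove_stop_words("She can read")

# "can" in the sense of a packaging device: the same rule removes it as well.
packaging_sense = remove_stop_words("a can of beans")

print(ability_sense, packaging_sense)
```

Because the filter matches on the word form alone, both senses of “can” are discarded, including the packaging sense that a query about shipping and packaging may depend on.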
Given the foregoing, what is needed is a method and computer program product to automatically identify and compensate for stop words in text processing systems. Such a method and computer program product should not require a user to provide a list of stop words. In addition, such a method and computer program product should retain information about polysemous words that may be relevant in certain contexts. Moreover, such a method and computer program product should be language independent.