1. Field of the Invention
The present invention relates generally to systems and methods for information management, and more particularly to minimally predictive feature identification.
2. Discussion of Background Art
A great deal of work, both in research and in practice, in the fields of information retrieval and machine learning for text classification begins by eliminating stopwords. Stopwords are typically words within a document which, depending upon the application, are of minimal use in carrying out that application. Such stopwords are preferably removed from consideration by a human editor in whose judgment these words will not be of use during some predetermined information processing task.
In one application stopwords could be common words such as: “a”, “the”, “and”, and the like. For example, when a web search engine indexes web pages, it typically does not build reverse indices for words such as “a”, “the” and “of”.
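As an illustration only, the indexing behavior described above can be sketched as follows. The stopword set, the function name build_index and the sample pages are hypothetical and are not taken from any particular search engine; the sketch merely assumes whitespace tokenization and a fixed, manually chosen stopword list.

```python
from collections import defaultdict

# Hypothetical, manually chosen stopword list (an assumption for illustration).
STOPWORDS = {"a", "the", "and", "of"}


def build_index(pages):
    """Map each non-stopword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            # Stopwords are skipped, so no index entry is ever built for them.
            if word not in STOPWORDS:
                index[word].add(page_id)
    return dict(index)


# Hypothetical sample pages.
pages = {
    "p1": "the history of the printer",
    "p2": "a printer and a scanner",
}
index = build_index(pages)
```

In this sketch the index holds entries for “history”, “printer” and “scanner” only, so the storage that would otherwise be spent on “a”, “the”, “and” and “of” is saved.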
Other applications include programs which attempt to analyze and categorize large document collections (e.g. customer support call logs, text strings, survey data, etc.) detailing a variety of customer issues and solutions provided thereto. Such document collections typically include a great many stopwords, which tend to make analysis and categorization of the document collection overly complex and often yield somewhat confusing results and category descriptions. For example, analysis applications which use word counting techniques to generate lists of the most frequently occurring words (or capitalized names, noun phrases, or the like) tend not to be very informative, since such lists include a large number of stopwords (e.g. “of” and “the”) and other useless words unless the list has been manually tailored for the set of documents.
Thus, eliminating stopwords from a document collection before such collection is further processed can greatly reduce an application's use of computational and storage resources without significantly affecting the results.
Some current approaches for eliminating stopwords include:
1) Manual Editing: Stopword lists have traditionally been constructed manually, based on an individual's judgment as to which words in a document collection are not important in the context of a particular information processing application;
2) Use of pre-existing stopword lists: Because stopword lists require such an effort to construct, users (especially researchers) often re-use existing lists of words from other projects and public lists. A significant problem with such an approach, however, is that stopword lists are known to be dependent on the document collection at hand. For example, in one application “can” might be considered a stopword (e.g. “I can see.”). However, in another application for glass and can recycling, “can” would tend not to be a stopword, and eliminating it would be devastating to a classifier tasked with the problem of separating documents about the two types of recycling. Similarly, stopwords are often dependent upon the document collection's language. For instance, documents written in German necessarily require a different stopword list from those written in French;
3) Popular words as stopwords: In this approach, a computer counts the frequency of various words within a document collection and defines the most frequent words as stopwords. One disadvantage of such an approach is that many frequently occurring words are indeed useful for discriminating and managing documents. For example, in a collection of tech support documents that is 95% from Company-A and 5% from Company-B, the word “Company-A” might appear to be a stopword; however, people who are searching the document collection may wish to specifically identify or exclude documents from “Company-A”; and
4) Feature selection for identifying stopwords: Attempts to apply feature selection techniques in the field of machine learning to focus on the predictive words fall short since current feature selection techniques do not work unless the words in a document collection have already been organized into pre-defined categories (i.e. labeled). Even then, any predictive effect is limited to whether any given set of words is more or less predictive for a given predefined category or label, and not as to the document collection as a whole. In typical settings, no categories or labels are provided, and therefore current feature selection techniques cannot be applied to determine stopwords.
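By way of illustration only, the frequency-based approach of item 3) above, and its disadvantage, can be sketched as follows. The function name most_frequent_words, the cutoff top_n and the sample documents are hypothetical; the sketch assumes whitespace tokenization and a collection that is 95% from Company-A and 5% from Company-B.

```python
from collections import Counter


def most_frequent_words(documents, top_n):
    """Return the top_n most frequent words, treated here as stopword candidates."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical collection: 19 of 20 documents (95%) mention Company-A.
documents = (
    ["the company-a printer will not print the page"] * 19
    + ["the company-b scanner will not scan the page"]
)

# The cutoff of 7 is an arbitrary assumption made for this example.
candidates = most_frequent_words(documents, 7)
```

In this sketch “company-a” falls within the candidate list alongside “the”, even though a user may well wish to search on it, whereas the rare “company-b” does not.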
In response to the concerns discussed above, what is needed is a system and method for stopword identification that overcomes the problems of the prior art.