With the advent of computer-based communications, the concept of text can mean many different things, such as online surveys, feedback forms, chat dialog, social media interactions and conversations, and so forth. These types of unstructured computer text are present across all business domains in a variety of forms. Often, the unstructured computer text contains stopwords, which are generally common words in a given language. These stopwords are considered 'noise' in unstructured text: they add little value to analytics and need to be removed in order to improve the quality of the unstructured computer text for downstream applications that aim to understand critical aspects of the text content, such as intent and sentiment.
Present computer systems that handle removal of stopwords from unstructured computer text typically use predefined lists of generic stopwords, and add to those lists over time as new stopwords are identified. However, such systems suffer from the problem that what counts as informative varies from domain to domain. For example, the word "what" may be a stopword in the context of a computerized search engine, but a question-answering computer system may consider the word "what" to be highly relevant and not a stopword.
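The list-based approach described above can be sketched as follows. This is a minimal illustration, not any particular system's implementation; the stopword set shown is a small sample for demonstration.

```python
# A minimal sketch of list-based stopword removal. The set below is a
# small illustrative sample of a predefined generic stopword list.
GENERIC_STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "what"}

def remove_stopwords(text, stopwords=GENERIC_STOPWORDS):
    """Drop tokens found in the stopword list (naive whitespace tokenization)."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)

# remove_stopwords("what is the account balance") -> "account balance"
```

Note how the example also exhibits the domain problem just described: the same list that helps a search engine silently deletes "what", which a question-answering system would need to keep.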
As such, these computer systems typically fail to account for domain-specific considerations. For example, certain organizations may use terms or words that have meaning to the organization (e.g., ID numbers, dollar values, entity references, abbreviations, and the like) but are still considered stopwords that degrade the quality of the unstructured computer text. These domain-specific stopwords are noise that must be removed, but they are not generic enough to be filtered out by standard techniques.
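Domain-specific tokens of the kind mentioned above are often handled with hand-written patterns. The sketch below shows the idea under illustrative assumptions; the ID and dollar-value patterns are hypothetical and each new token type would require yet another pattern, which hints at why this approach scales poorly.

```python
import re

# Hypothetical patterns for the domain-specific tokens mentioned above
# (ID numbers, dollar values); purely illustrative, not a real system's rules.
DOMAIN_PATTERNS = [
    re.compile(r"\b[A-Z]{2}\d{6,}\b"),    # e.g., internal ID numbers like AB123456
    re.compile(r"\$\d[\d,]*(\.\d{2})?"),  # e.g., dollar values like $1,250.00
]

def strip_domain_tokens(text):
    """Remove domain-specific stopword tokens matched by the patterns."""
    for pat in DOMAIN_PATTERNS:
        text = pat.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace
```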
Existing stopword removal techniques, such as defining regular expressions, do not solve the above-noted problem: such regular expressions are cumbersome to define, and the number of possible token combinations they must cover is very high. Present computer systems that leverage approaches such as Poisson distributions or Kullback-Leibler (KL) divergence typically do not provide enough coverage for precise stopword removal over large amounts of unstructured computer text. Finally, existing computer systems use statistical quality measures, such as how frequently terms appear or how informative terms are, to identify stopwords, but such techniques are generally inaccurate and fail to capture all of the critical stopwords in the text.
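One way such divergence-based measures can be sketched: score each term by the KL divergence between its distribution over documents and a background (document-length) distribution. A term that occurs roughly in proportion to document length everywhere scores near zero and is a stopword candidate, while a bursty, topical term scores high. This is a generic illustration of the family of techniques referenced above, not any specific system's method.

```python
import math
from collections import Counter

def stopword_scores(docs):
    """Score terms by KL divergence D(P_t || Q), where P_t(d) is the fraction
    of term t's occurrences falling in document d, and Q(d) is the fraction
    of all tokens in d. Low scores suggest stopword candidates."""
    tokenized = [d.lower().split() for d in docs]
    total_len = sum(len(toks) for toks in tokenized)
    background = [len(toks) / total_len for toks in tokenized]  # Q(d)

    term_doc = Counter()    # (term, doc_index) -> count of t in d
    term_total = Counter()  # term -> total count of t
    for i, toks in enumerate(tokenized):
        for tok in toks:
            term_doc[(tok, i)] += 1
            term_total[tok] += 1

    scores = {}
    for term, total in term_total.items():
        kl = 0.0
        for i, q in enumerate(background):
            p = term_doc.get((term, i), 0) / total  # P_t(d)
            if p > 0:  # zero-probability terms contribute nothing
                kl += p * math.log(p / q)
        scores[term] = kl
    return scores
```

On a toy corpus, a word like "the" that is spread across all documents scores lower than a term concentrated in one document, which is exactly the coverage limitation the passage notes: such scores separate only the statistically uniform terms and miss domain-specific stopwords that are rare but uninformative.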