1. Field of the Invention
The present invention relates generally to systems and methods for data characterization, and more particularly to category discovery.
2. Discussion of Background Art
Companies offering technical support to customers often accumulate a large collection of documents (e.g. call logs, text strings, survey data, etc.) detailing a variety of customer issues and solutions provided thereto. While such a huge amount of information could be very useful in tracking and responding to customer concerns (e.g. “What are some of the most common types of trouble our customers are facing?”), such a large body of data tends to quickly become unwieldy over time due to the sheer volume of documents involved and the challenge of properly categorizing the documents and retrieving the information therein.
Companies that can successfully exploit such a wealth of information, however, may be able to reduce their customer support and warranty costs, improve customer service, and provide better customer self-help resources on their external web sites.
In some cases, large document collections, such as those available to help-desk providers or managed within a library, are manually categorized by the web site administrators or librarians who have carefully constructed them. Each document is manually analyzed and tagged with a best-guess set of topic categories. For example, a random sample of 100 or 1000 documents is selected from a document collection, and one or more people manually go through the sample to ‘think up’ what the big topic categories appear to be. Such manual categorization is slow and expensive, and must be repeated as new documents are added to the collection and old documents are removed.
Current automated techniques for categorizing such documents and labeling the resulting categories (e.g. word counting or clustering) also tend not to work very well. The categories and labels generated by such automated methods tend either not to be very meaningful or, in some cases, to be very confusing.
For example, word counting techniques use a computer to generate a list of the most frequent words (or capitalized names, noun phrases, or the like). Such a list, however, tends not to be very informative and tends to include a large number of stopwords (e.g. “of” and “the”) and other useless words unless the list has been manually tailored for the set of documents. Also, since a common issue may be described using several different words, many words and phrases in the list may all refer to the same root issue.
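The word counting limitation described above can be illustrated with a minimal sketch. The function name `most_frequent_words`, the sample call logs, and the stopword list are hypothetical and chosen only for illustration; they are not taken from any particular prior art system.

```python
import re
from collections import Counter


def most_frequent_words(documents, top_n=5, stopwords=frozenset()):
    """Return the top_n (word, count) pairs across all documents.

    Hypothetical helper: lowercases the text, splits it into alphabetic
    tokens, and skips any token found in the optional stopword set.
    """
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical call logs; the first two describe the same root issue
# using different words ("jam" versus "stuck").
call_logs = [
    "The printer will not print because of the paper jam",
    "Paper is stuck in the tray of the printer",
    "The display of the laptop is cracked",
]

# Without tailoring, stopwords dominate the list.
raw = most_frequent_words(call_logs, top_n=3)

# A manually tailored stopword list is needed to surface topical words.
manual_stopwords = {"the", "of", "is", "in", "not", "will", "because"}
filtered = most_frequent_words(call_logs, top_n=3, stopwords=manual_stopwords)

print(raw)
print(filtered)
```

In this sketch the untailored list is led by “the” and “of”, and even after manual filtering the words “jam” and “stuck” are counted separately although both refer to the same paper-feed issue.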
Other approaches use text analysis software, such as TextAnalyst from Megaputer, which counts noun phrases. Such text analysis software, however, tends to produce poor category trees, since the same basic topic can appear in multiple categories if different words are used in the documents or if there are misspellings.
Two different document categorization methods using clustering have been attempted as well. In the first approach, the documents themselves are clustered and partitioned. This first approach, however, tends not to work well with technical documents, resulting in many meaningless clusters and distributing a single topic area over many different clusters. A second approach, such as that used by PolyVista Inc., clusters words in the documents as mini-topics. The end effect of this second approach, however, is much like that of the word count analysis discussed above, and the same types of issues tend to be distributed over multiple overlapping clusters. Such mini-topic clustering also tends to generate categories which contain many stopwords.
In response to the concerns discussed above, what is needed is a system and method for category discovery that overcomes the problems of the prior art.