The present invention generally relates to a system and method for classifying and analyzing data, and is particularly applicable to a method for automatically generating a list of "Frequently Asked Questions" or FAQs, by analyzing data sets describing calls and responses received at a help desk.
As technology becomes ever more pervasive, it has become increasingly common for organizations to provide a helpdesk service to their customers. Typically, a customer will call the helpdesk to ask for information and to seek solutions to problems relating to the operation of products, the performance of services, necessary procedures and forms, etc.
Typically, helpdesks are staffed by knowledgeable human operators, who often spend considerable time with each caller in order to answer the caller's questions. As a result, helpdesk operation can be quite expensive to maintain.
Much of the helpdesk operator's time is spent solving identical or nearly identical problems over and over again. A need arises for a technique by which the solutions to frequently recurring problems may be automated in order to improve the efficiency of helpdesk operation. In particular, what is needed is a technique that can aid in identification of helpdesk inquiry and problem categories that are most amenable to automated fulfillment or solution.
The present invention is useful in identifying candidate helpdesk problem categories that are most amenable to automated solutions. In a preferred embodiment, the present invention uses clustering techniques to identify collections of problems from free form text descriptions. It then facilitates a human user's modifications to the collections as appropriate to improve the coherence and usefulness of the classification. Measures such as the level of detail, the depth of search, the confidence level, and the overlap level are used to help the user determine which sets of examples are the best candidates to become FAQs.
The present invention describes a method, a system, and a computer program product for interactive classification and analysis. To carry out the method, a dictionary is first generated: each word in the text data set is identified, and the number of documents containing each word is counted. The most frequently occurring words in the corpus compose the dictionary. A count of occurrences of each dictionary word within each document in the document set is then generated. The count may be generated by building a matrix having rows and columns, each column corresponding to a word in the dictionary, each row corresponding to an example in the text corpus, and each entry representing the number of occurrences of the corresponding word in the corresponding example.
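By way of non-limiting illustration, the dictionary generation and the count matrix described above might be sketched as follows. The function names (tokenize, build_dictionary, count_matrix), the simple lowercase tokenizer, the alphabetical tie-breaking, and the sample documents are assumptions made for this sketch and are not taken from the specification.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumed tokenizer: lowercase, then split into runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(documents, size):
    # For each word, count the number of documents that contain it.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Keep the `size` words found in the most documents.
    # Ties are broken alphabetically (an assumption of this sketch).
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def count_matrix(documents, dictionary):
    # One row per example, one column per dictionary word.
    column = {word: j for j, word in enumerate(dictionary)}
    matrix = []
    for doc in documents:
        row = [0] * len(dictionary)
        for token in tokenize(doc):
            j = column.get(token)
            if j is not None:
                row[j] += 1
        matrix.append(row)
    return matrix


docs = [
    "Password reset fails after password change",
    "Cannot reset password",
    "Printer is offline",
]
dictionary = build_dictionary(docs, size=3)
matrix = count_matrix(docs, dictionary)
```

In this sketch an example that contains no dictionary word, such as the third sample document, yields an all-zero row.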
The set of examples may be partitioned into a plurality of clusters using a k-means partitioning procedure. The k-means partitioning procedure may include determining a distance between a centroid and an example vector using a distance function of:
d(X,Y) = −(X·Y)/(∥X∥·∥Y∥)
wherein X is the centroid, Y is the example vector, and d(X,Y) is the distance between the centroid and the example vector.
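As an illustrative sketch only, the distance function above and a basic k-means loop built on it might look like the following. The helper names (cosine_distance, assign, update_centroids, kmeans), the treatment of all-zero vectors, the use of a plain arithmetic mean for the centroid update, and the caller-supplied initial centroids are assumptions of this sketch; the specification does not prescribe them.

```python
import math


def cosine_distance(x, y):
    # d(X, Y) = -(X . Y) / (||X|| * ||Y||): -1 for identical directions,
    # 0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an all-zero vector is treated as orthogonal to everything.
        return 0.0
    return -dot / (norm_x * norm_y)


def assign(examples, centroids):
    # Label each example with the index of its nearest centroid.
    return [
        min(range(len(centroids)),
            key=lambda k: cosine_distance(centroids[k], example))
        for example in examples
    ]


def update_centroids(examples, labels, centroids):
    # Recompute each centroid as the mean of its members;
    # an empty cluster keeps its previous centroid.
    updated = []
    for k, previous in enumerate(centroids):
        members = [ex for ex, lab in zip(examples, labels) if lab == k]
        if not members:
            updated.append(previous)
        else:
            updated.append([sum(col) / len(members) for col in zip(*members)])
    return updated


def kmeans(examples, centroids, max_iter=100):
    # Alternate assignment and update until the labels stop changing.
    labels = None
    for _ in range(max_iter):
        new_labels = assign(examples, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(examples, labels, centroids)
    return labels, centroids
```

Because the function is the negated cosine similarity, smaller values indicate closer vectors, so the ordinary "assign to the minimum distance" rule of k-means applies unchanged.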
For each of the generated clusters, the present method sorts the dictionary terms in order of decreasing occurrence frequency within the cluster. It then determines a search space by selecting the top S (most frequent) dictionary terms, where S is a user-specified value specifying the depth of search. Next, it chooses a set of L terms from the search space, where L is a user-specified value indicating the desired level of detail.
For each possible combination of L terms in the search space, the present method finds the set of examples containing all L terms. If this set is not empty, and if the overlap between this set and all the other sets is less than an overlap value specified by user input, then this set of examples becomes a FAQ.
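The search over term combinations may be easier to follow with a short sketch. The following is one possible reading, not a definitive implementation: the names search_space and find_faqs are hypothetical, occurrence frequency is taken to be the total count of a term within the cluster, and overlap is assumed to be the fraction of a candidate set's examples that already belong to a previously accepted set. None of these choices is fixed by the text above.

```python
from itertools import combinations


def search_space(cluster_rows, dictionary, depth):
    # Rank dictionary columns by total occurrences within the cluster
    # (assumed meaning of "occurrence frequency") and keep the top `depth` (S).
    totals = [sum(row[j] for row in cluster_rows)
              for j in range(len(dictionary))]
    order = sorted(range(len(dictionary)), key=lambda j: (-totals[j], j))
    return order[:depth]


def find_faqs(cluster_rows, dictionary, depth, detail, max_overlap):
    # `detail` is L, the number of terms every example in a FAQ must contain.
    faqs = []
    columns = search_space(cluster_rows, dictionary, depth)
    for combo in combinations(columns, detail):
        members = {
            i for i, row in enumerate(cluster_rows)
            if all(row[j] > 0 for j in combo)
        }
        if not members:
            continue
        # Assumed overlap measure: share of this set already covered
        # by an earlier accepted set.
        if all(len(members & other) / len(members) < max_overlap
               for _, other in faqs):
            faqs.append(([dictionary[j] for j in combo], members))
    return faqs


terms = ["password", "reset", "printer", "offline"]
rows = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
faqs = find_faqs(rows, terms, depth=3, detail=2, max_overlap=0.5)
```

The number of combinations grows as S choose L, which is one reason the depth of search and the level of detail are left to the user.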
For each generated FAQ, the present method chooses a name based on the relevant terms in the order in which they occur most often in the text.
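The naming step admits more than one reading. The sketch below assumes that the name is formed by ordering the FAQ's terms in the sequence in which they most often first appear across the FAQ's examples. The function name faq_name, the tokenizer, and the fallback to the given term order are assumptions of this sketch.

```python
from collections import Counter
import re


def faq_name(terms, examples):
    term_set = set(terms)
    orderings = Counter()
    for text in examples:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Record the position at which each term first appears.
        first_pos = {}
        for pos, token in enumerate(tokens):
            if token in term_set and token not in first_pos:
                first_pos[token] = pos
        # Only examples containing every term vote on the ordering.
        if len(first_pos) == len(term_set):
            orderings[tuple(sorted(first_pos, key=first_pos.get))] += 1
    if not orderings:
        # Assumed fallback: keep the terms in the order given.
        return " ".join(terms)
    return " ".join(orderings.most_common(1)[0][0])


name = faq_name(
    ["password", "reset"],
    [
        "Please reset my password",
        "reset password fails",
        "password reset link broken",
    ],
)
```

Here two of the three sample texts mention "reset" before "password", so that ordering is chosen for the name.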