1) Field of the Invention
The present invention relates to a technology for efficiently extracting frequently appearing queries from a large number of queries, in a support center or the like where many queries from users are collected.
2) Description of the Related Art
In searches relating to question answering in a call center or the like, it is important to efficiently reuse past answers. For example, by presenting frequently asked questions and their answers to users as FAQs (Frequently Asked Questions), users can solve problems by themselves.
By preparing such FAQs beforehand, it is possible for the call center or the like to eliminate the question answering work by an operator, thereby enabling reduction in the operation cost.
In the conventional method, manual operations are required to organize the FAQs and to select the FAQs considered to appear frequently, based on correspondence by mail or phone calls to the call center.
As a method for automating FAQ preparation, a method has recently been used in which documents having a high similarity are clustered by calculating the similarity between documents (queries), using a criterion referred to as the Cosine measure. Clustering is a technique for extracting candidates of frequently appearing question examples from a large number of question answering logs in a practical period of time.
An example of determining the similarity in a document set (document 1 to document 4) shown in FIG. 22 will be explained below. Documents 1 to 4 are question examples from users sent to a call center of a communication service system.
First, with the Cosine measure, each document is regarded as a vector, and the inner product of two document vectors, divided by the product of their lengths, is taken as the similarity between the documents. In the document set shown in FIG. 22, if the vector of each document is assumed to be formed of the underlined words, the similarity between documents (inner product value) is calculated as shown below. As the inner product value increases, the distance between documents becomes shorter, and the similarity increases.
similarity between document 1 and document 1 = 1/(√1 × √1) = 1
similarity between document 3 and document 3 = 6/(√(1+1+1+1+1+1) × √(1+1+1+1+1+1)) = 1
similarity between document 1 and document 3 = 1/(√1 × √(1+1+1+1+1+1)) = 1/√6 = 5/√30
similarity between document 3 and document 4 = 3/(√(1+1+1+1+1) × √(1+1+1+1+1+1)) = 3/√30
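The form of this calculation can be sketched in Python as follows. This is only a minimal illustration, assuming that each document is reduced to a set of distinct words so that every word frequency is 1. The function name `cosine_similarity` and the word lists are hypothetical placeholders, since the actual words of FIG. 22 are not reproduced here; only the word counts (six words, five words, three shared) follow the formulas above.

```python
import math

def cosine_similarity(words_a, words_b):
    """Cosine measure for documents represented as sets of distinct words."""
    a, b = set(words_a), set(words_b)
    if not a or not b:
        return 0.0
    # With 0/1 term weights, the inner product is the number of shared words,
    # and each vector length is the square root of the document's word count.
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

# Placeholder words (assumption); not the actual content of FIG. 22.
doc3 = ["w1", "w2", "w3", "w4", "w5", "w6"]  # six words
doc4 = ["w1", "w2", "w3", "w7", "w8"]        # five words, three shared with doc3

print(cosine_similarity(doc3, doc3))  # identical documents: 6 / (sqrt(6) * sqrt(6))
print(cosine_similarity(doc3, doc4))  # 3 / (sqrt(5) * sqrt(6))
```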
As described above, with the conventional method, the similarity (5/√30) between document 1 and document 3 shown in FIG. 22 is calculated to be higher than the similarity (3/√30) between document 3 and document 4, but this result is contrary to intuition. Intuitively, the similarity between document 3 and document 4 feels higher than the similarity between document 1 and document 3.
With the Cosine measure, when the size of the documents subject to similarity calculation varies little, the similarity can be calculated with high accuracy. An example of documents whose size varies little is the abstracts of papers.
On the other hand, when the document size varies widely and the frequency of appearance of most words is 1, as with the above-described queries, the Cosine measure has the disadvantage that the similarity of a document having a large document size becomes unreasonably low.
Therefore, the conventional method has a problem in that, when the similarity obtained by the conventional Cosine measure is used directly to perform clustering, clusters of short documents grow unreasonably while clusters of long documents do not grow, and hence balanced clustering cannot be performed and desirable results cannot be obtained.
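The dependence on document size described above can be illustrated with a short Python sketch. It assumes that every word appears once, so the Cosine measure reduces to the number of shared words divided by the square root of the product of the two word counts. The function name `binary_cosine` and the document lengths are hypothetical values chosen only for illustration.

```python
import math

def binary_cosine(shared, len_a, len_b):
    """Cosine measure when every word occurs once: shared / sqrt(len_a * len_b)."""
    return shared / math.sqrt(len_a * len_b)

# Two documents always share three words; only their total length grows.
for length in (3, 6, 12, 24):
    print(length, binary_cosine(3, length, length))
```

The number of shared words stays the same in every row, yet the computed similarity falls as the documents become longer.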
In order to solve the above problem, conventional information search uses a tf·idf method, which obtains the similarity between documents by using the significance of each word as a weight. The idf is an abbreviation of inverse document frequency; it is obtained from the inverse of the proportion of documents in the whole document set that include a word, and expresses the amount of information that the word itself carries. Here, if it is assumed that the total number of documents (not shown) in the document set shown in FIG. 22 is 1024, the number of documents that include “connect” is 512, the number of documents that include “ISDN” is 8, and the number of documents that include “set” is 256, then the idf of “connect” becomes 1, the idf of “ISDN” becomes 7, and the idf of “set” becomes 2. These values correspond to the base-2 logarithm of the total number of documents divided by the number of documents that include the word. In this example, the vector size is not normalized by idf.
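The idf values in this example can be checked with the following Python sketch. The base-2 logarithm is an assumption inferred from the figures given (1024/512, 1024/8, and 1024/256 yield 1, 7, and 2); the description does not state the formula explicitly, and the function name `idf` is a hypothetical helper.

```python
import math

def idf(total_docs, docs_with_word):
    """Inverse document frequency, assumed here to be log2(total / containing)."""
    return math.log2(total_docs / docs_with_word)

TOTAL_DOCS = 1024  # assumed size of the document set, as in the example
doc_freq = {"connect": 512, "ISDN": 8, "set": 256}

weights = {word: idf(TOTAL_DOCS, count) for word, count in doc_freq.items()}
print(weights)
```

A word that appears in half of all documents, such as “connect”, receives a small weight, while a rare word such as “ISDN” receives a large one.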
The results obtained by recalculating the above similarity by taking this idf into account are shown below.
similarity between document 1 and document 1 = 1/(√1 × √1) = 1
similarity between document 1 and document 3 = 1/(√1 × √(1+1+1+1+1+1)) = 1/√6 = 5/√30
similarity between document 3 and document 4 = (1+7+2)/(√(1+1+1+1+1) × √(1+1+1+1+1+1)) = 10/√30
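The idf-weighted calculation for document 3 and document 4 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the numerator sums the idf of each shared word, while the denominator keeps the unweighted vector sizes, matching the statement that the vector size is not normalized by idf. The function name `idf_weighted_similarity` and the filler words `p1` to `p5` are hypothetical; only “connect”, “ISDN”, and “set” and their idf values come from the example.

```python
import math

# idf values taken from the example (connect = 1, ISDN = 7, set = 2).
IDF = {"connect": 1.0, "ISDN": 7.0, "set": 2.0}

def idf_weighted_similarity(words_a, words_b, idf):
    """Sum of idf over shared words, divided by the unweighted vector sizes."""
    a, b = set(words_a), set(words_b)
    # Every shared word must have an idf entry; a missing one raises KeyError.
    numerator = sum(idf[word] for word in a & b)
    return numerator / (math.sqrt(len(a)) * math.sqrt(len(b)))

# Filler words p1..p5 are placeholders (assumption), not the text of FIG. 22.
doc3 = ["connect", "ISDN", "set", "p1", "p2", "p3"]  # six words
doc4 = ["connect", "ISDN", "set", "p4", "p5"]        # five words

print(idf_weighted_similarity(doc3, doc4, IDF))  # (1 + 7 + 2) / (sqrt(6) * sqrt(5))
```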
As seen from the above calculation results, the similarity between document 3 and document 4 (10/√30) becomes higher than the similarity between document 1 and document 3 (5/√30), which agrees with intuition. However, since the calculation result of this method is a relative value, there is a disadvantage in that similarities between different documents cannot be directly compared, even if the text size is normalized by idf.