The analysis of telephone conversations between business representatives and customers at call centers or business branches has become increasingly important for business analytics. In particular, attention is being focused on the analysis of the part of the conversation that is unrelated to the business transaction (i.e. chat or small talk), rather than the essential part of the conversation (i.e. questions or explanations about a certain product). This focus arises because the off-topic part itself is thought to include information that may be used by the business, e.g. the hobbies, family structure, or work of the customer. It is thus important to extract the off-topic part from the conversation data, use the extracted off-topic part for profiling of the customer, categorize the off-topic part, and then tie the off-topic part to the next business activity.
Therefore, there exists considerable research on the extraction and analysis of topics from conversation data or document data, and on the topic-based segmentation of data that include various types of topics. For example, Non-Patent Literature 1 (see citation list below) discloses a latent Dirichlet allocation method. The expression “latent Dirichlet allocation method” here indicates a probabilistic model of the document generation process that is capable of expressing the fact that multiple topics are included in a single document. This technique considers the document as a collection of words and allocates a topic to each word unit.
Moreover, Non-Patent Literature 2 and Patent Literature 1 disclose techniques for detecting a change of topic with the passage of time. For such detection, Non-Patent Literature 2 discloses the introduction of a compound topic model (CTM), and Patent Literature 1 discloses the use of a mixed distribution model, expression of a topic generation model, and online training of the topic generation model while forgetting data more severely as the data become excessive.
Moreover, Non-Patent Literature 3 discloses technology for topic detection that acquires in real time a newly appearing topic expressed by a community. In this topic detection technology, the lifecycle of a word (term) is modeled according to an aging theory that takes into account the influence of the source.
Moreover, Patent Literature 2 exists as background technology that infers the topic of a conversation, i.e. the subject of the content of the conversation. Patent Literature 2 discloses technology that infers, as the subject of a conversation text, the conversation subject whose multiple words appear in a high proportion in the conversation text.
The background art of the aforementioned Non-Patent Literatures 1 to 3 and Patent Literature 1 is premised on the modeling of topics, or alternatively, on the assumption that at least a part (words) of the data is constructed from at least one specific topic and that such a part (word) occurs based on some sort of latent model. For this reason, the aforementioned background art cannot, of course, be used to detect the off-topic part, for which it is difficult to directly form a model or define a specific topic, and for which classification itself is difficult. Moreover, given the properties of the off-topic part, it is also difficult to use the technology of Patent Literature 2, which requires training data consisting of text and the topic of conversation that specifies the contents of such text.
Furthermore, in the explanation of the background art of Patent Literature 3, a TF-IDF model is introduced as technology for the extraction of important expressions from a document. In the TF-IDF model, terms appearing in many documents are considered to have low importance, and conversely, terms appearing in few documents are considered to have high importance. For each term, the number of documents that include the term is found within a corpus that includes the subject document, the inverse thereof is used as the degree of importance of the term within the corpus (IDF), and the product of the TF and the IDF, i.e. the TF-IDF, is used as the degree of importance of the term within a document. Therefore, use of the TF-IDF model has been considered for extraction of the off-topic part. That is to say, because the off-topic part is not related to the business transaction that is the primary conversation, the IDF value of its terms is anticipated to become high, and use of the TF-IDF value as an indicator for such extraction has been considered. Furthermore, the general definition of IDF is the logarithm of the inverse of the proportion of documents that include a subject term within the corpus including the target document.
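The TF-IDF calculation described above may be illustrated by the following minimal sketch. It is not taken from Patent Literature 3. It assumes that TF is the raw count of a term within a document, that IDF follows the general definition given above, and that the corpus includes the subject document. The function names `idf` and `tf_idf` and the sample corpus are hypothetical and serve only for illustration.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Log of the inverse of the proportion of documents containing the term."""
    # Number of documents in the corpus that include the term.
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(corpus) / doc_freq)


def tf_idf(doc, corpus):
    """TF-IDF of every term in doc, with TF taken as the raw count (assumption)."""
    counts = Counter(doc)
    return {term: count * idf(term, corpus) for term, count in counts.items()}


# Hypothetical corpus of tokenized conversations (illustrative data only).
corpus = [
    ["account", "transfer", "fee"],
    ["account", "fee", "golf", "golf"],
    ["account", "transfer", "limit"],
]

# Score the second conversation, which is itself part of the corpus.
scores = tf_idf(corpus[1], corpus)
```

In this sample, the term "account" appears in every document and therefore receives an IDF of zero, whereas the term "golf" appears in only one document and receives the highest score, which corresponds to the expectation that terms of the off-topic part have high IDF values.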