The present disclosure relates to data clustering, and more specifically, although not exclusively, to clustering text-based documents according to topic or theme.
Topic analysis aims to discover the underlying topics or themes of text-based documents. Topic analysis may be desirable in numerous applications, such as in document management and retrieval. For example, processes for disentangling interleaved messages, which are exchanged in a chat messaging system or the like, may use topic analysis to identify a subset of messages that form part of a common conversation relating to a particular topic or theme over a period of time. In another example, so-called “catch-up” services for users of chat messaging systems or the like may use topic analysis for identifying a subset of messages relating to a particular user-selected topic or theme, and provide the identified messages to the user for review at a later time. Such applications benefit from the restructuring of otherwise chronologically-ordered messages into groups of similar and/or related messages. This restructuring reduces the time, the power, and the communication, processing and storage resources used to provide users with relevant messages, since only a group of messages needs to be retrieved and communicated to each user's device.
Manual processes for topic analysis, which may involve the use of manually labeled training data, are extremely time consuming and, in consequence, impractical for many applications. Accordingly, automated techniques for topic analysis based on topic modeling have been the subject of research in recent years.
U.S. Pat. No. 6,393,460 B1 concerns a method for informing a user of topics of discussion in a recorded chat between two or more people. The method involves topic analysis including decomposing the chat into utterances made by the people involved in the chat, and clustering the utterances, using document clustering techniques, to identify elements in the utterances having similar content. Some or all of the identified elements are labeled as topics and presented to the user.
U.S. Pat. No. 8,055,592 B2 concerns clustering data objects represented as a variable length vector of 0 to N members. Importance values of at least one member in the data objects are calculated. A plurality of clusters containing one or more data objects is dynamically formed. A data object is associated with a cluster in dependence upon the at least one member's similarity in comparison to members in other data objects. The clustering method is applied to chat messages, represented by a vector of most important words, to form clusters of messages on chat topics.
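By way of illustration only, the general idea of representing each chat message by its most important words and forming clusters dynamically can be sketched as follows. This is a minimal sketch and not the method of the cited patent: the TF-IDF weighting, the Jaccard similarity measure, the greedy assignment, the threshold value and the names `top_words`, `jaccard` and `cluster` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def top_words(messages, k=3):
    """Represent each message by its k highest-weighted words.

    Assumption: importance is a smoothed TF-IDF weight; the cited
    patent does not necessarily use this weighting.
    """
    tokenized = [m.lower().split() for m in messages]
    n_docs = len(tokenized)
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        weight = {
            w: tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w in tf
        }
        # Sort by descending weight, then alphabetically for determinism.
        ranked = sorted(weight, key=lambda w: (-weight[w], w))
        vectors.append(set(ranked[:k]))
    return vectors


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(vectors, threshold=0.2):
    """Greedy dynamic clustering (hypothetical scheme for illustration).

    Each message joins the most similar existing cluster if the
    similarity reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each entry: {"members": [indices], "words": set}
    for i, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for c in clusters:
            sim = jaccard(vec, c["words"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["words"] |= vec
        else:
            clusters.append({"members": [i], "words": set(vec)})
    return [c["members"] for c in clusters]


# Made-up interleaved chat messages on two topics.
messages = [
    "lunch pizza today",
    "deploy server tonight",
    "pizza lunch again",
    "server deploy failed",
]
groups = cluster(top_words(messages))
print(groups)
```

In this toy input, the messages about lunch and the messages about the server deployment share two of their three words, so each pair falls into its own cluster. Real chat messages share far fewer words, which is one reason such approaches may produce non-homogeneous groups.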
Journal of Machine Learning Research 3 (2003) 993-1022, David M. Blei et al., entitled “Latent Dirichlet Allocation”, concerns an approach to topic analysis. In particular, it describes a generative probabilistic topic model for collections of discrete data such as text corpora, called “Latent Dirichlet Allocation” (LDA). LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The LDA model may be used for unsupervised clustering of documents according to the topics of their content (e.g., documents having similar relevant keywords are grouped together).
For example, a corpus of documents may be analyzed by LDA for maximum likelihood fit for a predefined number of topics. A plurality of topics may be discovered, each topic comprising a list of representative keywords (i.e., “topic terms”) and each keyword having a corresponding Maximum Likelihood Estimation (MLE) score (also known as “Log likelihood value”). Typically, the representative list of keywords of a topic comprises the top N keywords ranked by MLE score, and the predefined number of topics are selected according to the sum of the MLE scores of keywords in their representative lists.
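The ranking and selection step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the per-keyword scores are invented log-likelihood-style values rather than the output of an actual LDA fit, and the names `top_keywords` and `select_topics` are hypothetical.

```python
def top_keywords(term_scores, n):
    """Return the n (keyword, score) pairs with the highest scores."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def select_topics(topics, n_keywords, n_topics):
    """Rank topics by the sum of the scores of their top keywords.

    Returns a list of (topic_id, score_sum, keywords) tuples for the
    n_topics highest-ranked topics.
    """
    summaries = []
    for topic_id, term_scores in topics.items():
        keywords = top_keywords(term_scores, n_keywords)
        total = sum(score for _, score in keywords)
        summaries.append((topic_id, total, [word for word, _ in keywords]))
    summaries.sort(key=lambda s: (-s[1], s[0]))
    return summaries[:n_topics]


# Invented scores for illustration only (higher means more likely).
topics = {
    "t0": {"server": -1.0, "deploy": -1.5, "lunch": -6.0},
    "t1": {"pizza": -2.0, "lunch": -2.5, "server": -7.0},
    "t2": {"today": -4.0, "again": -4.5, "failed": -5.0},
}
selected = select_topics(topics, n_keywords=2, n_topics=2)
for topic_id, total, keywords in selected:
    print(topic_id, total, keywords)
```

With these invented values, topics `t0` and `t1` are selected because their top two keywords have the highest score sums, while `t2` is discarded.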
Whilst the above-described methods can be used to identify underlying topics in a corpus of text documents, the results may be imprecise, yielding non-homogeneous groups of documents. For example, in the case of disentanglement of chat messages, the described techniques may not accurately identify all the messages of a particular conversation, or, conversely, may identify messages that are part of a different conversation. Thus, structuring the documents into groups according to topics derived using existing topic modeling techniques may lead to a user receiving large numbers of irrelevant documents, which unnecessarily increases the use of power, as well as communication, processing and storage resources, of the user's device and the associated communication network. Equally, structuring the documents into groups according to topics using existing topic modeling techniques may lead to a user not receiving all the relevant documents, which may lead to the user communicating additional requests for missing documents and then receiving another group of messages of which only one or a few may be relevant.
Accordingly, conventional techniques for topic analysis based on topic modeling for document management and retrieval are resource intensive and lead to inaccurate and imprecise results. In consequence, the provision of relevant documents to a user may be unnecessarily time consuming and utilize unnecessary amounts of power, as well as communication, processing and storage resources, of the user's device and communication network.