1. Technical Field
The present disclosure relates to classifying communications and more specifically to classifying, by topic, communications that have low lexical content and/or high contextual content.
2. Introduction
Understanding a user's context in a unified communication (UC) setting is difficult. A UC engine needs to analyze the user's communication data. However, the UC engine also needs some concept of similarity to retrieve the appropriate data in response to a new communication or activity. A context engine can use topic modeling to classify communication data, assuming unsupervised learning of topics from the user's data.
Topic modeling has been used for unsupervised learning to classify documents in a corpus, such as a collection of journals. After the modeling, documents in the corpus are categorized into groups based on their lexical content and can be retrieved later in many ways. For example, after classifying a large collection of journals, a user can search the corpus for documents on “photography”, which is different from searching for the literal string “photography” in documents. This search covers documents with a high probability of being in the “photography” category, or having “photography” as a keyword.
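The difference between a topic search and a literal string search can be illustrated with a minimal sketch. The corpus, the document identifiers, the topic probabilities, and the 0.5 threshold below are all hypothetical; the probabilities are assumed to have been inferred already by some topic model and are not computed here.

```python
# Hypothetical corpus: each document carries its text and per-topic
# probabilities assumed to come from a previously trained topic model.
corpus = {
    "doc1": {"text": "Choosing aperture and shutter speed for portraits",
             "topics": {"photography": 0.85, "travel": 0.15}},
    "doc2": {"text": "A history of photography exhibitions in Paris",
             "topics": {"photography": 0.40, "travel": 0.60}},
    "doc3": {"text": "Budget airlines and rail passes in Europe",
             "topics": {"photography": 0.05, "travel": 0.95}},
}


def literal_search(corpus, term):
    """Return ids of documents whose text contains the literal term."""
    term = term.lower()
    return sorted(doc_id for doc_id, doc in corpus.items()
                  if term in doc["text"].lower())


def topic_search(corpus, topic, threshold=0.5):
    """Return ids of documents whose probability for the topic meets the
    threshold, ordered from most to least probable."""
    scored = [(doc["topics"].get(topic, 0.0), doc_id)
              for doc_id, doc in corpus.items()]
    return [doc_id for prob, doc_id in sorted(scored, reverse=True)
            if prob >= threshold]


literal_hits = literal_search(corpus, "photography")
topic_hits = topic_search(corpus, "photography")
```

In this toy data the literal search finds only the document that contains the word, while the topic search finds the document that is mostly about photography even though the word never appears in it.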
However, the major problem remains that categorizing communication data in a UC setting for a context engine is extremely difficult. Unlike the data in classic uses of topic modeling, communication data has certain characteristics that make it difficult to categorize effectively. For example, unlike journal or news articles, books, and other lengthy documents, communication data such as email often has very low lexical content. An example of such communication data is an email that says “Here is an update” with an attachment. Lexically, this communication has low content, but in terms of fetching ‘relevant’ emails, the attachment could be highly relevant.
Further, communication data is heavily context based. An email (as well as other forms of communication data such as instant messages, call data, event information, etc.) can have lexical content that differs from prior communication(s) while assuming that the participants understand the context from those prior communication(s). An example of such a communication is “Here is an alternate approach”, where the content can discuss an alternate approach that has no lexical correlation with earlier communication. However, in terms of ‘relevance’, this email may belong to the same set of topics that were discussed in the prior context among the participants.
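A small sketch can make the difficulty concrete. It uses Jaccard overlap of word sets as a stand-in for a purely lexical similarity measure; that choice, and all three message texts, are assumptions made for illustration and do not come from any particular system.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between the word sets of two messages."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical messages: `reply` continues the conversation started by
# `prior`, while `unrelated` belongs to a different conversation.
prior = "Can we reduce latency in the call routing module by caching lookups"
reply = "Here is an alternate approach"
unrelated = "Here is an update on the holiday schedule"

same_context = jaccard(prior, reply)
different_context = jaccard(reply, unrelated)
```

With these messages the reply shares no words with the message it responds to, yet shares three words with the unrelated one, so a purely lexical measure would rank the unrelated message as the closer match.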
Some existing approaches to this problem include grouping emails and topic modeling of a corpus. Google Mail groups emails (in the beta version of its Priority Inbox) using user actions such as ‘reply/forward’ and user-specified labels/tags, and it threads emails. The goal of this approach is to minimize the emails in a user's inbox and to group them. The Google Mail approach does not rely on lexical aspects or latent topics in a user's email to group the mails and is limited to threading based on the above criteria.
Xobni provides a UC mash-up that brings all information to the user. The goal of Xobni is to provide a single interface for all of a user's emails and other data. Xobni does not group information based on the latent content of communication. In terms of scope, Xobni also tries to provide a user with as much information as possible.
Some approaches use topic modeling to analyze social networks and blogs. Their goals are to identify the topics being discussed in blogs and to identify the network of authors within the social networks. These approaches extend topic modeling by introducing an author and link formation model that shares the same (Dirichlet) parameter for inference.
They model the author-recipient-topic (ART) relationship as a Bayesian network that discovers discussion topics in social networks. Again, these works assume rich lexical content, which is absent in many email-style communications. Once the lexical content and the authorship are known, they focus on deriving relationships among the contributions of authors/participants to various topics. This does not solve the problem in a communication corpus where the lexical content is low and the dependence on prior context is high.
A topic modeling approach for emails focuses on understanding user interactions or roles. This approach primarily relies on lexical content to infer a latent topic and correlate it with user interactions. Its focus is to understand user activity, with no emphasis on inferring correct topics for emails with low lexical content or with high contextual content.