Threaded discussions are a popular option for web users to exchange opinions and share knowledge. These threaded discussions include thousands of web forum sites, mailing lists, chat rooms, blogs, instant messaging groups, and so forth. A threaded discussion is a tool to facilitate collaborative content contributions. With millions of users contributing to these threaded discussions, the result is a vast accumulation of highly valuable knowledge and information on a variety of topics. These topics include recreation, sports, games, computers, art, society, science, home, health, and especially topics related to our daily lives which are rarely seen in traditional web pages.
As a result of the popularity of threaded discussions, there have been increased research efforts on mining information from online discussion threads. A discussion thread usually originates from a root post by a thread starter. FIG. 1 is an example of the semantics and structure of a typical threaded discussion. In particular, note that in FIG. 1 a threaded discussion 100 contains seven posts. The first post 110 is a piece of news about the release of “SilverLight 2.0 version.” Some users comment on this post, such as in the second post 115 and the third post 120, which are about the “update time.” Some users have further questions and initiate sub-discussions, such as in the fifth post 125, the sixth post 130, and the seventh post 135, which are about “Javascript communication.” As shown in FIG. 1, others troll or complain in some posts, such as in the fourth post 140.
As more users join the threaded discussion and make comments, the discussion thread grows. This forms a nested dialogue structure 150 that can be seen on the left side of FIG. 1. Furthermore, threaded discussion 100 exhibits rich complexity in the semantics (or topics) 160. Since users typically respond to others, previous posts affect later posts and cause topic drift in a discussion thread. In particular, as shown on the right side of FIG. 1, the topic has drifted from a first topic (“Silverlight 1.0”) 165 to a second topic (“junk”) 170 to a third topic (“Javascript”) 175.
Mining discussion threads is a challenging problem. One reason is that posts in a discussion thread are temporally dependent upon each other. A newcomer to the discussion may read some of the previous messages before posting. Replies indicate sharing of topics and vice versa. Thus, by nature a post is a mixture and mutation of previous posts. Unfortunately, such specific orderings and intra-dependencies of posts in a single thread are neglected by most existing research methods. Another reason is that while discussion threads are designed to encourage content distribution and contribution, they sometimes become targets of spammers. Meanwhile, some messages contain no useful information or are casual chitchat and thus need not be analyzed. Posts containing useless information, spam, or chitchat are regarded as junk posts. Junk posts may disturb content analysis. A third reason is that it is very hard to estimate the quality of a post. Generally, valuable posts tend to be long and meaningless posts tend to be short. However, this does not always hold true. There are a remarkable number of long, meaningless posts that are not meant to help others, while there are some short, insightful posts that inspire a great many people. Thus, measurements based solely on content length or content relevance usually do not work.
Although previous research efforts have made progress in many information retrieval scenarios, few of them are suitable for mining online discussion threads. Current work in mining discussion threads generally can be classified into two categories: (1) semantics-based techniques; and (2) linking structure-based techniques.
Many techniques exist that use semantic models for discussion thread analysis. One class of techniques is probabilistic topic models. Their main idea is to project documents onto some latent topic space. However, most topic models assume the documents in a collection to be exchangeable. In other words, their probabilities are invariant to permutation. This is contrary to the reply relationships among posts. Another class of techniques decomposes documents into a small number of topics, which are distributions over words. In one such technique, each document is produced by choosing a distribution over topics, with a Dirichlet prior, and each word is sampled from a multinomial of topic-word association. Some work has been proposed to extend these techniques to model multiple relationships, such as authorship and email. Other techniques attempt to model the background, topic, and document-specific words simultaneously. Some recent techniques model time dependency among documents, either by modeling the dependency across discrete time periods or by considering time to be continuous. However, these models only consider the topic drift between two adjacent time periods, which is not suitable for the hierarchical intra-dependency of posts in a single discussion thread. In general, the main drawback of using semantic models for discussion thread analysis is that they capture only the semantic information and ignore the temporal structural information.
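The generative process described above (a per-document distribution over topics drawn from a Dirichlet prior, then each word sampled from a topic-word multinomial) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular prior technique: the function names, the toy topic-word tables, and the prior values are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one bag-of-words document (hypothetical helper).

    topic_word: one dict per topic mapping word -> probability (toy values).
    alpha:      Dirichlet prior parameters, one per topic (assumed values).
    """
    # Per-document distribution over topics, drawn from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Sample the word from that topic's multinomial over the vocabulary.
        vocab = list(topic_word[z])
        probs = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


if __name__ == "__main__":
    rng = random.Random(0)
    toy_topics = [
        {"silverlight": 0.5, "release": 0.3, "update": 0.2},
        {"javascript": 0.6, "script": 0.2, "call": 0.2},
    ]
    theta, words = generate_document(toy_topics, [0.5, 0.5], 10, rng)
    print(theta, words)
```

Nothing in this sketch depends on the order in which documents are generated, and no document depends on any other document. That is the exchangeability assumption noted above, and it is why such a model cannot by itself represent a reply that builds on an earlier post.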
There also are several techniques that use structure models for discussion thread analysis. These techniques attempt to identify the importance of the content of each document. The structure model is generally a collection of documents whose linkages are constructed as a graph. The nodal importance or nodal quality can be estimated by the structural centrality of the nodes in the graph, where the importance refers to authority, popularity, expertise, or impact in various applications. Some of these techniques are carried out in an iterative manner, propagating the authority and hub scores of one node to another. One such technique uses a damping factor to simulate the random walks of a surfer who is continuously jumping from one web page to another linked page with a uniform probability. Although structural models have been applied with remarkable success in different domains, they are not suitable for analyzing threaded discussions. This is because these techniques rely heavily on the link structure among documents, while there is usually no explicit link structure among the posts in a discussion thread.
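The damped random-walk idea described above can be illustrated with a short power-iteration sketch in the style of the widely known PageRank formulation. The text does not name a specific algorithm, so the function name, the damping value of 0.85, the iteration count, and the handling of nodes without outgoing links are all assumptions made for illustration.

```python
def damped_random_walk_scores(links, damping=0.85, iterations=50):
    """Estimate node importance on an explicit link graph (hypothetical helper).

    links: dict mapping each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    if not nodes:
        return {}
    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random node.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Otherwise the surfer follows one outgoing link, chosen uniformly.
                share = damping * scores[node] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Assumption: a node with no outgoing links spreads its score evenly.
                share = damping * scores[node] / n
                for target in nodes:
                    new_scores[target] += share
        scores = new_scores
    return scores


if __name__ == "__main__":
    toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    print(damped_random_walk_scores(toy_links))
```

The sketch takes an explicit `links` mapping as input. Posts in a discussion thread usually provide no such mapping, which is the limitation noted above.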
The semantics and structure of a threaded discussion are highly dependent on each other. In particular, when the semantics evolve the dialogue structure changes, and vice versa. This is the nature of discussion threads. Most previous research efforts cannot solve this problem directly because they take either a purely semantic-centric view or a purely structure-centric view. There is little previous work in mining threaded discussions using semantic and structure models simultaneously. The closest techniques to modeling semantics and structure simultaneously merely implement the semantic decomposition and structure reconstruction in two phases. However, this conflicts with the above observation that the structure of a discussion thread usually changes along with its semantics. Consequently, the reconstructed structure generated by these techniques is not consistent with the evolving nature of semantics in threaded discussions.