User-generated content (UGC) is one of the fastest-growing areas of Internet use. Such content includes social question answering, social bookmarking, social networking, social video sharing, social photo sharing, and the like. UGC websites or portals providing these services not only connect users directly with the information they need, but also change everyday users from content consumers into content creators.
One type of UGC portal that has become very popular in recent years is the community question-answering site (CQA site). CQA sites currently available on the World Wide Web attract a large number of users both seeking and providing answers to a variety of questions on a variety of subjects. For example, as of December 2007, one popular CQA site, Yahoo!® Answers, had attracted 120 million users worldwide and had 400 million answers to questions available. A typical characteristic of such CQA sites is that they allow anyone to post or answer any question on any subject. However, the openness of these sites predictably leads to a high variance in the quality of the answers. For example, since anyone can post any content they choose, some users produce content spam for fun or for profit. Thus, the ability, or inability, to obtain a high-quality answer has a significant impact on users' satisfaction with such sites.
Distinguishing high-quality answers from other answers on CQA sites is not a trivial task. For example, a lexical gap typically exists between a question and a high-quality answer that responds to the question. This lexical gap in community question answering may be caused by at least two factors: (1) a textual mismatch between questions and answers; and (2) user-generated spam or flippant answers. In the first case, user-generated questions and answers are generally short, and the words that appear in a question are not necessarily repeated in high-quality answers that respond to the question. Moreover, a word itself can be ambiguous or have multiple meanings, e.g., “apple” can refer to “apple computer”, “apple the fruit”, etc. Also, different words can be used to represent the same concept, e.g., “car” and “automobile”. In the second case, user-generated spam and flippant answers can have a negative effect by greatly increasing the number of answers to each question, thereby generating “noise” and increasing the complexity of identifying a high-quality answer from among numerous other answers.
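As a minimal illustration of the textual mismatch described above, the following Python sketch scores two candidate answers by simple word overlap with the question. The function names `tokenize` and `jaccard_overlap`, the choice of the Jaccard measure, and the sample question and answers are all assumptions introduced here for illustration; they are not drawn from any particular CQA site or ranking technique.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_overlap(question, answer):
    """Return |Q ∩ A| / |Q ∪ A| for the word sets of a question and an answer."""
    q, a = tokenize(question), tokenize(answer)
    if not q or not a:
        return 0.0
    return len(q & a) / len(q | a)


# Hypothetical sample data, chosen only to illustrate the lexical gap.
question = "How do I fix my car?"
good_answer = "Take the automobile to a licensed mechanic for inspection."
flippant_answer = "Fix my car? How do I know, lol."

print(jaccard_overlap(question, good_answer))      # 0.0
print(jaccard_overlap(question, flippant_answer))  # 0.75
```

Under these assumptions, the helpful answer shares no words with the question because it uses “automobile” rather than “car”, while the flippant answer, which merely echoes the question, scores far higher. A ranking based on word overlap alone would therefore prefer the wrong answer.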
To bridge the lexical gap for better answer ranking, various techniques have been proposed. Conventional techniques for filtering answers primarily focus on generating complementary features provided by highly structured CQA sites, or on finding textual clues using machine-learning techniques. For example, some conventional techniques enrich textual features with additional non-textual features, e.g., an answerer's category specialty, a questioner's self-evaluation, users' votes, and the like.