The present invention relates generally to the field of unstructured text analysis, and more particularly to determining the relevancy of user communications about unstructured information (e.g., articles, videos, etc.).
Many web sites solicit comments (communications) from readers about published articles (information). The communications enable readers to contribute additional information, typically in the form of posted comments. The communications generally contain text providing the reader's impression, opinion, or feedback about the published information. For example, an article published about an individual suffering from a serious illness may prompt feedback from the readership about the illness. Unfortunately, many of the hundreds of communications received may be irrelevant to the topic (e.g., off topic or containing advertisements), or may contain duplicate information. Readers may continue to post communications about the published information, but will be unwilling to read through hundreds of irrelevant communications to find the relevant ones. For a small number of communications, authors may remove communications through manual, subjective censorship, but this activity does not scale well and is impractical for publications with hundreds of communications.