Topic classification systems are a class of machine learning tools designed to classify media based on information that has been extracted from the media. When topic classification systems are applied to the area of natural language processing, natural language inputs are classified and labeled based on the classes or topics that are found within the inputs. Typically, natural language inputs include text intervals. Text intervals are spans of text that need not be well-formed sentences and can come from a variety of sources, such as newspaper articles, books, e-mail, web articles, etc. For example, if the topic within a particular text interval is determined to be “the looting of Iraqi art galleries in 2003”, a number of labels can be assigned, such as Iraqi art galleries, looting in Iraq in 2003, etc.
Although typical topic classification systems classify a large number of text intervals, the labels that are assigned to each text interval generally need to be defined in advance. For example, a database stores searchable text intervals, in which each text interval has been assigned pre-defined labels organized into a topic or keyword listing. When a user performs a database query using several keywords, the system produces a set of candidate text intervals that have labels containing one or more of those keywords.
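The keyword lookup over pre-defined labels described above can be sketched as follows. This is only a minimal illustration, not the implementation of any particular system: the `LABELED_INTERVALS` store, its contents, and the `find_candidates` function are hypothetical names introduced here, and matching is assumed to be done on whole label words.

```python
from typing import Dict, List, Set

# Hypothetical store: each text interval is paired with labels assigned in advance.
LABELED_INTERVALS: Dict[str, Set[str]] = {
    "Many works of art were stolen from Baghdad galleries in 2003.": {
        "Iraqi art galleries",
        "looting in Iraq in 2003",
    },
    "The stock market closed higher on Friday.": {
        "stock market",
        "finance",
    },
}


def find_candidates(keywords: List[str], store: Dict[str, Set[str]]) -> List[str]:
    """Return the text intervals whose labels contain at least one query keyword."""
    wanted = {keyword.lower() for keyword in keywords}
    candidates = []
    for text, labels in store.items():
        # Collect every word that appears in any label of this interval.
        label_words = {word for label in labels for word in label.lower().split()}
        if wanted & label_words:
            candidates.append(text)
    return candidates


# A query with the keywords "looting" and "museums" matches the first interval
# because "looting" appears in one of its pre-defined labels.
matches = find_candidates(["looting", "museums"], LABELED_INTERVALS)
```

In this sketch the lookup never reads the text interval itself; it succeeds or fails purely on the labels that were assigned in advance.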
However, in a situation where the system has no prior knowledge of the labels of the text intervals, the system needs to parse through a text interval to determine its labels. For example, if text intervals are provided at run-time via natural language or free-text formulations, the topic classification system can no longer rely on predefined labels to locate similar text intervals. An example of a free-text formulation of a query is “List facts about the widespread looting of Iraqi museums after the US invasion.” Free-text queries differ from structured queries such as database queries where query terms need to be explicitly provided.
In order to respond to these free-text queries, the topic classification system needs to analyze various text intervals in a natural language document and determine whether the candidate responses are on-topic with the free-text queries. Although the topic classification system can match a free-text query to a candidate response if the query is simple and specific, the system is limited with respect to matching a free-text query that contains a lengthy or complex description of some topic or event. In addition, human language makes it possible to convey on-topic information without using words that were used in the actual topic formulation. For example, given a free-text query of “List facts about the widespread looting of Iraqi museums after the US invasion,” an on-topic response would be “Many works of art were stolen from Baghdad galleries in 2003.” Furthermore, the presence of topic words in a sentence does not guarantee that the response will be relevant. For example, given the same query as before, an off-topic response would be “There were no known instances of looting of Iraqi museums before the U.S. invasion.” In accordance with the present invention, a topic classification system that provides for matching complex free-text queries to candidate responses is provided.
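The limitation of word matching noted above can be illustrated with the example query and the two example responses. The following is a minimal sketch of a naive word-overlap comparison, shown only to make the problem concrete; it is not the matching approach of the present invention. The `STOPWORDS` list, the tokenization rule, and the function names are assumptions introduced for this illustration.

```python
import re
from typing import Set

# Hypothetical, deliberately small stopword list for this illustration only.
STOPWORDS = {
    "the", "of", "in", "a", "an", "from", "were",
    "about", "after", "before", "there", "no",
}


def content_words(text: str) -> Set[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def shared_words(query: str, response: str) -> Set[str]:
    """Return the content words that the query and the response have in common."""
    return content_words(query) & content_words(response)


QUERY = "List facts about the widespread looting of Iraqi museums after the US invasion."
ON_TOPIC = "Many works of art were stolen from Baghdad galleries in 2003."
OFF_TOPIC = "There were no known instances of looting of Iraqi museums before the U.S. invasion."

# The on-topic response shares no content words with the query,
# while the off-topic response shares four of them.
on_topic_overlap = shared_words(QUERY, ON_TOPIC)
off_topic_overlap = shared_words(QUERY, OFF_TOPIC)
```

Under these assumptions, a score based on shared words alone would rank the off-topic response above the on-topic one, which is the failure the paragraph above describes.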