A challenge facing data management systems that manage text written by humans or generated by computers, such as articles, web pages, survey responses, electronic mail messages, support documents, books, and so forth, is identifying what the textual data is about. In other words, the challenge involves identifying an accurate set of one or more topics for each item of textual data. Once items of textual data have been categorized into various topics, a data management system can use this categorization to perform various tasks with respect to the textual data, such as deciding where to store the textual data items, searching for information, or other tasks.
Conventionally, classifiers have often been used to select one or more topics, from a set of possible topics, to assign to each item of textual data. However, classifier-based techniques for assigning topics to items of textual data are associated with various drawbacks that can make the resulting classifications inaccurate.