Various approaches exist for automated categorization (or classification) of texts into predefined categories. One approach uses machine learning: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of pre-classified documents that are represented as vectors of key terms. Various machine-learning techniques may be employed. In one approach, for each category, a set of human-labeled examples is collected as training data in order to build classifiers, such as Decision Tree classifiers, Naive Bayes classifiers, Support Vector Machines, Neural Networks, or the like. A separate classifier typically must be built for each new category. Such approaches also may not scale well when processing a large quantity of documents. For example, to add a new category, a new classifier may need to be built, and every document may then need to be run through the resulting classifier.
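The scaling concern described above can be illustrated with a minimal sketch. It is not any particular prior-art system: the `TermWeightClassifier` class, the `classify_all` helper, and the sample documents are all hypothetical, and a simple term-weight scorer stands in for the Decision Tree, Naive Bayes, or other classifiers named above. The sketch only shows that each category needs its own trained classifier and that adding a category means another pass over every document.

```python
from collections import Counter


def vectorize(text):
    """Represent a document as a vector of key terms (term -> count)."""
    return Counter(text.lower().split())


class TermWeightClassifier:
    """Toy binary classifier for ONE category (hypothetical stand-in)."""

    def __init__(self, positives, negatives):
        # Learn a weight per term from human-labeled examples:
        # +1 for each positive document containing it, -1 for each negative.
        self.weights = Counter()
        for doc in positives:
            for term in vectorize(doc):
                self.weights[term] += 1
        for doc in negatives:
            for term in vectorize(doc):
                self.weights[term] -= 1

    def predict(self, text):
        # A document belongs to the category if its weighted score is positive.
        score = sum(self.weights[t] * n for t, n in vectorize(text).items())
        return score > 0


def classify_all(corpus, classifiers):
    """Run every document through every per-category classifier."""
    return {
        category: [doc for doc in corpus if clf.predict(doc)]
        for category, clf in classifiers.items()
    }


# Hypothetical labeled training examples.
sports_docs = ["the team won the game", "the player scored in the final game"]
business_docs = ["stocks fell as the market closed", "the bank raised interest rates"]

# Hypothetical unlabeled corpus.
corpus = ["the player won the match", "the market rallied as rates fell"]

classifiers = {"Sports": TermWeightClassifier(sports_docs, business_docs)}
labels = classify_all(corpus, classifiers)

# Adding a category requires building a new classifier from new labeled data,
# then running every document in the corpus through it.
classifiers["Business"] = TermWeightClassifier(business_docs, sports_docs)
labels = classify_all(corpus, classifiers)
```

In this sketch the number of `predict` calls grows as the number of documents times the number of categories, which is one way to read the statement that such approaches may not scale well.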
In addition, various approaches exist for providing computer-generated news Web sites. One approach aggregates headlines from news sources worldwide, and groups similar stories together. The stories are grouped into a handful of broad, statically defined categories, such as Business, Sports, Entertainment, and the like. In some approaches, the presentation of news items may be customized, such as by allowing users to specify keywords to filter news items. However, such a keyword-based approach to customization may be limited because it can be difficult or impossible to express higher-order concepts with simple keywords. For example, if a user wishes to obtain articles about NBA basketball players, the term “NBA” may yield an over-inclusive result set that includes many articles that do not mention any basketball players. On the other hand, the terms “NBA basketball player” may yield an under-inclusive result set that omits articles that mention some NBA basketball player by name but do not include the specified keywords.
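The over-inclusive and under-inclusive behavior of keyword filtering can be sketched as follows. This is an illustrative assumption about how such a filter might work (every keyword must appear as a term in the article), not a description of any specific news site. The `keyword_filter` function, the headlines, and the player name "Pat Example" are hypothetical.

```python
import re


def terms(text):
    """Split text into a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_filter(articles, keywords):
    """Keep only articles that contain every specified keyword."""
    wanted = {k.lower() for k in keywords}
    return [a for a in articles if wanted <= terms(a)]


# Hypothetical headlines; "Pat Example" stands in for an NBA player's name.
articles = [
    "NBA announces new television deal",
    "Pat Example scores 40 points in overtime win",
    "NBA basketball player signs contract extension",
]

# Over-inclusive: matches the television-deal story, which mentions no player.
broad = keyword_filter(articles, ["NBA"])

# Under-inclusive: misses the story that names a player without the keywords.
narrow = keyword_filter(articles, ["NBA", "basketball", "player"])
```

Neither query returns the second headline, because the concept "NBA basketball player" is expressed there only through a name that the keywords do not contain.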