This specification relates to constructing text classifiers.
In Web search, in advertising, or for special content providers, documents (e.g., Web pages and Web sites) can be given a high value if they are associated with a particular topic of interest and a low value if they are associated with an irrelevant or offensive topic. A topic can be a subject, theme, or category of interest, for example, “baseball,” “politics,” or “weather.”
Thus, it is useful to be able to classify documents (e.g., particular Web pages or Web sites as a whole) as belonging to certain topics. One conventional technique for classifying documents is to use a linear classifier that operates on the document text. A linear classifier includes a number of phrases known to be indicative of a given topic and a value for each of the phrases. A document is classified as belonging to the topic in question if the sum of the values for all of the phrases occurring in the document exceeds a specified threshold.
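By way of illustration only, the conventional linear classifier described above might be sketched as follows. The function name `classify`, the phrase values, the threshold, and the use of case-insensitive substring matching to decide whether a phrase occurs in a document are all assumptions made for this example and are not taken from any particular implementation.

```python
# Hypothetical phrase values for the topic "baseball" (illustrative only).
PHRASE_VALUES = {
    "home run": 2.0,
    "pitcher": 1.5,
    "inning": 1.0,
    "stock market": -2.0,  # a phrase suggesting a different topic
}

# Hypothetical threshold that the summed values must exceed.
THRESHOLD = 2.5


def classify(document_text, phrase_values, threshold):
    """Return (belongs_to_topic, score) for a document.

    Each phrase contributes its value once if it occurs anywhere in the
    text. Occurrence is tested by case-insensitive substring matching,
    which is an assumption of this sketch.
    """
    text = document_text.lower()
    score = sum(
        value for phrase, value in phrase_values.items() if phrase in text
    )
    # The document belongs to the topic only if the sum strictly
    # exceeds the threshold.
    return score > threshold, score


if __name__ == "__main__":
    doc = "The pitcher gave up a home run in the ninth inning."
    print(classify(doc, PHRASE_VALUES, THRESHOLD))
```

In this sketch, a document mentioning "pitcher," "home run," and "inning" sums to 4.5, which exceeds the assumed threshold of 2.5, so the document would be classified as belonging to the topic.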