This specification relates to classifying documents using scores from multiple classifiers.
Documents (e.g., Web pages or Web sites) can be classified according to one or more document properties. Classified documents can then be treated differently, for example, by a search engine or other information retrieval technique. For example, a document property can be the presence of content on a particular topic of interest, either because the topic is particularly desirable (e.g., a financial site may want to show detailed information about companies' business performance) or because the topic is undesirable (e.g., pornographic content (“porn”) or depictions of violence may be undesired in particular circumstances). Undesired documents can be filtered out of search results, while desirable documents can be shown in preference to documents having uncertain or different topics.
Documents can be classified according to different techniques. For example, human raters can manually classify documents as having a specified property. While highly accurate, manual rating is very time-consuming for large numbers of documents (e.g., a collection of Web documents).
Alternatively, automatic classifiers can flag documents as likely having the particular property. Typically, the classifiers examine the documents for particular types of content, for example, images or text. However, conventional automatic classifiers often do not provide a likelihood that a document has the specified property with a confidence level sufficient to allow automatic actions. In particular, if there are classification systems at both the Web page level and the Web site level, an action at the site level affects all pages of that site, so a site-level action requires very high confidence. If the Web site as a whole cannot be classified with high confidence, it may be preferable to classify the individual pages based on their individual content. In general, page-level classification is more difficult because there is less information upon which to base the classification.
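The site-level versus page-level trade-off described above can be illustrated with a minimal sketch. It is not taken from this specification: the function name `flag_pages`, the score inputs, and both threshold values are hypothetical, chosen only to show that a site-wide action demands a stricter confidence than a per-page action.

```python
from typing import Dict


def flag_pages(site_score: float,
               page_scores: Dict[str, float],
               site_threshold: float = 0.99,   # hypothetical: very high bar for site-wide action
               page_threshold: float = 0.90    # hypothetical: bar for a single page
               ) -> Dict[str, bool]:
    """Decide which pages of one site to flag as having the property.

    site_score and page_scores are assumed to be classifier likelihoods
    in [0, 1] that the site (or page) has the specified property.
    """
    # A site-level action affects every page, so it is taken only
    # when the site-level likelihood clears the stricter threshold.
    if site_score >= site_threshold:
        return {url: True for url in page_scores}

    # Otherwise fall back to classifying each page on its own content.
    return {url: score >= page_threshold for url, score in page_scores.items()}


if __name__ == "__main__":
    pages = {"example.test/a": 0.95, "example.test/b": 0.20}
    print(flag_pages(0.995, pages))  # site-level action: every page flagged
    print(flag_pages(0.60, pages))   # page-level fallback: only the first page flagged
```

In this sketch the fallback branch relies only on each page's own score, which reflects the difficulty noted above: a single page offers less evidence than the site as a whole.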