Categories are a way of grouping objects that are similar according to one or more criteria. For example, text-based objects may be categorized by the language in which they are written or by subject matter. The most popular technique for categorizing documents by subject matter is the ad hoc method, in which a person reviews each document and assigns it to a category. This method is time-consuming and may not be practical for large sets of documents. For example, categorizing web pages as they are added to the World Wide Web may not be possible without a large workforce.
Accordingly, automated approaches have been explored. One automated approach is the ‘vector-space’ method. In this method, a vector of uncommon features, typically words, is generated from the objects already assigned to a category. A similar vector is generated for the object to be categorized. The likelihood that the object should be assigned to a particular category is determined by calculating the cosine distance between the two vectors: the smaller the distance, the more closely the object resembles the category.
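The vector-space comparison can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method of any particular system: the function names (`feature_vector`, `cosine_similarity`, `best_category`) and the sample data are hypothetical, features are taken to be whitespace-separated lowercase words, and every word is counted rather than only uncommon ones. The sketch computes cosine similarity; the cosine distance is one minus that value.

```python
import math
from collections import Counter


def feature_vector(texts):
    """Count word occurrences across a list of texts (assumed feature = word)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def best_category(item_text, categories):
    """Return the most similar category name and all similarity scores.

    `categories` maps a category name to texts already assigned to it.
    """
    item_vec = feature_vector([item_text])
    scores = {
        name: cosine_similarity(item_vec, feature_vector(texts))
        for name, texts in categories.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical sample data, for illustration only.
categories = {
    "sports": ["football match score", "tennis match"],
    "cooking": ["bake bread oven", "soup recipe"],
}
name, scores = best_category("tennis score", categories)
print(name, scores)
```

Even in this small form, a vector must be built and compared for every candidate category, which hints at how the cost grows with the number of categories and features.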
Another automated approach was developed in the field of information retrieval. Information retrieval typically uses inverse document frequency to weight a word based on the proportion of documents in which it occurs, so that a word occurring in fewer documents receives a greater weight. Similarly, an inverse feature frequency may be used to weight a feature based on the proportion of documents in which it occurs.
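As a rough illustration of this weighting, the following Python sketch computes one common form of inverse document frequency, log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `inverse_document_frequency`, the whitespace tokenization, and the sample documents are assumptions made for this example; other variants of the formula exist.

```python
import math


def inverse_document_frequency(documents):
    """Map each word to log(N / df), one common form of the IDF weight."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        # Use a set so a word is counted at most once per document.
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / df) for word, df in doc_freq.items()}


# Hypothetical sample documents, for illustration only.
docs = ["red car", "red bike", "blue car"]
weights = inverse_document_frequency(docs)
# "blue" appears in one document and "red" in two, so "blue" weighs more.
print(weights["blue"] > weights["red"])
```

A word that appears in every document receives a weight of zero under this form, reflecting that it does nothing to distinguish one document from another.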
Nonetheless, the techniques described above may not be ideal for categorizing an item within a database. Ad hoc categorization may vary depending on the person selecting the category. The vector-space method may require more analysis and computational power than is necessary or appropriate to properly categorize an item. Finally, methods based on information retrieval may be optimized for items that contain a large amount of text and so may be inappropriate for items in which text is sparse.