The present application describes systems and techniques relating to information retrieval (IR), for example, taxonomy generation for document classification.
Searching for information in a large collection of documents is often time-consuming. To increase the speed of the search, the document collection may be organized in a structured way, e.g., in clusters where documents on similar topics are stored together. Taxonomy generation deals with categorizing and labeling documents to satisfy a user's need for efficient document searching and retrieval.
A taxonomy is a set or hierarchy of categories in which each category contains thematically similar objects. Examples of taxonomies include genera and species in biology, and galaxies and stars in astronomy. In such taxonomies, the classification is based on clearly distinct properties such as shape or size. However, such classification criteria may not be suitable for taxonomies for document collections because different authors may use different words to express or describe the same themes.
Data and/or text mining methods may be used for the automatic classification of documents. An automatic classification system assigns a document to a best matching class, i.e., a class in which the documents already assigned to it are more closely related to the “new” document being classified than are documents taken from any other class. Typically, the documents are standardized by segmenting them into individual words or phrases (i.e., “terms”). The documents are represented as term vectors in a term-vector space, in which each term corresponds to one component (or coordinate) of the term vector. Each component of a term vector represents the frequency of one of the terms in the document, i.e., the term frequency (TF). The components of the term vector may be normalized using suitable statistical information.
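The term vector representation described above may be illustrated with a minimal sketch. The tokenizer, the function names (`tokenize`, `term_vectors`, `normalize_by_length`), and the choice of dividing by document length are assumptions made for illustration only; the source does not prescribe a particular segmentation or normalization.

```python
import re
from collections import Counter


def tokenize(text):
    """Segment text into lowercase word terms (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_vectors(documents):
    """Return (vocabulary, vectors) with one raw term-frequency vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary fixes which term each vector component refers to.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


def normalize_by_length(vector):
    """Divide each count by the total term count (one possible normalization, assumed here)."""
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)


docs = ["the cat sat on the mat", "the dog sat"]
vocab, vecs = term_vectors(docs)
# vocab is ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vecs[0] is [1, 0, 1, 1, 1, 2]; "the" occurs twice in the first document
```

Each document is thereby mapped to a point in the same term-vector space, so that documents of different lengths and wording can be compared component by component.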
Two documents may be considered similar if the angle between their corresponding term vectors is small or, equivalently, if the cosine of the angle between the vectors is large. The cosine of the angle may be calculated from the normalized scalar product of the vectors. The larger the value of a component of a term vector, the more important the corresponding term is in the document, and thus the greater its influence on the scalar product used to calculate the similarity of two documents.
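The similarity measure described above, the scalar product of two term vectors divided by the product of their lengths, may be sketched as follows. The function name `cosine_similarity`, the example vectors, and the convention of returning 0.0 for an all-zero vector are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a document with no terms
    return dot / (norm_a * norm_b)


# Hypothetical term-frequency vectors over a shared six-term vocabulary.
doc_a = [1, 0, 1, 1, 1, 2]
doc_b = [0, 1, 0, 0, 1, 1]
doc_c = [2, 0, 2, 2, 2, 4]  # same term proportions as doc_a, twice as long

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because the scalar product is divided by both vector lengths, `doc_a` and `doc_c` point in the same direction and obtain a similarity of approximately 1, whereas `doc_a` and `doc_b` share only some terms and score lower.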
Methods such as K nearest neighbor (KNN), Rocchio, Bayes, and support vector machines (SVM) may be used in text mining systems for automatic classification of documents. These methods use a given allocation of example documents (“training documents”) to existing categories to calculate rules for classifying new documents. The training documents have a verified assignment to one or more classes, and the classification rule is calculated by the chosen method from the term vector representations of these documents and their class assignments.
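As one example of the methods named above, a K nearest neighbor classifier over term vectors may be sketched as follows. This is a minimal illustration rather than a definitive implementation: the four-term vocabulary, the training vectors, the class labels, and the function names are all hypothetical, and cosine similarity is assumed as the closeness measure.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a document with no terms
    return dot / (norm_a * norm_b)


def knn_classify(new_vector, training_vectors, training_labels, k=3):
    """Assign the majority class among the k most similar training documents."""
    ranked = sorted(
        zip(training_vectors, training_labels),
        key=lambda pair: cosine_similarity(new_vector, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    # On a tied vote, the class that appears first among the neighbors wins.
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical vocabulary: ["ball", "goal", "stock", "market"]
training_vectors = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 4, 2], [0, 1, 2, 3]]
training_labels = ["sports", "sports", "finance", "finance"]

label_1 = knn_classify([1, 2, 0, 0], training_vectors, training_labels, k=3)
label_2 = knn_classify([0, 0, 1, 2], training_vectors, training_labels, k=3)
```

Here the verified class assignments of the training documents serve directly as the classification rule: a new document receives the class held by most of its nearest neighbors in the term-vector space. Methods such as Rocchio or SVM would instead derive a compact rule, e.g., class centroids or a separating hyperplane, from the same training data.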