The exemplary embodiment relates to document classification and is described with illustrative reference to retrieval and categorization applications. It is to be appreciated that it may find application in numerous other applications entailing comparison, retrieval, categorization, and the like.
Many automated document processing tasks involve assigning one (or multiple) score(s) to a given document and then making a decision by comparing the score(s) to a threshold. For example, document classification generally involves computing the relevance of a class with respect to a document, based on the content of the document (e.g., “is it probable that this photographic image contains a cat?” or “is it probable that this text document is relevant to a particular litigation matter?”). As another example, document retrieval generally involves computing a matching score between a query document and a set of database documents (e.g., “find the most similar images to this image of a dog”).
In most cases, the scoring process can be subdivided into two steps. In a first step, a global representation X of the document is computed. In a second step, a global score Y=f(X) is computed, based on the global representation. The first of these steps is typically the more computationally intensive of the two. Reducing its cost would therefore be desirable.
In the case of document classification, for example, various text and image classification techniques have been developed. For text classification, a text document can be classified as follows. First, an optical character recognition (OCR) engine is used to extract low-level features of the document, which in this case may be all the words in the document. Then, the document is described using a bag-of-words (BoW) histogram by counting the number of occurrences of each of a predetermined set of words. The histogram serves as the global representation of the document. The global representation is fed to a classifier which computes a score associated with a classification label. The score can be compared with a threshold to determine if the label is appropriate for the document. In general, successful text classifiers combine a linear support vector machine (SVM) classifier with high-dimensional representations. In this example, the OCR step is by far the most CPU-intensive step. For example, it may take several seconds per page, especially in the case of difficult documents such as noisy documents or documents with non-standard fonts. Comparatively, the cost of the rest of the processing is insignificant.
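By way of illustration only, the two-step scoring described above for text documents can be sketched as follows. The sketch assumes the OCR output is already available as a list of words; the vocabulary, weights, bias, and threshold are hypothetical values chosen for the example, and the linear scoring function merely stands in for a trained linear classifier such as an SVM.

```python
from collections import Counter

# Hypothetical vocabulary and linear-classifier parameters (illustrative only;
# in practice these would come from a trained classifier).
VOCABULARY = ["contract", "invoice", "patent", "claim"]
WEIGHTS = [0.8, -0.5, 1.2, 0.9]
BIAS = -1.0
THRESHOLD = 0.0


def bow_histogram(words, vocabulary):
    """Step 1: count occurrences of each vocabulary word (global representation X)."""
    counts = Counter(w.lower() for w in words)
    return [counts[term] for term in vocabulary]


def linear_score(histogram, weights, bias):
    """Step 2: compute the global score Y = f(X) as a linear function of X."""
    return sum(w * x for w, x in zip(weights, histogram)) + bias


# Words as they might be produced by the (expensive) OCR step.
ocr_words = "The patent claim cites an earlier patent".split()

X = bow_histogram(ocr_words, VOCABULARY)
Y = linear_score(X, WEIGHTS, BIAS)
label_applies = Y > THRESHOLD  # decision by comparing the score to a threshold
```

In this sketch the histogram and score computations are trivial compared with producing `ocr_words`, which mirrors the cost imbalance noted above.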
In the case of image classification, an image may be classified as follows. First, a predefined set of samples (i.e., local image sub-regions or “patches”) is selected. Patch descriptors (e.g., color or gradient descriptors) are extracted from each patch based on low level features and subsequently quantized into visual words. A histogram, such as a bag-of-visual-words (BoV) histogram, is then computed by counting the number of occurrences of each visual word. Separate histograms can be generated for each type of descriptor and then aggregated. The histogram serves as the global representation (global descriptor) of the document. The classification can then proceed as for text documents. The most computationally intensive steps, by far, are the sample description and quantization. The cost of the rest of the processing method is insignificant.
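By way of illustration only, the quantization and counting steps described above can be sketched as follows. The sketch assumes that patch descriptors have already been extracted as numeric tuples and that quantization is performed by nearest-neighbor assignment to a codebook of visual words; the two-dimensional codebook and the descriptor values are hypothetical, and real descriptors would have far more dimensions.

```python
def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bov_histogram(descriptors, codebook):
    """Quantize each patch descriptor into a visual word and count occurrences."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, codebook)] += 1
    return histogram


# Hypothetical codebook of three visual words and four patch descriptors.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
patch_descriptors = [(0.1, 0.1), (0.9, 0.2), (0.8, -0.1), (0.2, 0.9)]

histogram = bov_histogram(patch_descriptors, CODEBOOK)
```

The resulting histogram plays the same role as the BoW histogram for text and could be passed to a classifier in the same way.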
In the case of document retrieval, the computation of the global descriptor of each document (text or image) to be compared is analogous and is also more computationally expensive than the steps of comparison and retrieval of similar documents.
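By way of illustration only, the comparison and retrieval steps can be sketched as follows once the global descriptors are available. The use of cosine similarity as the matching score is an assumption made for this example, and the database contents and identifiers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Matching score between two global descriptors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, top_k):
    """Return the identifiers of the top_k database documents most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in database.items()]
    scored.sort(reverse=True)  # highest matching score first
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical precomputed global descriptors for three database documents.
DATABASE = {
    "doc_a": [3, 0, 1],
    "doc_b": [0, 2, 2],
    "doc_c": [2, 0, 1],
}
query_descriptor = [2, 0, 1]

results = retrieve(query_descriptor, DATABASE, top_k=2)
```

Here the scoring and ranking involve only a few arithmetic operations per document, whereas computing each descriptor would require the OCR or patch-processing steps described above.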
The exemplary embodiment enables document classification tasks, such as categorization and retrieval, to be performed more efficiently by estimating the global score prior to completion of the first, more computationally expensive step, and determining whether the estimated score is sufficiently reliable to be the basis of a classification decision.