The exemplary embodiment relates to feature extraction. It finds particular application in the extraction of features from documents, such as scanned documents, where extracted text sequences may include noise, in the form of unpredictable variations from the text of the original document. The extracted features find use in classification and other document processing applications.
Optical character recognition (OCR) techniques employ software which extracts textual information from scanned images. Such techniques have been applied to extract textual information from books, business cards, and the like. Once text is extracted, each text line can be tagged as to data type. In the case of business cards, for example, the data types may include “personal name,” “job title,” “entity affiliation,” “telephone number,” “e-mail address,” “company URL,” and the like. OCR techniques invariably result in some errors, both in the recognition of the individual characters in the digital document and in the correct association of the extracted information with specific data types (tagging).
In a supervised learning approach, a training set of objects, such as text sequences extracted from OCR-ed text documents, is provided with pre-determined class labels. Features of the objects are identified, and a classifier is trained to identify class members based on characteristic features identified from the training set. In some approaches, the class labels may not be provided a priori but rather extracted by grouping together objects of the training set with similar sets of features. This is sometimes referred to as unsupervised learning or clustering.
In the analysis of complex input data, one major problem is the number of features used. The computational complexity of categorization increases rapidly with increasing numbers of objects in the training set, with increasing numbers of features, and with increasing numbers of classes. Data analysis with too many features generally requires a large amount of memory and computation power. Additionally, the classification algorithm may overfit the training samples and generalize poorly to new samples.
When the input data is too complex to be processed directly, it can be transformed into a reduced representative set of features; such a transformation is called feature extraction. One way to reduce this complexity is to reduce the number of features under consideration. Reducing the number of features typically yields advantages such as faster learning and prediction, easier interpretation, and better generalization. If the features are carefully chosen, they are expected to capture the information in the input data that is relevant to the desired task, such as populating a form, directing mail, categorizing documents, or the like. However, the removal of features can adversely impact the classification accuracy.
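The trade-off between the number of features and the information retained can be illustrated with a minimal sketch. It assumes a simple document-frequency threshold, a generic feature-selection heuristic chosen here only for illustration and not the method of the exemplary embodiment. The function names `select_features` and `vectorize` and the parameter `min_df` are hypothetical.

```python
from collections import Counter


def select_features(documents, min_df=2):
    """Keep tokens that occur in at least min_df documents (illustrative heuristic)."""
    document_frequency = Counter()
    for tokens in documents:
        # Count each token once per document, however often it repeats.
        document_frequency.update(set(tokens))
    return sorted(t for t, n in document_frequency.items() if n >= min_df)


def vectorize(tokens, vocabulary):
    """Map a token sequence to counts over the reduced vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


# Hypothetical token sequences, e.g. from OCR-ed business cards.
docs = [["tel", "555", "fax"], ["tel", "mail"], ["fax", "tel", "x9q"]]
vocabulary = select_features(docs, min_df=2)
vector = vectorize(["tel", "tel", "fax", "zzz"], vocabulary)
```

Rare tokens such as an OCR-garbled string are discarded, which shrinks the feature space, but a rare token that happens to be informative is discarded as well, which is how feature removal can reduce accuracy.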
One goal in feature extraction is thus to construct combinations of features that reduce these problems while still describing the complex data with sufficient accuracy. Both rule-based and learning-based systems commonly use rules and regular expressions to analyze the text. Because manually crafted rules for analyzing text tend to be very sensitive to OCR errors, string-distance techniques, equivalent dictionary-based techniques, and fuzzy rules have been proposed.
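As an illustration of how a string-distance technique can tolerate OCR errors that would defeat an exact-match rule, the following sketch uses the Levenshtein edit distance to look up a noisy token in a small dictionary. The helper names `edit_distance` and `fuzzy_lookup`, the threshold `max_distance`, and the sample dictionary are assumptions made for this example and are not taken from any particular prior system.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_lookup(token, dictionary, max_distance=1):
    """Return the closest dictionary entry within max_distance, else None."""
    lowered = token.lower()
    best = min(dictionary, key=lambda word: edit_distance(lowered, word))
    return best if edit_distance(lowered, best) <= max_distance else None


# Hypothetical keyword dictionary; "Te1ephone" mimics an l -> 1 OCR confusion.
keywords = ["telephone", "email", "fax"]
match = fuzzy_lookup("Te1ephone", keywords)
no_match = fuzzy_lookup("xyz", keywords)
```

An exact-match rule for "telephone" would miss the noisy token, whereas the distance-based lookup still recovers it. Raising `max_distance` tolerates more noise at the cost of more false matches.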
Feature extraction from noisy documents is even more challenging when the content noise (OCR errors) is accompanied by structural noise (segmentation errors). In scanned and OCR-ed documents, document sequences are often under- or over-segmented. In semi-structured documents, the segmentation inconsistency can result from an ambiguous page layout, format conversion, and other issues.
The exemplary embodiment provides an automated method of feature extraction suited to layout-oriented and semi-structured documents, which finds application, for example, in the context of metadata extraction and element recognition tasks.