1. Field of the Invention
Embodiments of the present invention relate generally to data capture using optical character recognition (OCR), and specifically to a method and system for automatic classification of different types of documents, especially different kinds of forms.
2. Related Art
According to known methods of text pre-recognition, an image is parsed into text regions and/or non-text regions, and the text regions are further divided into objects such as strings, words, character groups, characters, etc.
Some known methods use preliminary document type identification to narrow a list of possible document types by examining the logical structure of the document.
According to this group of methods, document type identification is an independent step of document analysis that precedes logical structure identification. Only after a document type and its list of properties are identified can the logical structure of the document be determined. Alternatively, identifying the document type may be an integral part of the logical structure identification process. In this case, the document type that most closely fits the analyzed image is selected.
Examining the logical structure of a document requires dividing the document image into elements of different types. For example, a single element of a document can contain its title, the author's name, the date of the document, the main text, etc. The composition of the document elements depends on the document type.
Typically, the document logical structure is examined in one or more of the following ways:
on the basis of fixed elements location,
using a table or multi-column structure,
on the basis of structural image identification, and
via specialized methods for special document types.
A method from the first group (fixed element location) requires locating fixed structural elements and involves marking fields, i.e., image regions containing elements of documents of a standard form. The exact location of elements on the form may be distorted by scanning. The distortion may be of one or more kinds: shift, rotation by a small angle, rotation by a large angle, compression, and stretching.
Distortion of all these kinds can usually be eliminated at the first stage of document image processing.
The coordinates of regions may be found relative to the following:
image edges,
special reference points,
distinctive form elements, and
a correlation function that takes into account all or some of the above.
In some cases the distortion is negligible and may be ignored. Image coordinates are then computed relative to the document image edges.
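As an illustration only, locating a field relative to special reference points can be sketched as follows. The sketch assumes that the distortion is limited to shift, rotation, and uniform scaling, so that two reference points are enough to estimate it; non-uniform compression or stretching would need a more general transform. The function names and the coordinate convention are hypothetical and are not taken from any particular known method.

```python
def similarity_from_anchors(template_pts, observed_pts):
    """Estimate the mapping z -> a*z + b that carries two template
    reference points onto their observed positions in the scanned image.

    Points are (x, y) tuples treated as complex numbers x + iy.
    Assumption: shift, rotation and uniform scale only.
    """
    p1, p2 = (complex(*p) for p in template_pts)
    q1, q2 = (complex(*q) for q in observed_pts)
    if p1 == p2:
        raise ValueError("template reference points must be distinct")
    a = (q2 - q1) / (p2 - p1)  # rotation and uniform scale
    b = q1 - a * p1            # shift
    return a, b


def map_point(transform, point):
    """Map one template point into scanned-image coordinates."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


def map_field(transform, box):
    """Map a template field box (left, top, right, bottom) to the four
    corner points of the corresponding region in the scanned image."""
    left, top, right, bottom = box
    corners = ((left, top), (right, top), (right, bottom), (left, bottom))
    return [map_point(transform, c) for c in corners]
```

When the distortion is negligible, the same field box can be used directly with coordinates measured from the image edges, which corresponds to a = 1 and b = 0 in this sketch.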
Many of the methods for form type identification use special graphic objects as reliable and identifiable reference points. Special graphic objects may be black squares or rectangles, short dividing lines forming a cross or a corner, etc. By searching for and identifying a reference point location, or a combination of reference point locations, in a document image using a special model, the type of the analyzed form can be correctly identified.
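A minimal sketch of form type identification by reference point locations is given below. It assumes that the reference marks have already been detected and that their positions are normalized to the page size (values from 0 to 1); the model format, the tolerance, and the acceptance threshold are hypothetical choices made only for illustration.

```python
import math


def match_score(expected_marks, found_marks, tolerance=0.02):
    """Fraction of the model's expected marks that have a detected mark
    within `tolerance` (in normalized page units)."""
    if not expected_marks:
        return 0.0
    hits = sum(
        1
        for expected in expected_marks
        if any(math.dist(expected, found) <= tolerance for found in found_marks)
    )
    return hits / len(expected_marks)


def identify_form_type(models, found_marks, min_score=0.75):
    """Return the form type whose model best matches the detected marks,
    or None when no model reaches `min_score`."""
    best_type, best_score = None, 0.0
    for form_type, expected_marks in models.items():
        score = match_score(expected_marks, found_marks)
        if score > best_score:
            best_type, best_score = form_type, score
    return best_type if best_score >= min_score else None


# Hypothetical models: form type -> expected mark positions.
models = {
    "questionnaire": [(0.05, 0.05), (0.95, 0.05), (0.05, 0.95)],
    "declaration": [(0.50, 0.05), (0.50, 0.95)],
}
detected = [(0.051, 0.049), (0.949, 0.052), (0.050, 0.951)]
form_type = identify_form_type(models, detected)
```

Returning None when no model matches well enough leaves room for a separate handling path for documents that are not fixed forms.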
If the number of documents to be processed is large, automated data input and document capture systems can be used. A data capture system allows documents of different types, including fixed (structured) forms and non-fixed (flexible or semi-structured) forms, to be scanned, recognized, and entered into databases.
During simultaneous input of documents of different types, the type of each document should be preliminarily identified so that a further processing method can be chosen for each document according to its type.
Generally, there are two kinds of forms: fixed forms and flexible forms.
Fixed forms typically have the same number and positioning of fields. Such forms often have anchor elements (e.g., black squares, separator lines). Examples of fixed forms, or marked prepared forms, include blanks, questionnaires, statements, and declarations. To find the fields on a fixed form, form description matching is used.
Non-fixed forms, or semi-structured forms, may have a varying number of fields that may be located in different positions from document to document, or from page to page. Also, the appearance of documents of the same type may differ in formatting, design, size, etc. Examples of non-fixed forms include application forms, invoices, insurance forms, payment orders, business letters, etc. To find fields on a non-fixed form, matching of flexible structural descriptions of a document is used. For example, recognizing flexible forms by means of structural description matching is disclosed in U.S. Patent Application having Ser. No. 12/364,266.
A preliminary classification is used to identify a document type, taking into account possible differences. After the type of the document is identified, the document may be sent for further processing corresponding to its document type.
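The overall flow of classifying a document and then sending it to type-specific processing can be sketched as a simple dispatch table. The type names and the placeholder processing functions below are hypothetical; in a real system each would stand for form description matching or flexible structural description matching.

```python
def process_fixed_form(document_id):
    # Placeholder for form description matching on a fixed form.
    return f"{document_id}: form description matching"


def process_flexible_form(document_id):
    # Placeholder for flexible structural description matching.
    return f"{document_id}: flexible structural description matching"


# Hypothetical mapping from identified document type to processing method.
PROCESSORS = {
    "questionnaire": process_fixed_form,
    "declaration": process_fixed_form,
    "invoice": process_flexible_form,
    "payment_order": process_flexible_form,
}


def route_document(document_id, document_type):
    """Send a classified document to the processing chosen for its type."""
    processor = PROCESSORS.get(document_type)
    if processor is None:
        return f"{document_id}: unknown type, manual review"
    return processor(document_id)
```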