The number of documents exchanged between businesses is increasing rapidly. Every institution, be it a commercial company, an educational establishment, or a government organization, receives hundreds or thousands of documents from other organizations every day. All these documents have to be processed as quickly as possible, and the information contained in them is vital for various functions of both the receiving and the sending organizations. It is therefore highly desirable to automate the processing of received documents.
Many document classification systems are known in the art. The references described below and the art cited in those references are incorporated in the background below. There are at least two ways of interpreting the term “classification”. One relates to classifying documents into groups having similar context, which normally means documents having similar collections of related keywords; this is sometimes called categorization. The other treats documents as similar if they have similar layouts. It is this latter kind of classification with which the present invention is concerned.
U.S. Pat. No. 8,831,361 B2 describes a system for commercial document image classification. However, that patent does not address the optimal selection of training images. The problem of optimal selection of training images also arises in commercial forms such as the Fannie Mae 1003 Uniform Residential Loan Application, where permanent information (layout elements) on the form is mixed with variable information that changes between documents of the same type, all of which must be classified as belonging to the same class. If variable elements of the layout participate in the classification process, they can considerably impair the classification results. It is therefore desirable to use only the permanent elements of the layout for classification purposes and to ignore the variable ones. The present invention discloses a method of using only the permanent information of the documents for these purposes. U.S. Pat. No. 8,831,361 B2 is incorporated herein by reference.
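By way of illustration only, the distinction between permanent and variable layout elements can be sketched as follows. This is a minimal sketch and not the method disclosed herein: it assumes that each sample document has already been reduced to a list of text elements with positions, and it treats an element as permanent if the same text appears at approximately the same position in every sample. The names `Element`, `permanent_elements`, and the `grid` tolerance parameter are hypothetical and chosen for this example.

```python
from collections import Counter
from typing import List, NamedTuple


class Element(NamedTuple):
    """A text element of a document image (hypothetical representation)."""
    text: str
    x: int  # left coordinate, in pixels
    y: int  # top coordinate, in pixels


def _key(element: Element, grid: int):
    # Quantize the position so that small scanning offsets map to the same cell.
    # Elements lying close to a cell boundary may still be split apart; a real
    # system would need a more tolerant matching step.
    return (element.text.strip().lower(),
            round(element.x / grid),
            round(element.y / grid))


def permanent_elements(samples: List[List[Element]], grid: int = 10) -> List[Element]:
    """Return the elements of the first sample that recur in every sample."""
    if not samples:
        return []
    counts = Counter()
    for sample in samples:
        # Use a set so that an element is counted at most once per sample.
        counts.update({_key(e, grid) for e in sample})
    return [e for e in samples[0] if counts[_key(e, grid)] == len(samples)]


# Two filled-in copies of the same form: legends recur, entered values differ.
sample_a = [Element("Name of Borrower", 50, 100),
            Element("John Smith", 200, 100),
            Element("Social Security Number", 50, 140)]
sample_b = [Element("Name of Borrower", 51, 101),
            Element("Mary Jones", 200, 100),
            Element("Social Security Number", 49, 139)]

permanent = permanent_elements([sample_a, sample_b])
```

In this example only the two pre-printed legends survive the filter, while the borrower names, which change from document to document, are discarded.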
The present invention discloses a fully automatic method of generating training images for classifying documents that originate from a specific printing program (such as invoice printing software or a form such as the Fannie Mae 1003). These documents typically exhibit a specific pre-programmed layout. In this context, the layout means a specific geometric configuration of isolated text blocks and their interrelations, geometric lines and their interrelations, and the contents of text blocks or keywords, such as legends pre-printed on forms (e.g., name of borrower, social security number, etc.).
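For concreteness, one possible in-memory representation of a layout as defined above is sketched below. It is an assumption made purely for illustration and is not taken from the disclosure: the class names `TextBlock`, `Line`, and `Layout`, their fields, and the helper `block_offset` are all hypothetical. Interrelations are modeled here simply as relative offsets between elements.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """An isolated text block: bounding box plus its (possibly empty) contents."""
    x: int
    y: int
    width: int
    height: int
    text: str = ""


@dataclass(frozen=True)
class Line:
    """A geometric (ruled) line given by its two end points."""
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def is_horizontal(self) -> bool:
        # A line is treated as horizontal if it extends at least as far in x as in y.
        return abs(self.y2 - self.y1) <= abs(self.x2 - self.x1)


@dataclass
class Layout:
    """Text blocks and lines that together describe one document layout."""
    text_blocks: List[TextBlock] = field(default_factory=list)
    lines: List[Line] = field(default_factory=list)


def block_offset(a: TextBlock, b: TextBlock) -> Tuple[int, int]:
    """One simple interrelation: displacement of block b relative to block a."""
    return (b.x - a.x, b.y - a.y)


layout = Layout(
    text_blocks=[TextBlock(50, 100, 120, 12, "Name of Borrower"),
                 TextBlock(50, 140, 150, 12, "Social Security Number")],
    lines=[Line(40, 160, 560, 160)],
)
offset = block_offset(layout.text_blocks[0], layout.text_blocks[1])
```

Relative offsets are used in this sketch because they are unaffected by a uniform shift of the whole page, which is a common artifact of scanning.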