The number of documents exchanged between businesses is increasing very rapidly. Every institution, be it a commercial company, an educational establishment or a government organization, receives hundreds or thousands of documents from other organizations every day. All these documents have to be processed as fast as possible, and the information contained in them is vital for various functions of both the receiving and the sending organizations. It is, therefore, highly desirable to automate the processing of received documents.
There are many document classification systems known in the art. The references described below and the art cited in those references are incorporated in the background below.
There are at least two ways of interpreting the term “classification”. One relates to classifying documents into groups having similar context. Normally it means documents having similar collections of related keywords. This is sometimes called categorization. Another way of classifying documents treats documents as similar if they have similar layouts. It is this latter classification that the present invention is concerned with.
The existing classification patents can themselves be classified into several groups. One group is concerned with classifying electronic documents (normally electronically generated documents rather than images of documents) in the natural language processing context, where the features used for classification are words (keywords) present in the documents to be classified and attributes of those words, such as their frequency of occurrence. To this category belong, for example, U.S. Pat. No. 6,976,207 and U.S. Pat. No. 6,243,723. Another group deals with methods of combining multiple different classifiers to achieve better results than a single classifier (U.S. Pat. No. 7,499,591, U.S. Pat. No. 6,792,415). Yet another group targets optimization of classification features so that the various classes of documents are separated more efficiently in the feature vector space, as exemplified by U.S. Pat. No. 7,185,008. There are also patents that classify documents into genres such as advertisements, brochures, photos, receipts, etc. (US Patent Application 2010/0284623). Yet another group of patents attempts to extract features from document images that can be useful in classifying various types of documents (U.S. Pat. No. 5,555,556). There are also patents that prescribe using layouts for document classification and identification. To this category belong U.S. Pat. No. 6,542,635, US Patent Application 2004/0013302, U.S. Pat. No. 6,721,463 and the references cited therein. US Patent Application 2009/0154778 discloses a system for identification of personal identity documents such as passports. U.S. Pat. No. 6,542,635 teaches a method of classifying documents into types such as letters, journals and magazines by first segmenting their images into blocks of text and white space and then using hidden Markov models to train a classifier to distinguish between these categories.
In U.S. Pat. No. 6,542,635, layout is defined as a unique fixed vector scheme encoding each row of text. That patent does not address the problem of identifying documents having fixed layouts that originate from the same printing program or source, and it does not utilize features other than text blocks. U.S. Pat. No. 6,721,463 prescribes using ruled lines and document titles for document classification and ignores other elements present in the document. US Patent Application 2004/0013302 builds a layout graph model which utilizes such specific features as fonts and font sizes and leaves out geometric lines as informative features; its classification is based on comparison of layout graphs. Also known in the art are document classification systems (for example, U.S. Pat. No. 6,243,723) that require a human to manually set up the features salient for classification, namely those features that would be present in one type of document and absent in others, such as specific logos, specific keywords, company names, etc. All described patents are incorporated herein by reference.
Unlike methods deployed in the prior art, the present invention teaches a fully automatic method of classifying documents which originate from a specific printing program (such as invoice printing software, an explanation of benefits printing program or a bill of lading printing system). These documents typically exhibit a specific pre-programmed layout. The layout in this context means a specific geometric configuration of isolated text blocks and their interrelations, geometric lines and their interrelations, and the contents of the text blocks. Thus the prior art either addresses a different problem of classifying documents into more general classes or genres, such as letters or journal articles, or ignores some vital information useful in classification. In contrast, the present invention overcomes difficulties of the prior art by a fully automated method and system that effectively utilizes classification-critical information.
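By way of illustration only, the three ingredients of a layout named above (text blocks, geometric lines, and text block contents) could be held in a structure such as the following Python sketch. The names TextBlock, LineSegment, Layout and block_offsets are hypothetical and are not taken from the invention, and the use of pairwise origin offsets as a stand-in for the "interrelations" of text blocks is an assumption made solely for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """An isolated text block: bounding box plus its textual contents."""
    x: float
    y: float
    width: float
    height: float
    text: str


@dataclass(frozen=True)
class LineSegment:
    """A geometric (ruled) line given by its two end points."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class Layout:
    """Hypothetical container for the layout elements of one document page."""
    text_blocks: List[TextBlock]
    lines: List[LineSegment]

    def block_offsets(self) -> List[Tuple[float, float]]:
        """Offsets between the origins of every pair of text blocks.

        Assumption: relative offsets are used here as one simple way to
        express how text blocks are positioned with respect to each other.
        """
        offsets = []
        for i, a in enumerate(self.text_blocks):
            for b in self.text_blocks[i + 1:]:
                offsets.append((b.x - a.x, b.y - a.y))
        return offsets


# Usage: a toy invoice page with two text blocks and one ruled line.
page = Layout(
    text_blocks=[
        TextBlock(10, 20, 80, 12, "Invoice No."),
        TextBlock(110, 20, 60, 12, "12345"),
    ],
    lines=[LineSegment(10, 40, 200, 40)],
)
offsets = page.block_offsets()
```

Two documents produced by the same printing program would be expected to yield similar structures of this kind, whereas documents from different sources generally would not; how such structures are actually compared is outside the scope of this sketch.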