The exemplary embodiments disclosed herein relate to document processing and find particular application in connection with a method and system for extracting data from a digital version of a document. Specifically, according to an exemplary embodiment, a method and system are provided to extract data from a document including a tabulated layout, e.g., a form or an invoice.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. To provide greater accessibility to the content of such documents, it is desirable to understand their structure. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation, e.g., a PDF (Portable Document Format) or PostScript representation, a loss of document structure usually results because the representation of the document is either at a very low level, e.g., a bitmap, or at an intermediate level, e.g., a document formatted in a page description language or a portable document format.
Geometric or physical page layout analysis can be used to recognize the different elements of a page, often in terms of text regions and image regions. Methods are known for determining a document's logical structure, or the order in which objects are laid out on a document image, i.e., layout objects. Such methods exploit the geometric or typographical features of document image objects, sometimes using the content of objects and a priori knowledge of page layout for a particular document class. Geometric page layout analysis (GPLA) algorithms have been developed to recognize different elements of a page, often in terms of text blocks and image blocks. Examples of such algorithms include the X-Y Cut algorithm, described by Nagy et al., “A PROTOTYPE DOCUMENT IMAGE ANALYSIS SYSTEM FOR TECHNICAL JOURNALS”, CSE Journal Article, Department of Computer Science and Engineering, pages 10-22, July, 1992, and the Smearing algorithm, described by Wong et al., “Document analysis system”, IBM Journal of Research and Development, volume 26, No. 6, pages 647-656, November, 1982. These GPLA algorithms receive a page image as input and perform a segmentation based on information, such as pixel information, gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page. These methods are useful for segmenting pages one-dimensionally, into columns.
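By way of illustration only, a top-down segmentation in the spirit of the X-Y Cut algorithm can be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm as published by Nagy et al.: the page is assumed to be a non-empty binary pixel grid (1 for ink, 0 for background), a cut is made at every fully blank row or column (practical implementations typically require a minimum gap width), and the names xy_cut and _runs are hypothetical.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def _runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return the [start, end) index ranges where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def xy_cut(page: List[List[int]], box: Optional[Box] = None) -> List[Box]:
    """Recursively split a binary page at blank rows, then at blank columns.

    Assumption: any fully blank row or column is a valid cut position.
    """
    if box is None:
        box = (0, 0, len(page), len(page[0]))
    top, left, bottom, right = box

    # Horizontal projection profile: amount of ink in each row of the box.
    rows = [sum(page[r][left:right]) for r in range(top, bottom)]
    row_runs = _runs(rows)
    if not row_runs:
        return []  # the box contains no ink at all
    if row_runs != [(0, bottom - top)]:
        blocks: List[Box] = []
        for start, end in row_runs:
            blocks.extend(xy_cut(page, (top + start, left, top + end, right)))
        return blocks

    # Vertical projection profile: amount of ink in each column of the box.
    cols = [sum(page[r][c] for r in range(top, bottom)) for c in range(left, right)]
    col_runs = _runs(cols)
    if col_runs != [(0, right - left)]:
        blocks = []
        for start, end in col_runs:
            blocks.extend(xy_cut(page, (top, left + start, bottom, left + end)))
        return blocks

    return [box]  # no blank row or column remains: this is a leaf block


page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
blocks = xy_cut(page)
```

For the small page above, the sketch yields two side-by-side blocks in the upper part of the page and one full-width block below them, i.e., it delimits boxes only and assigns no meaning to their content.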
In addition, as disclosed in U.S. patent application Ser. No. 13/911,452, filed Jun. 6, 2013, by Hervé Déjean, entitled “METHODS AND SYSTEMS FOR GENERATION OF DOCUMENT STRUCTURES BASED ON SEQUENTIAL CONSTRAINTS”, a method and system are provided that structure a sequentially-ordered set of elements, each element being characterized by a set of features. N-grams, i.e., sequences of n features, are computed from sets of n contiguous elements, and the n-grams that are repetitive, e.g., that form a Kleene cross (one or more consecutive occurrences), are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is applied iteratively to this new sequence. The output is an ordered set of trees.
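A single grouping pass of this kind can be sketched as follows. The sketch is illustrative only and rests on simplifying assumptions that are not taken from the referenced application: each element carries exactly one feature, represented as a string; an n-gram counts as repetitive when it is immediately followed by an identical n-gram; and the function names and the "group" node tag are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def most_frequent_repetitive_ngram(features: Sequence, n: int) -> Optional[Tuple]:
    """Return the n-gram most often followed directly by a copy of itself."""
    counts: Counter = Counter()
    for i in range(len(features) - 2 * n + 1):
        gram = tuple(features[i:i + n])
        if gram == tuple(features[i + n:i + 2 * n]):
            counts[gram] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def group_once(features: Sequence, n: int) -> List:
    """Group every run of the selected n-gram under one new ("group", ...) node."""
    gram = most_frequent_repetitive_ngram(features, n)
    if gram is None:
        return list(features)
    out: List = []
    i = 0
    while i < len(features):
        j = i
        # Extend the run while the next n elements match the selected n-gram.
        while tuple(features[j:j + n]) == gram:
            j += n
        if j > i:
            out.append(("group", (gram,) * ((j - i) // n)))
            i = j
        else:
            out.append(features[i])
            i += 1
    return out


sequence = ["title", "label", "value", "label", "value", "label", "value", "footer"]
new_sequence = group_once(sequence, 2)
```

Here the bigram ("label", "value") is selected, and the three consecutive occurrences are replaced by a single node, leaving a new three-item sequence (title, group node, footer) to which the same pass could be applied again.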
A common task in document analysis is extracting data from an unstructured document, sometimes referred to as indexing. The extracted data can correspond to a single piece of text, such as an invoice number, or to structured data including several fields, such as an invoice item having a description, price per unit, total amount, etc. For purposes of this disclosure, this structured data is referred to as sdata (structured data).
A primary issue in extracting structured data is the lack of correspondence between the sdata and its data fields, on the one hand, and the way they are laid out on the page, on the other hand, except for documents which mostly follow a layout template, such as forms. In one document, a single homogeneous block can contain all the data fields. In another document, each field may be spread over table cells. No generic algorithm can systematically provide a segmentation in which each layout element found corresponds to a single sdata. An analysis combining layout information and content information is therefore required to identify complete sdata.
This disclosure provides a method and system to extract data from documents including a tabulated layout, such as forms, and especially invoices. The method and system target a specific data type called label:value, in which the label part is a string representing the data label and is associated with its value part.
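As a purely illustrative sketch of the label:value data type, the following represents one label:value item and pairs each recognized label cell in a table row with the cell to its right. The pairing rule, the set of known labels, and the names LabelValue and pair_row are assumptions made for this example and are not the extraction method of this disclosure.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class LabelValue:
    """One label:value item, e.g., label "Invoice No" with value "12345"."""
    label: str
    value: str


def pair_row(cells: List[str], known_labels: Set[str]) -> List[LabelValue]:
    """Pair each cell whose text is a known label with the next cell.

    Assumption: known_labels holds lowercase labels without a trailing colon.
    """
    pairs: List[LabelValue] = []
    i = 0
    while i < len(cells) - 1:
        label = cells[i].strip().rstrip(":")
        if label.lower() in known_labels:
            pairs.append(LabelValue(label, cells[i + 1].strip()))
            i += 2  # skip the value cell that was just consumed
        else:
            i += 1
    return pairs


row = ["Invoice No:", "12345", "Date", "2013-06-06"]
pairs = pair_row(row, {"invoice no", "date"})
```

In this example, the row yields two items: the label "Invoice No" with the value "12345", and the label "Date" with the value "2013-06-06".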