The exemplary embodiments disclosed herein relate to document processing and find particular application in connection with a method and system for extracting a mathematical structure associated with a financial table included in a financial document.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. To provide greater accessibility to the content of such documents, it is desirable to understand their structure. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation, e.g., a PDF (Portable Document Format) or PostScript representation, a loss of document structure usually results because the representation of the document is either at a very low level, e.g., a bitmap, or at an intermediate level, e.g., a document formatted in a page description language or a portable document format.
Geometric or physical page layout analysis can be used to recognize the different elements of a page, often in terms of text regions and image regions. Methods are known for determining a document's logical structure, or the order in which objects are laid out on a document image, i.e., layout objects. Such methods exploit the geometric or typographical features of document image objects, sometimes using the content of objects and a priori knowledge of page layout for a particular document class. Geometric page layout analysis (GPLA) algorithms have been developed to recognize different elements of a page, often in terms of text blocks and image blocks. Examples of such algorithms include the X-Y Cut algorithm, described by Nagy et al., “A PROTOTYPE DOCUMENT IMAGE ANALYSIS SYSTEM FOR TECHNICAL JOURNALS”, CSE Journal Article, Department of Computer Science and Engineering, pages 10-22, July, 1992 and the Smearing algorithm, described by Wong et al., “Document analysis system”, IBM Journal of Research and Development, volume 26, No. 6, pages 647-656, November, 1982. These GPLA algorithms receive as input a page image and perform a segmentation based on information, such as pixel information, gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page. These methods are useful for segmenting pages one-dimensionally, into columns.
In addition, as disclosed in U.S. patent application Ser. No. 13/911,452, filed Jun. 6, 2013, U.S. Publication No. 2014/0365872, published Dec. 11, 2014, by Hervé Déjean, entitled “METHODS AND SYSTEMS FOR GENERATION OF DOCUMENT STRUCTURES BASED ON SEQUENTIAL CONSTRAINTS”, a method and system are provided that structure a sequentially-ordered set of elements, each element being characterized by a set of features. N-grams, i.e., sequences of n features, are computed from the set for n contiguous elements, and n-grams which are repetitive, e.g., Kleene cross, are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
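By way of illustration only, one grouping pass of the kind described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the referenced application: each element is assumed to be reduced to a single feature label, a "repetitive" n-gram is assumed to be one that is immediately followed by a copy of itself, and the function names `most_frequent_repetitive_ngram` and `group_repetitions` are hypothetical.

```python
from collections import Counter


def most_frequent_repetitive_ngram(features, n):
    """Return the n-gram that most often immediately repeats itself, or None.

    Assumption: one feature label per element; n must be >= 1.
    """
    counts = Counter()
    for i in range(len(features) - 2 * n + 1):
        gram = tuple(features[i:i + n])
        # Count the n-gram only when the next n elements repeat it.
        if tuple(features[i + n:i + 2 * n]) == gram:
            counts[gram] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def group_repetitions(features, gram):
    """Build a new sequence in which each maximal run of contiguous
    occurrences of `gram` is replaced by a single node.

    A node is represented here (hypothetically) as ("node", gram, run_length).
    """
    n = len(gram)
    new_sequence = []
    i = 0
    while i < len(features):
        run = 0
        # Extend the run while the next n elements match the n-gram.
        while tuple(features[i + run * n:i + (run + 1) * n]) == gram:
            run += 1
        if run >= 1:
            new_sequence.append(("node", gram, run))
            i += run * n
        else:
            new_sequence.append(features[i])
            i += 1
    return new_sequence


# Hypothetical feature sequence for the lines of a page.
sequence = ["title", "label", "number", "label", "number",
            "label", "number", "footer"]
best = most_frequent_repetitive_ngram(sequence, 2)
grouped = group_repetitions(sequence, best)
```

In this sketch, a further iteration would apply the same two functions to `grouped`, treating each node as one element of the new sequence.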
A common task in document analysis is extracting data from an unstructured document, sometimes referred to as indexing. The extracted data can correspond to a single piece of text, such as an invoice number, or to structured data including several fields, such as an invoice item having a description, price per unit, total amount, etc.
As disclosed in U.S. patent application Ser. No. 14/955,410, filed Dec. 1, 2015, by Hervé Déjean and entitled “METHOD AND SYSTEM FOR GENERATING A GRAPHICAL ORGANIZATION OF A PAGE”, this structured data is referred to as sdata (structured data), and a primary issue in extracting structured data is the lack of correspondence between the sdata/data fields and the way they are laid out, except for documents which mostly follow a layout template, such as forms. In some documents, one homogeneous block can contain all the data fields. In other documents, each field may be spread over table cells. No generic algorithm can systematically provide a segmentation in which each layout element found corresponds to a single sdata. An analysis combining layout information and content information is then required to identify complete sdata. The same application provides a method and system to generate a graphical organization of a page, which can then be further processed to extract data or other information from the generated graphical organization descriptions.
With regard to the extraction of financial data, approaches to extracting financial data from unstructured tabular documents are provided by U.S. Pat. No. 5,893,131, by Kornfeld, issued Apr. 6, 1999 and entitled “METHOD AND APPARATUS FOR PARSING DATA”; U.S. Pat. No. 6,336,094 by Ferguson et al., issued Jan. 1, 2002 and entitled “METHOD FOR ELECTRONICALLY RECOGNIZING AND PARSING INFORMATION CONTAINED IN A FINANCIAL STATEMENT”; U.S. Pat. No. 7,653,871, by LaComb et al., issued Jan. 26, 2010 and entitled “MATHEMATICAL DECOMPOSITION OF TABLE-STRUCTURED ELECTRONIC DOCUMENTS”; and U.S. Pat. No. 7,856,388, by Srivastava et al., issued Dec. 21, 2010 and entitled “FINANCIAL REPORTING AND AUDITING AGENT WITH NET KNOWLEDGE FOR EXTENSIBLE BUSINESS REPORTING LANGUAGE”. These patents all address data extraction from financial documents, and more precisely, data presented in financial statements such as balance sheets, cash flow statements, and income statements for U.S. public companies. Almost all financial statements are organized by accounting categories, for example, assets, liabilities, and equities for balance sheets. This taxonomy is often hierarchical, with seven or more levels for some financial tables. Table 1 shown in FIG. 13 provides an example of a financial statement including balance sheets. Beyond the traditional layout analysis issue of table extraction, i.e., delimiting the table and recognizing its internal structure in rows and columns, a financial table understanding process must categorize the financial data into line items and identify the (sub-)totals included in the table. It is valuable to detect the mathematical structure since it reflects the hierarchical row organization, where a subtotal is linked to a sub-category.
In order to detect the mathematical structure associated with a financial table, all the above-mentioned methods, except U.S. Pat. No. 5,893,131, by Kornfeld, issued Apr. 6, 1999 and entitled “METHOD AND APPARATUS FOR PARSING DATA”, are mainly based on keywords and use mathematical properties in order to validate the resulting structure a posteriori. In other words, these methods determine whether a total really corresponds to the sum of the elements of its section. Detection of a “Total Line” is often based on keywords such as ‘total’ or on the fact that the line item has no label. Each method processes multi-line items with its own set of heuristics, based on textual and typographical features.
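By way of illustration only, the general pattern of keyword-based total detection followed by a posteriori validation may be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the method of any of the patents cited above: the `Row` type, the rule that a section runs from the previous header or total line to the current total line, and the tolerance `tol` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    label: str
    value: Optional[float]  # None for a header row carrying no amount


def is_total_candidate(row: Row) -> bool:
    """Keyword heuristic: the label contains 'total' or the row has no label."""
    if row.value is None:
        return False
    label = row.label.strip().lower()
    return label == "" or "total" in label


def validate_totals(rows: List[Row], tol: float = 0.005
                    ) -> List[Tuple[int, List[int], bool]]:
    """For each candidate total line, check a posteriori whether its value
    equals the sum of the line items of its (assumed) section.

    Returns tuples of (total row index, item row indices, is_valid).
    """
    results = []
    section: List[int] = []
    for idx, row in enumerate(rows):
        if row.value is None:
            section = []  # assumption: a header row starts a new section
            continue
        if is_total_candidate(row):
            items_sum = sum(rows[i].value for i in section)
            is_valid = bool(section) and abs(items_sum - row.value) <= tol
            results.append((idx, list(section), is_valid))
            section = []  # assumption: a total line closes the section
        else:
            section.append(idx)
    return results


# Hypothetical table rows, not taken from any actual financial statement.
rows = [
    Row("Current assets", None),
    Row("Cash", 120.0),
    Row("Receivables", 80.0),
    Row("Total current assets", 200.0),
    Row("Inventory", 50.0),
    Row("", 60.0),
]
checks = validate_totals(rows)
```

In this sketch the keyword test alone proposes the structure, and the arithmetic is used only to accept or reject it, so a total line whose label lacks the expected keyword is never considered.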
In order to detect mathematical relationships, U.S. Pat. No. 7,653,871, by LaComb et al., issued Jan. 26, 2010 and entitled “MATHEMATICAL DECOMPOSITION OF TABLE-STRUCTURED ELECTRONIC DOCUMENTS” uses a top-down approach with strong prior knowledge, where only a balance sheet statement is covered, knowing its three main top categories (assets, liabilities, equities). Others use a greedy approach, which can be used when the number of elements is not too large. See U.S. Pat. No. 7,856,388, by Srivastava et al., issued Dec. 21, 2010 and entitled “FINANCIAL REPORTING AND AUDITING AGENT WITH NET KNOWLEDGE FOR EXTENSIBLE BUSINESS REPORTING LANGUAGE”. All these methods focus on a specific financial statement, the balance sheet, and addressing other tables involves updating lexical resources.
Provided herein is a method and system of extracting a mathematical structure associated with a financial document, i.e., financial table, which is not limited to a specific type of financial document, such as a balance sheet, cash flow statement, etc.