The present inventive subject matter relates generally to the art of automated document processing. Particular but not exclusive relevance is found in connection with parsing images of tables and other unstructured representations of tables, e.g., such as may be found in a Portable Document Format (PDF) document, a Microsoft Word document, a HyperText Markup Language (HTML) document, etc. Where appropriate in the present specification, references to table images or images of tables or the like are intended to include such other unstructured representation. In any event, it is to be appreciated that aspects of the present inventive subject matter are also equally amenable to other like applications.
Tables commonly occur in many different varieties in many different types of documents, and they often contain important information. For example, business reports summarize vital information about balances, cash flow, and projections in tables. Invoices and receipts typically lay out the information about the purchases in a tabular form. Scientific papers often summarize key experimental results in tabular format. Healthcare documents and/or forms commonly contain tables as well.
Extracting information from such tables while preserving the table structure is useful for many applications. For example, the product names extracted from an invoice could be matched to a database to verify receipt before remitting payment. In the healthcare domain, claims processing could be assisted by extracting the information from tables on the claim forms. Such information extraction can also benefit other applications such as data mining and analytics. One difficulty in this data extraction task is that the tabular structure encodes important information which is not contained in the text of any individual cells. Therefore, simple Optical Character Recognition (OCR) of the table may recover the text, but not the structure of the table.
Currently, many businesses perform the aforementioned extraction task manually. This can lead to significant costs of document processing. For example, it has been estimated that the cost of processing a single invoice is not insignificant. In some cases, large businesses may process tens of thousands of invoices per day, which can result in disadvantageously high operating costs. Accordingly, some may find it desirable to reduce the manual effort involved in extracting information from tables in documents.
Accordingly, a new and/or improved method and/or system or apparatus for parsing images of tables is disclosed which addresses the above-referenced problem(s) and/or others.