The present invention relates generally to the field of table ingestion from documents, and more particularly to the optimization of the ingestion of tables from a document using data and metadata analysis to determine uniqueness.
Data analytics examines data in order to draw conclusions about the analyzed information. Data is commonly presented in tables, which may make direct analysis more complex. For many domains, such as science, medicine, and finance, the context for a table may be as critical to understanding the data as the data itself. Difficulty in processing, or ingesting, tables may arise when a document or set of documents contains tables in various formats or styles. Tables analyzed using Optical Character Recognition (OCR) or Object Linking and Embedding (OLE) may contain errors introduced during data conversion.
Ingestion of well-defined HTML tabular data may be less costly and less resource intensive than ingestion of a pictorial table via OCR extraction. If a document contains a combination of formats, an ingestion system may need to defer to the well-defined tabular data to ensure quality. This approach may not yield the most desirable information, however, because a table with less common or less well-defined formatting may contain desirable information within the context of the document.
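The trade-off described above can be illustrated with a minimal sketch. It is not taken from any particular ingestion system: the names `FORMAT_COST`, `TableCandidate`, `pick_by_format_only`, and `pick_by_relevance`, the relative cost values, and the `relevance` score are all hypothetical and are assumed here only for illustration.

```python
from dataclasses import dataclass

# Hypothetical relative ingestion costs per source format (illustrative values only).
FORMAT_COST = {"html": 1, "ole": 3, "ocr_image": 10}


@dataclass
class TableCandidate:
    table_id: str
    source_format: str  # key into FORMAT_COST
    relevance: float    # assumed contextual relevance score in [0, 1]


def pick_by_format_only(candidates):
    """Defer to the cheapest, best-defined format and ignore document context."""
    return min(candidates, key=lambda c: FORMAT_COST[c.source_format])


def pick_by_relevance(candidates):
    """Prefer the table that is most relevant to the document context."""
    return max(candidates, key=lambda c: c.relevance)


# A document mixing a well-defined HTML table with a pictorial (OCR) table.
candidates = [
    TableCandidate("t1", "html", relevance=0.2),
    TableCandidate("t2", "ocr_image", relevance=0.9),
]
```

In this sketch, a selection based only on format returns the HTML table `t1`, while the pictorial table `t2`, which is assumed to carry the more relevant content, would be passed over.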