Key information can be contained within tables that are themselves embedded in documents, whether full-text journal articles, patents, slides or health records. For example, important experimental results may be contained within a table in a PowerPoint presentation, or key lab values relevant to a patient may be contained within a table in an electronic health record. Information contained within tables is hard to extract automatically with high accuracy due to the wide variety and low quality of typical tables found in electronic documents.
One particular difficulty in extracting information contained within tables arises from the way in which table structures are typically represented in semi-structured formats such as SGML and HTML, in document or presentation formats such as Word or PowerPoint, and in various XML formats (e.g., XHTML, or the XML OASIS or CALS table models). Cells can span multiple rows or columns, and even for simple cells there is typically no explicit association between the cell and its respective column and row headers.
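To illustrate the spanning-cell problem, the following is a minimal sketch, not the method of this document, of how an HTML table might be expanded into a rectangular grid so that every grid position can be lined up with the header cells above it. The names `TableGridParser` and `expand_to_grid` are hypothetical, the sample values are made up for illustration, and the sketch assumes a single, non-nested table with well-formed `rowspan` and `colspan` attributes.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collects <td>/<th> cells, with their spans, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell dicts per <tr>
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def expand_to_grid(html):
    """Return a list of rows in which spanning cells are repeated in every slot they cover."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            # Skip slots already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Illustrative input only: a two-level header with one spanning cell in each direction.
SAMPLE = (
    "<table>"
    "<tr><th rowspan='2'>Patient</th><th colspan='2'>Lab values</th></tr>"
    "<tr><th>Na</th><th>K</th></tr>"
    "<tr><td>A</td><td>140</td><td>4.1</td></tr>"
    "</table>"
)

for grid_row in expand_to_grid(SAMPLE):
    print(grid_row)
```

In the expanded grid, the data cell holding 140 sits in the same column as both "Lab values" and "Na", an association that the original markup does not state anywhere. Repeating the spanning text in every covered slot is only one possible design choice; a real system might instead keep a reference to the originating cell.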
Another difficulty arises from the fact that many tables found in electronic formats contain representation errors. These can arise from a variety of factors, including imperfect optical character recognition (OCR) and the breaking apart of cells to improve the readability of items within a table.