From credit card statements, to hospital bills, to auto repair invoices, most of us encounter printed documents containing complex, but mostly regular, data structures on a daily basis. For organizations such as businesses, the federal government, research organizations, and the like, processing data obtained in printed form from various sources and in various formats consumes substantial resources. Both manual and custom automated solutions have been practiced. Manual solutions are highly resource-intensive and well known to be susceptible to error. Automated solutions are typically customized to a particular form and require source code changes when the subject form changes.
The structural patterns present in such documents are naturally detectable by most of us after a brief examination. Repetitive blocks of data elements typically have a distinct appearance thought through by those who designed the document with the ostensible objective of readability. In addition to our ability to make up for broken characters and sentences, we can also adjust for small irregularities in the layout of the data, for example, a long table listed on several pages with the sequence of table rows split by page footers and headers.
Our understanding of language helps us in interpreting the content of such documents. For example, most of us have little trouble in distinguishing table header information from table body data. A message such as “continued on reverse side” is readily interpreted to indicate that more data is to be expected on the following page. Also, a reader would not likely confuse “71560” with a date or zip code if it is preceded by “PO BOX.”
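The role of preceding context in the “PO BOX” example can be illustrated with a minimal sketch. The function name classify_number, the two context rules, and the returned labels are assumptions made only for illustration; they are not taken from any of the cited references.

```python
import re


def classify_number(text: str, token: str) -> str:
    """Guess the role of a numeric token from the words that precede it.

    Hypothetical helper: the rules below are illustrative, not exhaustive.
    """
    idx = text.find(token)
    if idx < 0:
        raise ValueError("token not found in text")
    preceding = text[:idx].upper()
    # A box number follows "PO BOX" (periods and extra spaces tolerated).
    if re.search(r"P\.?\s*O\.?\s*BOX\s*$", preceding):
        return "po_box_number"
    # A zip code commonly follows a standalone two-letter state abbreviation.
    if re.search(r"\b[A-Z]{2}\s*$", preceding):
        return "zip_code"
    return "unknown"
```

Under these assumed rules, the same five digits receive different labels depending only on the text to their left, which is the kind of disambiguation a human reader performs without effort.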
Our common knowledge of table structure aids us in distinguishing meta-data from data. We expect to find header information at the top of a column in cases where data descriptors do not appear immediately to the left of the data. Small print, special fonts, italics, and boldface type also make a difference in readability of documents containing tabular information. Knowledge of data formats, postal addresses, variations in date forms, meaning of names and abbreviations, spatial clues, and the combinations of these and other features help us in manual processing of documents exhibiting regular structure.
Besides the regular and expected complexity of document and table structures, documents may pose additional challenges for automating the data extraction process. The challenges include sparse tables, tables with rows spanning a varied number of lines, parts of a row not present (missing data elements or lines), extraneous text (special printed notes or handwritten annotations), a varied number of records per document page, and records broken by the end of a page. In addition to irregularities related to record structure, such as those listed above, common problems related to scanning (e.g., skewed and rotated images), as well as optical character recognition (OCR) errors, should be anticipated.
As an illustrative example, FIG. 1 shows a multi-page “claim detail section” 100 of a document broken by the end 102 of page 45 101. The break 102 occurs in the middle of a table 104. After the unfinished table, on each page, totals 106 for the page are included. The table is continued on the next page 103 after page header information 108 and an abbreviated identification 110 of the continued record.
Among the various research fields that deal with tables are image analysis and information extraction.
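The difficulty posed by a table broken across pages can be sketched in a few lines. The example below assumes that each page has already been converted to a list of text lines and that the interrupting lines (page header, per-page totals, continuation notice) can be recognized by simple patterns. The patterns, the function name stitch_rows, and the sample lines are hypothetical and do not reproduce the content of FIG. 1.

```python
import re

# Hypothetical patterns for lines that interrupt a table at a page break.
INTERRUPTION_PATTERNS = [
    re.compile(r"^PAGE \d+", re.IGNORECASE),          # page header
    re.compile(r"^TOTALS? FOR PAGE", re.IGNORECASE),  # per-page totals
    re.compile(r"CONTINUED", re.IGNORECASE),          # continuation notice
]


def stitch_rows(pages):
    """Concatenate table rows from consecutive pages, skipping interruptions."""
    rows = []
    for page in pages:
        for line in page:
            stripped = line.strip()
            if not stripped:
                continue  # ignore blank lines
            if any(p.search(stripped) for p in INTERRUPTION_PATTERNS):
                continue  # header, totals, or continuation notice
            rows.append(stripped)
    return rows


# Invented sample data, for illustration only.
pages = [
    ["PAGE 45", "01/02 OFFICE VISIT 80.00", "TOTALS FOR PAGE 80.00"],
    ["PAGE 46", "CLAIM 123 (CONTINUED)", "01/09 LAB TEST 40.00"],
]
rows = stitch_rows(pages)
```

A hard-coded pattern list of this kind is exactly what makes form-specific automated solutions brittle: a change in the wording of the header or totals line requires a change to the code.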
Most of the image analysis methods focus on low-level graphical features to determine table segmentation. Some methods employ a line-oriented approach to table extraction. In those methods, lines or other graphical landmarks are identified to determine table cells. Other methods employ a connected component analysis approach.
For example, in the image analysis field, a box-driven reasoning method was introduced to analyze the structure of a table that may contain noise in the form of touching characters and broken lines. See Hori, O., and Doermann, D. S., “Robust Table-form Structure Analysis Based on Box-Driven Reasoning,” ICDAR-95 Proceedings, pp. 218–221, 1995. In that method, the contours of objects are identified from original and reduced resolution images and contour bounding boxes are determined. These primary boxes and other graphical features are further analyzed to form table cells.
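To make the notions of connected components and bounding boxes concrete, the following is a minimal sketch of 4-connected component labeling on a small binary image, represented as a list of rows in which 1 denotes ink. It is a generic textbook technique, not the box-driven reasoning method of Hori and Doermann; the function name connected_components and the image representation are assumptions.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions."""
    height = len(grid)
    width = len(grid[0]) if grid else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the region containing pixel (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


image = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
boxes = connected_components(image)
```

Boxes of this kind are low-level graphical features only; they carry no information about whether a region is a table cell, a character, or noise.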
Another category of image analysis approaches accepts input from optical character recognition. In one example, table structure recognition is based on textual block segmentation. Kieninger, T. G., “Table Structure Recognition Based on Robust Block Segmentation,” Proceedings of SPIE, Vol. 3305, Document Recognition V, pp. 22–32, 1998. One facet of that approach is to identify words that belong to the same logical unit. It focuses on features that help cluster words into textual units. After block segmentation, row and column structure is determined by traversing margin structure. The method works well on some isolated tables; however, it may also erroneously extract “table structures” from non-table regions.
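The general idea of grouping OCR words into layout units can be sketched as follows. The example merges words whose horizontal extents overlap into column groups. It is a deliberately simplified stand-in, not Kieninger's block segmentation algorithm; the input format (x0, x1, text) and the function name group_into_columns are assumptions.

```python
def group_into_columns(words):
    """Group words into columns by merging overlapping horizontal extents.

    Each word is a tuple (x0, x1, text), where x0 and x1 are the left and
    right edges of its bounding box (assumed input format).
    """
    columns = []
    for x0, x1, text in sorted(words):
        if columns and x0 <= columns[-1]["x1"]:
            # The word overlaps the current column: extend it.
            columns[-1]["x1"] = max(columns[-1]["x1"], x1)
            columns[-1]["words"].append(text)
        else:
            # A horizontal gap was found: start a new column.
            columns.append({"x0": x0, "x1": x1, "words": [text]})
    return columns


# Invented sample data, for illustration only.
words = [
    (0, 30, "Date"), (50, 120, "Service"), (140, 170, "80.00"),
    (2, 28, "01/02"), (55, 100, "Visit"),
]
columns = group_into_columns(words)
```

Because the grouping relies on geometry alone, any text whose words happen to align vertically would be grouped in the same way, which is consistent with the observation that layout-driven methods may report table structures in non-table regions.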
Despite many years of research toward automated information extraction from tables (and the initial step of recognizing a table in the first place), the problems have still not been solved. The automatic extraction of information is difficult for several reasons.
Tables have many different layouts and styles. Lopresti, D., and Nagy, G., “A Tabular Survey of Automated Table Processing,” in Graphics Recognition: Recent Advances, vol. 1941 of Lecture Notes in Computer Science, pp. 93–120, Springer-Verlag, Berlin, 2000. Even tables representing the same information can be arranged in many different ways. It seems that the complexity of possible table forms multiplied by the complexity of image analysis methods has worked against the production of satisfactory and practical results.
Even though image analysis methods identify table structures and perform their segmentation, they typically do not rely on an understanding of the logic of the table. This part is left to the information extraction field. In his dissertation, Hurst provides a thorough review of the current state of the art in table-related research. Hurst, M. F., “The Interpretation of Tables in Texts,” PhD Thesis, 301 pages, The University of Edinburgh, 2000. Hurst notes that table extraction “has not received much attention from either the information extraction or the information retrieval communities, despite a considerable body of work in the image analysis field, psychological and educational research, and document markup and formatting research.” As possible reasons, viewed from an information extraction perspective, Hurst identifies a lack of current art and model, no training corpora, and confusing markup standards. Moreover, “through the various niches of table-related research there is a lack of evolved or complex representations which are capable of relating high- and low-level aspects of tables.”
The problem of table analysis has been approached from two opposite directions: one that requires table understanding and another that does not. Table understanding typically involves detection of the table logic contained in the logical relationships between the cells and meta descriptors. Meta descriptors are often explicitly enclosed in column and stub headers or implicitly expressed elsewhere in the document. The opposite approach requires little or no understanding of the logic but focuses on the table layout and its segmentation. This dual approach to table processing is also reflected in patent descriptions.
One group of patents concentrates on the image processing side. For example, Wang et al. in U.S. Pat. No. 5,848,186 analyze an image to build a hierarchical tree structure for a table. The table structure is constructed as text in the table is detected and arranged in groups reflecting column and row organization. The table structure emerges to some degree, but there is no effort to attach any functionality to the extracted groups of text. Wang, S. Y., and Yagasaki, T., “Feature Extraction System for Identifying Text Within a Table Image,” U.S. Pat. No. 5,848,186, Dec. 8, 1998.
Another example of a patent with the focus on image processing is one by Mahoney in U.S. Pat. No. 6,009,196. Mahoney, J. V., “Method for Classifying Non-running Text in an Image,” U.S. Pat. No. 6,009,196, December 1999. A stated objective of that patent is to provide classification of document regions “as text, a horizontal sequence, a vertical sequence, or a table.” The method does not appear to perform any data extraction.
A second group of patents concentrates on retrieving tabular data from textual sources. In general, graphical representation of the document is ignored and what counts is mainly text including blanks between texts. For example, in U.S. Pat. No. 5,950,196 by Pyreddy, table components, such as table lines, caption lines, row headings, and column headings are identified and extracted from textual sources. Pyreddy, P., and Croft, B., “Systems and Methods for Retrieving Tabular Data from Textual Sources,” U.S. Pat. No. 5,950,196, September 1999. The system may produce satisfactory results with regard to the data granularity required for human queries and interpretation. However, it would not likely be applicable for database upload applications.
One approach that appears to be missing from the references is to exploit the synergy between our intuitive understanding of documents and advances in image processing and information retrieval. Using a user's input to indicate structural features and a computer's processing power to search out and extract data from such structures offers a promising approach to information extraction from documents exhibiting regular data structures.