Recomposition methods for optical character recognition (OCR) products, in which the look and feel of the original document is preserved in a word processor file format, are increasingly popular features. Leading OCR technologies are highly regarded for their recomposition facilities; however, the analysis and output of cell structures for semi-ruled (semi-lined) and un-ruled (line-less) tables with a cell or cells having multiple lines of text is lacking in the art.
Table analysis is the task of converting an image of a table in a document to a marked-up electronic version suitable for conversion to a word processor format such as Microsoft Word.RTM.. A table is either found automatically, or identified by a user with a graphical user interface by selecting a table from an image displayed on a computer monitor. In either case, the system is supplied with the word bounding boxes and horizontal and vertical rulings and must recompose the table cells using only this geometric information, i.e., no character information need be available.
A cell is a list of one or more words comprising a logical entity in a table. Cells are delimited by rulings, gutters or leading (the last two words meaning white space in typography jargon). The words in a cell are in close proximity relative to the words in another cell. Methods to extract all necessary geometric information about the table region as well as the page on which it occurs are known in the art. The analysis process yields information describing the cells and the table. For each cell, this information consists of a list of unique word identifiers, the coordinates of the cell bounding box on the page, and indicators for the left, right, top and bottom borders stating whether each border is a ruling to be drawn in the output or is invisible.
There are three types of tables: line-less, semi-lined and lined. FIG. 1 shows a line-less table. Logical quantities are grouped into cells forming rows and columns. FIG. 2 shows a semi-lined table. These may be somewhat easier to detect automatically given the long horizontal rulings. FIG. 3 shows a lined table. These are reliably detected automatically by the commercially available TextBridge.RTM. software from Xerox Corporation, and their detection is also described in U.S. Pat. No. 5,048,107 to M. Tachikawa entitled "Table Region Identification Method." FIG. 4 shows word bounding box information used to recover the table cell structure in FIG. 1.
The table identification method of Tachikawa is essentially a means of extracting runlengths, combining them into connected components and extracting large connected components as table candidates. Among these candidate regions, horizontal rulings are extracted by combining runlengths longer than some specified threshold and collecting those with length approximately the width of the connected component. If the number of rulings is greater than some threshold, the region is deemed a table. The chief advantage of this method appears to be speed and the ability to work with runlength compressed data, but the procedure can find only fully-lined (ruled) tables.
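The ruling-count test described above may be illustrated by the following minimal sketch. It is not taken from the Tachikawa patent: the binary image is assumed to be a list of rows of 0/1 pixels, the function names are hypothetical, and the thresholds min_rulings and width_fraction are illustrative values only.

```python
def horizontal_runs(row):
    """Return (start, length) for each maximal run of black (1) pixels in a row."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x - start))
            start = None
    if start is not None:
        runs.append((start, len(row) - start))
    return runs


def count_horizontal_rulings(region, width_fraction=0.9):
    """Count rulings: rows holding a run about as wide as the region.

    Vertically adjacent ruling rows are counted once, as one thick ruling.
    width_fraction is an assumed, illustrative threshold.
    """
    width = len(region[0])
    rulings, previous = 0, False
    for row in region:
        current = any(length >= width_fraction * width
                      for _, length in horizontal_runs(row))
        if current and not previous:
            rulings += 1
        previous = current
    return rulings


def looks_like_ruled_table(region, min_rulings=3, width_fraction=0.9):
    """Deem the candidate region a table if it holds enough long rulings."""
    return count_horizontal_rulings(region, width_fraction) >= min_rulings
```

A region without long horizontal runs, such as a line-less table, yields a count of zero under this test, which is consistent with the limitation noted above.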
In a paper by Itonori entitled "Table Structure Recognition based on Textblock Arrangement and Ruled Line Position", presented at the IEEE Second International Conference on Document Analysis and Recognition, Tsukuba, Japan, October 1993, a method of recognizing table structures from document images is disclosed. In Itonori, each cell of a table is arranged regularly in two dimensions and is represented by a row, column pair. The Itonori process expands cell bounding boxes and assigns new row and column numbers to each edge. Itonori finds columns and rows using projections of character bounding boxes.
The table identification method of Green and Krishnamoorthy, disclosed in a paper entitled "Recognition of Tables Using Table Grammars", presented at the Fourth Annual Symposium on Document Analysis and Information Retrieval, in Las Vegas, Nev., USA, April 1995, identifies runlengths and page margins via a lexical analyzer that quantizes the proportion of black pixels in a scanline observation window. The lexical analyzer produces eight different tokens that are passed to a parser. Scanning can be done horizontally or vertically. The outcome is a set of vertical and horizontal rulings that are then used for table analysis. This analysis extends all rulings to the edges of the table, partitioning the table into elementary cells. Further analysis joins those cells which were not originally separated by a ruling. The result is a set of image regions corresponding to cells plus rulings. The method uses a grammar-based approach to identify the rulings and cells of a fully-lined table image. Recognition depends on having an explicit table model expressed as a grammar. This method does not handle fully-lined, semi-lined or line-less tables without recourse to an explicit table model, which must be created by a user. This method accesses the image pixels. Moreover, exploring all the parsing possibilities requires several seconds on a parallel computer.
The method of Douglas et al., disclosed in a paper entitled "Using Natural Language Processing for Identifying and Interpreting Tables in Plain Text," also presented at the Fourth Annual Symposium on Document Analysis and Information Retrieval, in Las Vegas, Nev., USA, Apr. 26, 1995, uses natural language processing notions to represent and analyze tables. This process attempts to characterize the information contained within a table, regardless of its form. Several table transformations are listed with respect to which table information is invariant. Douglas et al. posit a canonical representation for tabular information. Douglas et al. process a particular class of well-structured tables, and their application is the interpretation of tabular information in the construction industry. There is a list of domain labels that appear as column headings in the canonical representation and a list of n-tuples of values, where n is the number of columns. The left-most column plays a special role as a place for high-precedence domain labels and values. Finding cells proceeds as follows. The data at hand are lines consisting of character bounding boxes and spaces between characters. Characters may be alphanumeric or otherwise, but a tag is kept to identify alphanumeric characters. A sequence of characters is content-bearing if it contains at least one alphanumeric character. Column breaks are determined by intersecting vertically overlapping lines. The spaces that survive intersection of all such lines are deemed gaps between columns. Whether or not the columns are those of a table (rather than columns of text) is determined through a set of rules that use alphanumeric density and column width relative to the width of the text body being analyzed. Within a column, adjacent lines are merged into cells. Once the cells have been determined and labeled with their unique column/row coordinates, the table is analyzed semantically using recognized characters.
Domain knowledge (e.g., construction materials) is used to establish whether a phrase is a domain label or a domain value and whether, based upon the cell's horizontal coordinate, a cell's semantic type is consistent with others in its column. This method is only intended for a specialized fully-lined table style used in the construction industry. Character information is needed in this method. This process does not try to identify the table structure as cells and separators independent of content or style.
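The column-break step described for Douglas et al., in which only the spaces that survive intersection across all lines are kept, may be illustrated by the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: each text line is assumed to be a list of (x0, x1) horizontal spans of content-bearing character sequences, and the function names and the left and right region limits are hypothetical.

```python
def line_gaps(spans, left, right):
    """White-space intervals of one text line within [left, right]."""
    gaps, cursor = [], left
    for x0, x1 in sorted(spans):
        if x0 > cursor:
            gaps.append((cursor, x0))
        cursor = max(cursor, x1)
    if cursor < right:
        gaps.append((cursor, right))
    return gaps


def intersect(gaps_a, gaps_b):
    """All non-empty pairwise intersections of two interval lists."""
    out = []
    for a0, a1 in gaps_a:
        for b0, b1 in gaps_b:
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:
                out.append((lo, hi))
    return out


def column_gaps(lines, left, right):
    """Spaces that survive intersection over every line of the region."""
    surviving = [(left, right)]
    for spans in lines:
        surviving = intersect(surviving, line_gaps(spans, left, right))
    return surviving


# Hypothetical three-line, three-column region spanning x = 0..50.
lines = [
    [(0, 10), (20, 30), (42, 50)],
    [(2, 12), (22, 28), (40, 50)],
    [(0, 8), (18, 30), (41, 49)],
]
print(column_gaps(lines, 0, 50))  # [(12, 18), (30, 40)]
```

The sketch covers only the geometric step; the rules based on alphanumeric density and column width, and the semantic analysis, are not modeled.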
In U.S. Pat. No. 5,502,777 to Ikemure, a means to determine whether a ruled area is a table or a figure is provided. The method compares the number of pixels comprising the horizontal and vertical rules in the region to the total number of black pixels in a binarized image. If the ratio is sufficiently large, a significant proportion of the pixels belong to rulings and thus the region is a table.
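The ratio test may be expressed as the following sketch. The function name and the min_ratio threshold are assumptions made for illustration; the value of the threshold used in the referenced patent is not stated here.

```python
def is_table_region(ruling_pixels, black_pixels, min_ratio=0.5):
    """Classify a ruled area as a table when rulings dominate its black pixels.

    ruling_pixels: black pixels belonging to horizontal and vertical rules.
    black_pixels: all black pixels in the binarized region.
    min_ratio: assumed, illustrative threshold.
    """
    if black_pixels == 0:
        return False  # an empty region is neither a table nor a figure
    return ruling_pixels / black_pixels >= min_ratio
```

A figure, in which most black pixels belong to drawing content rather than rulings, yields a small ratio and is rejected.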
In a paper by Hori and Doermann entitled "Robust Table-form Structure Analysis based on Box-Driven Reasoning", prepared for the Document Processing Group, Center for Automation Research, University of Maryland, a method is disclosed for analyzing table-form documents which are fully-lined. The task is to find all the cells, by which they mean the rectangles that are formed by the rulings and enclose strings of text. Their contribution is the ability to handle degraded documents where characters can overlap rulings. The algorithm operates on two versions of the binary image, one at the original scanned resolution and one at a reduced resolution obtained by summing over a small square moving window, thresholding and subsampling. In the reduced image, a pixel is black if any pixel in a square region about the corresponding pixel in the original is black. This has the effect of merging broken or dotted rulings; however, it introduces the problem of characters overlapping with lines of the form. Inner and outer boxes are obtained for the image. The boxes are then classified according to their size and aspect ratio into one of character, cell, table, bar, noise, character hole, and white character island. Some of these correspond to inner boxes (they bound white space) and outer boxes (bounding a connected component) or both. Inner boxes can be nested in outer boxes and vice-versa. Box coordinates for the original and reduced resolution images are maintained. Cells are inner boxes and have outer boxes of strings nested inside. Boxes in the original are inspected for characters touching lines, and if so, they are separated. The boxes in the reduced image are more reliable in the sense that they are formed with broken and dotted lines rendered as solid lines. But they are also more likely to have touching characters. Boxes in the reduced and original images are compared and their differences reconciled. Strings are characters that are nested within the same cell.
Character boxes are collected into lines of text. Since the cell coordinates do not match precisely the positions of the rulings, adjustments are made to line up the cells and their neighbors with rulings to avoid gaps and allow spaces for rulings to be drawn between the cells. The result is a collection of bounding boxes corresponding to an ideal version of the scanned table-form.
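The reduced-resolution image used by Hori and Doermann, in which a pixel is black if any pixel in the corresponding square region of the original is black, may be sketched as follows. This is a simplified illustration: non-overlapping k-by-k blocks stand in for the moving window followed by subsampling, the image is assumed to be a list of rows of 0/1 pixels, and the function name and the default block size are hypothetical.

```python
def reduce_image(image, k=2):
    """Sum each k-by-k block, threshold at one black pixel, and subsample."""
    height, width = len(image), len(image[0])
    reduced = []
    for top in range(0, height, k):
        row = []
        for left in range(0, width, k):
            block_sum = sum(
                image[y][x]
                for y in range(top, min(top + k, height))
                for x in range(left, min(left + k, width))
            )
            row.append(1 if block_sum >= 1 else 0)
        reduced.append(row)
    return reduced


# A dotted ruling above blank rows becomes a solid ruling after reduction.
dotted = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(reduce_image(dotted))  # [[1, 1, 1, 1], [0, 0, 0, 0]]
```

The same merging that closes the dotted ruling also joins a character to a nearby ruling when both fall in one block, which is the touching-character problem noted above.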
In U.S. Pat. No. 5,420,695 to Ohta, a method is disclosed which allows a user to edit a table by entering new column and row sizes on a digital input pad on a copier. This process must recognize a table and perform the proper "corrections" for output. Table detection uses inner and outer contours of binary images to determine the location of tables and cells within them. Once the cells have been identified, new rows or columns can be added or deleted per the user's instruction. The intent is to provide a table-editing mechanism through a photocopier. If the table is semi-lined, simple cells are identified through histogram techniques using vertical and horizontal projection profiles. This method of inner and outer contour manipulation bears a similarity to the method of Hori and Doermann.
In a paper by Hirayama entitled "A Method for Table Structure Analysis Using DP Matching", presented at the IEEE Proceedings of the Third International Conference on Document Analysis and Recognition, Montreal, Canada, Vol. II, pages 583-586, Aug. 14-16, 1995, a method that detects and analyzes tables which have vertical and horizontal rulings is disclosed. The first task is to segment a binary document image into regions containing text, tables and figures. The first step in segmentation is to find the connected components of the runlength-smeared document image. Bounding boxes of the connected components are classified as vertical or horizontal lines, character strings or other objects according to their heights. Character strings are grouped together to form text regions. The remaining regions are non-text: tables or figures. Tables are required to have horizontal and vertical lines. Lines are grouped together when they intersect, are close and parallel, or their endpoints are close. The regions containing a group of linked lines are called table area candidates. A bounding box of rulings is added to the table region in case some cells are open. Within a table area candidate, all rulings are extended by virtual lines to terminate into the most extreme ruling. The table area is thus segmented into a "lattice" being composed of a grid of rectangles. Next, rectangles that are separated only by virtual lines are joined. The resultant polygons form cells if they are rectangular and enclose only character strings or are empty. Some polygons correspond to cells and others not, but the region as a whole is judged to be a table area if there is at least one non-empty cell and non-cell areas constitute a fraction of the total candidate area. Now in the lattice version of the table, there is a grid of m columns and n rows. The separators between these may be virtual. It is necessary to assign these virtual cells to proper table rows by aligning columns. 
Alignment is done pairwise from left to right using the well-known string-to-string correction dynamic programming algorithm where the weights for the substitution cost are distances in baselines between two text strings and there is a fixed insertion and deletion cost. For example, in FIG. 5 there are three columns and six virtual rows. With the deletion and insertion cost sufficiently low, the alignment algorithm matches string AAAA with DDDD and CCCC with FFFF in the first two columns. The string BBBB is "deleted" and string EEEE is "inserted." A new row is supplied to match BBBB. Continuing to columns two and three, the string HHHH doesn't have a match in the second column, so the algorithm searches the previous columns from right to left for a match. If none is found, a new row is supplied. The result in this example is that six rows are found.
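The pairwise alignment of two columns may be illustrated by the following sketch of string-to-string correction by dynamic programming. It is not Hirayama's implementation: each column is assumed to be a top-to-bottom list of (label, baseline) pairs, the function name is hypothetical, and the baseline values and gap cost in the usage lines are invented for illustration and are not taken from FIG. 5.

```python
def align_columns(col_a, col_b, gap_cost):
    """Align two columns of (label, baseline_y) strings.

    Substitution cost is the baseline distance between two strings;
    insertion and deletion share the fixed gap_cost.
    Returns (label_a or None, label_b or None) pairs, one per table row.
    """
    n, m = len(col_a), len(col_b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + abs(col_a[i - 1][1] - col_b[j - 1][1])
            delete = cost[i - 1][j] + gap_cost
            insert = cost[i][j - 1] + gap_cost
            cost[i][j] = min(substitute, delete, insert)

    # Trace back from the bottom-right corner to recover the row pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + abs(col_a[i - 1][1] - col_b[j - 1][1])):
            pairs.append((col_a[i - 1][0], col_b[j - 1][0]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost:
            pairs.append((col_a[i - 1][0], None))  # string "deleted"
            i -= 1
        else:
            pairs.append((None, col_b[j - 1][0]))  # string "inserted"
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical baselines (illustrative values only).
first = [("AAAA", 10), ("BBBB", 25), ("CCCC", 50)]
second = [("DDDD", 10), ("EEEE", 35), ("FFFF", 50)]
rows = align_columns(first, second, gap_cost=3)
```

With these assumed values the gap cost is low relative to the baseline distance between BBBB and EEEE, so the alignment pairs AAAA with DDDD and CCCC with FFFF and leaves BBBB and EEEE unmatched, giving four rows for the two columns. The right-to-left search of previous columns described for later columns is not modeled.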
U.S. Pat. No. 5,485,566 to Rahgozar discloses an algorithm for finding the columns of a tabular structure using only word bounding box information. The method uses intervals between word bounding boxes to estimate column breaks. Only the x coordinates are used. Starting with all the gaps in a tabular region of a document, all possible intersections are taken. This collection of intervals and their intersections (not including the null set) is called the closure. Each member of the closure has a rank, the number of original gaps it is a subset of. The members of the closure which are small in some sense and have the highest rank form column breaks. Presumably, this method can be used for rows as well, but not for detecting the rows of a table with multiple line cells.
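The closure and rank described above may be illustrated by the following sketch. It is not the implementation of the Rahgozar patent: gaps are assumed to be (x0, x1) intervals, the function names are hypothetical, and the selection of members that are "small in some sense" is left out because that criterion is not specified here.

```python
def closure(gaps):
    """The gaps together with every non-empty intersection of them."""
    members = set(gaps)
    changed = True
    while changed:
        changed = False
        for a in list(members):
            for b in list(members):
                lo, hi = max(a[0], b[0]), min(a[1], b[1])
                if lo < hi and (lo, hi) not in members:
                    members.add((lo, hi))
                    changed = True
    return members


def rank(interval, gaps):
    """Number of original gaps of which the interval is a subset."""
    return sum(1 for g0, g1 in gaps if g0 <= interval[0] and interval[1] <= g1)


# Hypothetical gaps from a tabular region (x coordinates only).
gaps = [(10, 20), (12, 22), (8, 18), (30, 40)]
ranked = sorted(closure(gaps), key=lambda member: rank(member, gaps), reverse=True)
print(ranked[0], rank(ranked[0], gaps))  # (12, 18) 3
```

In this example the interval (12, 18) lies inside three of the original gaps and so has the highest rank, while the isolated gap (30, 40) has rank one.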
In a paper by Rahgozar and Cooperman entitled "A graph-based table recognition system", SPIE Vol. 2660, pages 192-203, April 1996, it is disclosed that graph rewriting techniques can be brought to bear on table identification and analysis. Graph grammars naturally describe notions of relative placement or alignment of cells. A table is a graph on cells and headings in a suitably chosen graph language. Table identification is the task of starting at a cell and choosing rewrite rules in advance until no more rules can be found. The result is a table since it is a sequence of productions from a start symbol. The sequence of productions produces information about the table structure, namely columns and rows. Rows can be found first by looking left and right for cells to merge.
Although the prior art has progressed in table recognition, none of it addresses the problem of identifying cells and cell separators in a manner that can handle multiple line cells and complex tables, such as tables containing substantial "white space". The art has not succeeded in accurately recognizing fully-lined, semi-lined and line-less cell tables. The art cannot handle multiple line cells in semi-lined and line-less tables. The prior art does not iteratively and carefully merge word boxes into cells, find separators, merge cells bounded by the same separators, update separators, and repeat these steps until the correct cell structure is found. It is therefore an object of this invention to provide a method of identifying cells and cell separators accurately during page recomposition processes that will overcome the shortcomings of the prior art.
All of the references cited herein are incorporated by reference for their teachings.