The present invention relates to a method for recognizing tables within a document, and more specifically, to a method for automatically detecting tabular data in semi-structured documents using text coordinates.
In a time when documents can exist in many forms and formats, the need for automatic document conversion software that can convert between these different formats has increased dramatically. One type of information in documents that is difficult to detect and convert accurately is the table. As explained herein below, the prior art approaches to converting tables offer only tolerable solutions that leave much to be desired.
This is unfortunate since tables are useful in conveying much information in a compact format. Tables convey information effectively in part because some of the information is carried by, and is in fact inherent in, the table structure. For example, column headings, row headings, table title, and the grouping of the information all can convey important information.
Since tables by their very nature convey information by their structure, it is important that any document conversion software accurately reflect the original structure in the converted form. As will be described herein below, some conversion software cannot handle the table structure and presents the table data as regular text, thereby stripping the structure that existed in the original table. As can be appreciated, much information is lost in such an approach. Other software attempts to convert the table and retain the structure, but does so poorly, with the result that information can be presented inaccurately. For example, if values that belong in a first column (i.e., the values are in the first column in the original document) are mistakenly moved to another column, then the converted table provides incorrect data. In the best case, the error is obvious, and a reader can easily detect and ignore it. In a more detrimental case, the error is not obvious, and a reader can rely on the erroneous information to his or her peril. From the above, it can be seen that the accurate detection and conversion of tables from a document in a first format to a document in a second format are important tasks that, unfortunately, pose challenging problems to existing conversion software. There are currently several unsatisfactory approaches to this problem.
U.S. Pat. No. 5,841,900, entitled "Method for Graph-Based Table Recognition," describes a bottom-up approach for recognizing tables in documents. In this approach, the document is first transformed into a layout graph with nodes and edges that represent document entities and their interrelations, respectively. Next, the layout graph is re-written using a set of rules based on a priori document knowledge and general formatting conventions. The graph is then utilized to locate tables in documents.
This bottom-up approach has several disadvantages. First, although the '900 patent provides a more efficient way of transforming documents into corresponding layout graphs, this approach is nevertheless more computationally intensive than an approach that does not need a layout graph. In addition, segmenting every document into a corresponding layout graph with its objects is a generally complex programming process and is not easily implemented. Second, the step of re-writing the graph requires access to a set of rules and formatting conventions that consume additional memory.
Some document conversion programs attempt to perform automatic document conversion from one format to another. For example, there are commercial products that attempt to convert text in Adobe Portable Document Format (PDF) to Hypertext Markup Language (HTML). Unfortunately, these products handle tables very poorly. In fact, these products "flatten the table" (i.e., these products represent tables as straight text with no structure whatsoever). For example, a table having four rows and four columns would be converted to four lines of straight text. As discussed previously, it is undesirable to remove the table structure since removing the structure causes important information conveyed by the table structure or inherent therein to be lost.
Other document conversion software programs require a user to manually identify where the tables are in a document so that the tables can be converted to a structured form. For example, document conversion software programs such as Gemini from Iceni Technology Limited of Norwich, England and Redwing from Datawatch, Inc. of Lowell, Massachusetts both require manual intervention in order to perform table conversion. Manual intervention is undesirable for at least two reasons. First, manual intervention consumes a user's time and effort. Second, manual intervention prevents document conversion from being processed off-line, such as by batch processing. Batch processing is particularly important in instances where there are numerous documents to convert from one form to another.
Based on the foregoing, it is clearly desirable to provide an apparatus and method for efficiently and automatically detecting and converting tables in documents. In particular, it is desirable to provide a method for efficiently and automatically detecting tables in documents that are semi-structured (i.e., described by a page description language) and for converting these tables into a markup language.
The present invention provides a method for automatically detecting table data in a document that is described by a page definition language and converting the table data into a markup language representation while preserving the structure of the table. The document may have one or more pages. The page definition language of the document, which can be the Portable Document Format, provides a list of words, the start position of each word on the page with respect to a predetermined reference point located on that page, and the size of each word.
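The input described above (a list of words, each with a start position and a size) can be pictured with a minimal sketch. The `Word` record, the `group_into_lines` helper, and the `y_tolerance` value are hypothetical names and parameters chosen for illustration; they are not taken from the invention or from any PDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One word as reported by the page definition language (illustrative only)."""
    text: str
    x: float       # horizontal start position relative to the page reference point
    y: float       # vertical start position relative to the page reference point
    width: float   # size of the word along the line
    height: float  # size of the word across the line


def group_into_lines(words, y_tolerance=2.0):
    """Group words into lines by vertical position.

    A word joins the current line when its y position is within
    y_tolerance of the first word on that line (an assumed heuristic).
    Lines are returned in ascending y order, words in ascending x order.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.x) for line in lines]
```

Grouping words into lines first is a design choice of this sketch: the table-identifying features are described per line and between lines, so a line-oriented structure is a convenient starting point.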
According to the method, the present invention automatically identifies table data in the document by utilizing one or more table-identifying features. A first table-identifying feature may be the number of word clusters on a line. A second table-identifying feature may be the vertical alignment of word clusters between lines. A third table-identifying feature may be the changes in text density or space density between lines. In addition, the automatic identification technique of the present invention may also use one or more heading rows at the top of the table as a fourth table-identifying feature. Also, a fifth table-identifying feature can be the drawing lines that separate different data elements in the table.
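The first two table-identifying features can be illustrated with a short sketch. It assumes that each line is given as a list of `(x, width)` pairs sorted by `x`, that words separated by at most `max_gap` belong to the same cluster, and that alignment is judged on the left edges of clusters. The function names, the thresholds, and the choice of left-edge alignment are all assumptions made for illustration, not details stated in the text.

```python
def cluster_words(line, max_gap=8.0):
    """Merge neighbouring words into clusters.

    line: list of (x, width) pairs sorted by x.
    Returns a list of (start_x, end_x) spans, one per word cluster.
    """
    clusters = []
    for x, width in line:
        if clusters and x - clusters[-1][1] <= max_gap:
            # Small gap: extend the current cluster to cover this word.
            clusters[-1] = (clusters[-1][0], x + width)
        else:
            # Large gap: this word starts a new cluster.
            clusters.append((x, x + width))
    return clusters


def aligned_cluster_count(clusters_a, clusters_b, tolerance=3.0):
    """Count clusters in clusters_a whose left edge matches one in clusters_b."""
    return sum(
        any(abs(a[0] - b[0]) <= tolerance for b in clusters_b)
        for a in clusters_a
    )


def looks_tabular(line_a, line_b, min_clusters=2):
    """Heuristic: both lines have several clusters and those clusters line up."""
    ca, cb = cluster_words(line_a), cluster_words(line_b)
    return (
        len(ca) >= min_clusters
        and len(cb) >= min_clusters
        and aligned_cluster_count(ca, cb) >= min_clusters
    )
```

Under these assumptions, a line of running prose collapses into a single cluster, while a table row yields one cluster per column, which is why the cluster count and the alignment between lines are useful together.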
In the presently preferred embodiment, the alignment of word clusters between lines is utilized to automatically generate a table bounding box for each table. The table bounding box is then expanded in a first direction based on changes in word density among the lines immediately preceding the top edge of the table. For example, the table bounding box can be expanded in a first direction to encompass a previously marked line that had a significant change in text density.
The table bounding box is also expanded in a second direction based on changes in word density among the lines immediately succeeding the bottom edge of the table. For example, the table bounding box may be expanded in a second direction to encompass a previously marked line that had a significant change in text density.
This step expands the table bounding box in the positive and negative y-directions to more accurately reflect the true beginning and end of the table. The text that is encompassed by the expanded table bounding box is then converted to a markup language representation with table tags, thereby preserving the structure of the table.
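The expansion and conversion steps can be sketched as follows. The sketch works on line indices rather than page coordinates, marks a line when its text density differs from the previous line by at least `threshold`, and looks at most `window` lines beyond each edge of the table. These names and values, the definition of a "significant" change, and the use of HTML `<table>` tags as the markup language are assumptions for illustration only.

```python
from html import escape


def mark_density_changes(densities, threshold=0.3):
    """Return indices of lines whose text density changes significantly
    relative to the preceding line (threshold is an assumed value)."""
    return {
        i for i in range(1, len(densities))
        if abs(densities[i] - densities[i - 1]) >= threshold
    }


def expand_table_range(top, bottom, marked, line_count, window=2):
    """Expand the [top, bottom] line range to encompass marked lines that
    lie within `window` lines before the top or after the bottom edge."""
    # First direction: scan the lines immediately preceding the top edge.
    for i in range(max(0, top - window), top):
        if i in marked:
            top = i
            break
    # Second direction: scan the lines immediately succeeding the bottom edge.
    for i in range(min(line_count - 1, bottom + window), bottom, -1):
        if i in marked:
            bottom = i
            break
    return top, bottom


def to_html_table(rows):
    """Convert rows of cell strings to markup with table tags."""
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>\n{body}\n</table>"
```

In this sketch the initial range `(top, bottom)` would come from the alignment of word clusters, and the rows passed to `to_html_table` would be the word clusters of each line inside the expanded range.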