Conventional approaches to text-to-speech translation of electronic documents typically involve first transforming the data from one of a wide variety of document formats (e.g., picture elements (pixels) of scanned documents, or proprietary formats such as those used by MS Word™ or WordPerfect™) into a more universal format, such as ASCII text, by using optical character recognition (OCR) algorithms. The translated data block is then presented to a speech creation mechanism.
While such techniques work well for contiguous blocks of text, the presence of tables within such documents typically results in an indecipherable block of data for each table region, and effectively renders a text-to-speech system useless. Further, since the data contained in the table region cannot be identified, queries for extracting any information contained in those tables cannot be answered. Thus, special algorithms for automatically identifying and translating table regions in electronic documents have been promulgated.
Such algorithms have traditionally depended either on the detection of ruled border lines or on an analysis of organized patterns of blank spaces (that is, the columnization of data) between the text characters that represent cells of the table. Once a table is delineated and the text cells are defined, the information contained in the cells can be made available to electronic queries and imported into database processing applications.
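The blank-space analysis described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system; it assumes plain, fixed-pitch text lines and treats any run of character positions that is blank in every non-empty line as a column separator. The function names `find_column_gaps` and `split_cells` and the `min_gap` parameter are hypothetical and are introduced here only for illustration.

```python
def find_column_gaps(lines, min_gap=2):
    """Return (start, end) character spans that are blank in every non-empty line.

    Assumption: fixed-pitch text, so a character index corresponds to a
    horizontal position. Runs narrower than min_gap are ignored.
    """
    rows = [ln for ln in lines if ln.strip()]
    if not rows:
        return []
    width = max(len(ln) for ln in rows)
    padded = [ln.ljust(width) for ln in rows]
    # blank[i] is True when position i holds a space in every row
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps = []
    start = None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            # A run touching the left margin is indentation, not a separator
            if start > 0 and i - start >= min_gap:
                gaps.append((start, i))
            start = None
    # A blank run reaching the right margin is trailing padding and is ignored
    return gaps


def split_cells(lines, gaps):
    """Cut each non-empty line at the detected gaps and strip each cell."""
    rows = [ln for ln in lines if ln.strip()]
    width = max((len(ln) for ln in rows), default=0)
    starts = [0] + [end for _, end in gaps]
    ends = [start for start, _ in gaps] + [width]
    return [
        [ln.ljust(width)[s:e].strip() for s, e in zip(starts, ends)]
        for ln in rows
    ]


if __name__ == "__main__":
    sample = [
        "Name   Qty  Price",
        "Apple  3    0.50",
        "Pear   12   1.25",
    ]
    gaps = find_column_gaps(sample)
    for row in split_cells(sample, gaps):
        print(row)
```

A sketch of this kind works only when every row leaves the same character positions blank. A single multi-line cell or a skewed column that intrudes into the gap removes the separator entirely.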
Although most of the work in this field is related to extracting information contained in scanned binary images, the problem of table detection in text files has also been addressed. The problem with conventional approaches is that they tend to address only narrow issues related to the characteristics of a particular application, and a universal method for detecting and using tables across all applications has not heretofore been available. The reason for the difficulty in defining a single solution algorithm is that applications may or may not contain: tables; border lines; a fixed number of blank spaces between columns; multi-line rows; multi-line column headers; or, where the data is skewed, a clearly vertical column definition.
U.S. Pat. No. 5,737,442, to Alam, discloses an algorithm that uses the character/space content of a line or group of lines to identify columnization of characters along white-space “plumb lines,” and that subsequently uses “white-space vector intersections,” which can be processed against maximum/minimum criteria, to identify the table structure. Text areas are grouped into rectangles, and the plumb lines are created so as to be centered on the white space between these rectangles. The principal disadvantage of such an approach is that its dependency on orthogonal white spaces and row separator lines can prevent the reformulation of the table when such white spaces are irregular, missing, or narrower than a minimum “acceptance criteria.”
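The idea of centering a plumb line on the white space between text rectangles, and the way a minimum acceptance criterion can discard a genuine column boundary, can be sketched as follows. This is a simplified illustration only, not the algorithm of the cited patent: it assumes each text rectangle is reduced to its horizontal extent `(x_left, x_right)`, and the function name `plumb_lines` and the parameter `min_width` are hypothetical.

```python
def plumb_lines(rects, min_width=1.0):
    """Return x positions centered in the white space between text rectangles.

    rects: iterable of (x_left, x_right) horizontal extents (an assumption;
    real rectangles would also carry vertical extents).
    min_width: hypothetical acceptance criterion; narrower gaps are rejected.
    """
    # Merge overlapping extents so that only true white space remains between them
    merged = []
    for left, right in sorted(rects):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])

    lines = []
    for (_, prev_right), (next_left, _) in zip(merged, merged[1:]):
        gap = next_left - prev_right
        if gap >= min_width:
            # Center the plumb line on the white space
            lines.append((prev_right + next_left) / 2)
    return lines


if __name__ == "__main__":
    columns = [(0, 40), (50, 90), (92, 130)]
    print(plumb_lines(columns, min_width=1))  # both gaps accepted
    print(plumb_lines(columns, min_width=5))  # the 2-unit gap is rejected
```

With the stricter threshold, the second and third columns are no longer separated, which is the failure mode noted above for white spaces that fall below the acceptance criterion.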