1. Field of the Invention
The present invention relates generally to a method for converting a document stored in one format to a different format. More specifically, a system and method are disclosed for converting digital data representing an image of a document stored in one format to other formats for manipulation and display.
2. Description of the Related Art
Automatic processing of digital data representing an image of a document using a digital computer to recognize, capture and/or store information contained in the document has been the subject of active research and commercial products. For example, U.S. Pat. No. 5,737,442 issued on Apr. 7, 1998 to H. Alam discloses a processor based method for recognizing, capturing and storing tabular data from digital computer data representing a document, the disclosure of which is incorporated herein by reference in its entirety.
However, much other image processing research and many image processing products have not focused on accurately, efficiently and automatically capturing the information contained in a document and converting the document to a different format, for example for display. Nor have such research and products focused on allowing the user to manually or otherwise reformat and/or revise the contents of the document. Further, such research and products have not focused on converting such information to a format that a user may easily manipulate in order to utilize all or a portion of the information contained in the document and/or to reformat the document as desired into a different layout. For example, it may be desirable for the user to manipulate the document by cutting, pasting and/or otherwise editing or revising the document, in order to reformat it and/or to fully or partially utilize the information it contains, such as for analysis and/or other uses.
What is needed are accurate and efficient systems and methods for converting a document stored in one format to a different format. Such systems and methods preferably convert digital data representing an image of a document stored in one format to other formats for manipulation and display, for example.
The present invention comprises a method for extracting data from digital data representing a document, such as a printed document or an Internet webpage. The method generally comprises locating words from the digital data of the document in the original or input format, joining the located words into lines, joining the lines into paragraphs, locating tables from the joined paragraphs, converting the paragraphs and tables to an intermediate format, and outputting the information into an output format. The input and output formats may be, for example, portable document format (PDF), rich text format (RTF), hypertext markup language (HTML) format with style sheets, tabular HTML, extensible markup language (XML), cascading style sheets (CSS), Netscape Layers, linked and separate pages, Tag Image File Format (TIFF) or any other image format such as graphics interchange format (GIF), bit map (BMP), or Joint Photographic Experts Group (JPEG), formats generated by text and/or image authoring tools or applications, or any other suitable formats.
A computer implemented method of converting a document in an input format to a document in a different output format is disclosed. The method generally comprises locating data in the input document, grouping data into one or more intermediate format blocks in an intermediate format document, and converting the intermediate format document to the output format document using the intermediate format blocks. Preferably, the grouping includes locating words in the input document, joining words satisfying a line threshold into lines, joining lines satisfying a paragraph threshold into paragraphs, and locating tables. The grouping may alternatively or further include locating tags (or control characters) in the input document and utilizing the tags in locating words, joining words into lines, joining lines into paragraphs, and locating tables. Each intermediate format block may be selected from a word, a line, a paragraph, a table, and an image.
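The threshold-based grouping described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The `Word` record, the use of a single baseline coordinate `y`, and the reading of both thresholds as maximum vertical distances are assumptions made only for illustration; the disclosure does not define the thresholds at this point.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical located word: text plus page coordinates."""
    text: str
    x: float  # left edge of the word
    y: float  # baseline position, increasing down the page


def join_words_into_lines(words: List[Word], line_threshold: float) -> List[List[Word]]:
    """Join words whose baselines lie within line_threshold of the line's first word."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0].y) <= line_threshold:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Put each line back into left-to-right reading order.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def join_lines_into_paragraphs(
    lines: List[List[Word]], paragraph_threshold: float
) -> List[List[List[Word]]]:
    """Join consecutive lines whose vertical gap is at most paragraph_threshold."""
    paragraphs: List[List[List[Word]]] = []
    previous_y = None
    for line in lines:
        line_y = min(w.y for w in line)
        if previous_y is not None and line_y - previous_y <= paragraph_threshold:
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
        previous_y = line_y
    return paragraphs


def paragraph_text(paragraph: List[List[Word]]) -> str:
    """Flatten one paragraph to plain text, one output line per joined line."""
    return "\n".join(" ".join(w.text for w in line) for line in paragraph)
```

Under these assumptions, six words placed on three baselines with a line threshold of 2 and a paragraph threshold of 15 would yield three lines, the first two of which are joined into one paragraph when their baselines are 12 units apart. Table location and tag-assisted grouping are outside the scope of this sketch.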
Each of the input format and the output format may be any of portable document format (PDF), rich text format (RTF), hypertext markup language (HTML), extensible markup language (XML), cascading style sheets (CSS), Netscape Layers, linked and separate pages, Tag Image File Format (TIFF), graphics interchange format (GIF), bit map (BMP), Joint Photographic Experts Group (JPEG), MICROSOFT WORD(trademark), WORD PERFECT(trademark), AUTOCAD(trademark), or POWER POINT(trademark).
In one embodiment, the input document is received over a network and the output document is sent over the network, for example via electronic mail. The network may be the Internet or an intranet. Headings of the input document may be located to generate a linked table of contents page containing the headings, each table of contents heading containing a link to the corresponding heading in the output document, and the table of contents page may be placed into the output document.
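As an illustration of the linked table of contents, the following Python sketch builds an HTML list whose entries link to anchored headings. It assumes an HTML output format and assumes the headings have already been located; the function name `build_linked_toc`, the `heading-N` anchor scheme, and the choice of `h2` elements are hypothetical and not taken from the disclosure.

```python
from html import escape
from typing import List, Tuple


def build_linked_toc(headings: List[str]) -> Tuple[str, List[str]]:
    """Return (table of contents HTML, anchored heading HTML) for located headings."""
    toc_items = []
    anchored_headings = []
    for index, heading in enumerate(headings, start=1):
        anchor = f"heading-{index}"  # hypothetical anchor naming scheme
        label = escape(heading)      # escape &, <, > and quotes for HTML
        toc_items.append(f'<li><a href="#{anchor}">{label}</a></li>')
        anchored_headings.append(f'<h2 id="{anchor}">{label}</h2>')
    toc_html = "<ul>\n" + "\n".join(toc_items) + "\n</ul>"
    return toc_html, anchored_headings
```

The returned table of contents markup would be placed into the output document, and each anchored heading would replace the corresponding plain heading so that the links resolve.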
In another embodiment, a computer executable program, such as a JAVA(trademark) script, may be generated for selecting one output format for display, the program being inserted into the output document.
The methods of the present invention may be implemented by computer codes stored on a computer readable medium such as a CD-ROM, zip disk, floppy disk, tape, flash memory, system memory, or hard drive, or carried by a data signal embodied in a carrier wave.
The output document, for example, may be displayed by locating sub-page breaks in the document, subdividing the document into sub-pages using the sub-page breaks, locating blocks within each sub-page, and sequentially displaying all or a portion of each block of the sub-pages within display parameters of a display configuration. Tables may be divided to be displayed in more than one display page. A linked table of contents and/or a linked index may also be generated.
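The sequential display of blocks within display parameters can be sketched as follows. This is only an illustrative Python example under simplifying assumptions: each block is reduced to a hypothetical `(label, height in rows)` pair, the only display parameter is a fixed number of rows per display page, and any block (including a table) that does not fit is continued on the next display page. Locating the sub-page breaks and the blocks themselves is not shown.

```python
from typing import List, Tuple

Block = Tuple[str, int]  # hypothetical block: (label, height in display rows)


def paginate(blocks: List[Block], rows_per_page: int) -> List[List[Block]]:
    """Place blocks sequentially on display pages, splitting blocks that do not fit."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    pages: List[List[Block]] = [[]]
    free_rows = rows_per_page
    for label, height in blocks:
        remaining = height
        while remaining > 0:
            if free_rows == 0:
                # Current display page is full; start a new one.
                pages.append([])
                free_rows = rows_per_page
            shown = min(remaining, free_rows)
            pages[-1].append((label, shown))  # the portion of the block on this page
            free_rows -= shown
            remaining -= shown
    return pages
```

For example, a three-row paragraph followed by a six-row table on a four-row display would occupy three display pages, with the table divided across all three.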
In another embodiment, the converter may be incorporated in a computer program product for maintaining a repository of input documents in one or more storage formats. A table of contents and/or an index may also be generated.