Document processing and understanding are important for a variety of applications, such as office automation, creation of electronic manuals, and online documentation and annotation. One of the most commonly used document formats, on the WWW (World Wide Web) and elsewhere, is the well-known PDF (Portable Document Format) standard. In fact, a large number of legacy documents are now available “online” because scanning devices make it possible to generate electronic copies (e.g., bitmap images) of such documents.
For instance, the Acrobat suite of applications from Adobe allows a user to capture a document and generate a PDF file of the document. The user can then open the PDF file with, for example, the Acrobat viewer and see the document in its original format and appearance. The Acrobat application includes a toolkit that allows a user to scan in legacy documents, or otherwise capture documents created with various desktop publishing products. This enables a user to make such documents available “online” as PDF files.
Electronic documents such as scanned legacy forms, however, are typically stored in formats (e.g., bitmap representations, GIF, TIFF, etc.) that do not include important structure or format information. Unless the structure/format information is extracted and saved for the electronic document, the file can be unusable for various applications. Further, electronic files such as bitmap images can be extremely large, which can cause problems with respect to storage and transmission bandwidth when, for example, such files are used in a networked environment.
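By way of illustration only, the following sketch estimates the uncompressed size of a single scanned page. The page dimensions (8.5 by 11 inches), the scanning resolution (300 dots per inch), and the function name `bitmap_size_bytes` are assumptions chosen for this example and are not taken from any particular scanning system.

```python
def bitmap_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed raster size, in bytes, of one scanned page.

    All parameters are illustrative assumptions: page size in inches,
    scanning resolution in dots per inch, and pixel depth in bits.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Hypothetical letter-size page scanned at 300 dpi.
bilevel = bitmap_size_bytes(8.5, 11, 300, 1)   # 1 bit per pixel (black and white)
color = bitmap_size_bytes(8.5, 11, 300, 24)    # 24 bits per pixel (RGB)

print(f"bilevel page: {bilevel:,} bytes")  # 1,051,875 bytes
print(f"color page:   {color:,} bytes")    # 25,245,000 bytes
```

Under these assumptions, a single bilevel page occupies roughly 1 MB and a single 24-bit color page roughly 25 MB before compression, so a multi-page legacy document can readily reach hundreds of megabytes.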
Methods have been proposed for generating formatting information for electronic documents. For instance, one method proposed by Pavlidis et al. in “Page Segmentation and Classification,” Computer Vision, Graphics and Image Processing, 54:375-390, 1991, includes analyzing scanned bitmap images to classify the document using a priori knowledge associated with the document's class. It is noteworthy that, to date, virtually no research has been performed on using PostScript as a starting point for document analysis. Certainly, if a PostScript file is designed for maximum raster efficiency, it can be a daunting task even to reconstruct the reading order of the document. Previous researchers may have assumed that a well-structured source text would always be available to match the PostScript output and, consequently, that working bottom-up from PostScript would seldom be necessary. However, PDF documents, for example, can be generated in a variety of ways, including by applying OCR (optical character recognition) to bit-mapped pages. It should be appreciated that the additional structure in PDF, over and above that in PostScript, can be utilized toward the goal of document understanding. As explained below, the present invention utilizes knowledge of PDF structure to provide efficient methods for extracting relevant form information from PDF files.
Many conventional methods for generating formatting information relate to understanding raster images. However, because this is by definition an inverse problem, such a task cannot be performed completely without making broad assumptions (see Kasturi, et al., “A System for Interpretation of Line Drawings,” IEEE Transactions on Pattern Analysis and Machine Intelligence). Direct application of such methods to PDF documents would make little sense, because such methods are not designed to make the best use of the underlying structure of PDF files and would thus produce unacceptable results.
In contrast to conventional methods based on geometric layout analysis, conventional methods based on logical layout analysis have received very little attention. Some methods that use logical layout analysis perform region identification or classification on the derived geometric layout. These approaches, however, are primarily rule based (see, e.g., Krishnamoorthy, et al., “Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 15:743-747, 1993), and consequently, the final outcome depends on the reliability of the prior information and on how well that information is represented within the rules.