1. Field of the Invention
The present invention is concerned with processing multimedia data files to provide information supporting user navigation of multimedia data file content.
2. Background of the Invention
The demand for hypermedia applications has increased with the growing popularity of the World Wide Web. As a result, a need for an effective and automatic method of creating hypermedia has arisen. However, the creation of hypermedia can be a laborious, manually intensive job. In particular, hypermedia creation can be difficult when referencing content in documents including images and/or other media.
In many cases, hypermedia authors need to locate Anchorable Information Units (AIUs), or hotspots, which are areas or keywords of particular significance, and create appropriate hyperlinks to relevant information. In an electronic document, a user can retrieve associated information by selecting these hotspots; the system interprets the associated hyperlinks and fetches the corresponding relevant information.
Previous research in this field has taken scanned bitmap images as the input to a document analysis system. The classification performed by such a system is often guided by a priori knowledge of the document's class. Little work has been done on using PostScript files as a starting point for document analysis. Certainly, if a PostScript file is designed for maximum raster efficiency, it can be a daunting task even to reconstruct the reading order of the document. Previous researchers may have assumed that a well-structured source text will always be available to match the PostScript output, and therefore that working bottom-up from PostScript would seldom be needed. However, PDF documents can be generated in a variety of ways, including an Optical Character Recognition (OCR) based route directly from a bit-mapped page. The extra structure in PDF, over and above that in PostScript, can be utilized towards the goal of document understanding.
Previous work has proposed methods for understanding raster images. Because this task is by definition an inverse problem, it cannot be accomplished without making broad assumptions. Applying these methods directly to PDF documents would make little sense, as they are not designed to make use of the underlying structure of PDF files and would thus produce undesirable results.
In contrast to geometric layout analysis, logical layout analysis has received very little attention. Some methods of logical layout analysis perform region identification or classification in a derived geometric layout. However, these approaches are primarily rule-based, and thus the final outcome depends on the reliability of the prior information and on how well that information is represented within the rules.
Systems such as Acrobat do not have the ability to process images. Rather, Acrobat runs the whole document through an OCR system. Clearly, OCR is not able to extract objects, and even in the case of understanding text the output can be unreliable, as a general-purpose OCR system can be error-prone when used to understand scanned-in images directly.
Therefore, a need exists for a method of analyzing and extracting text from PDF documents created using various means.