Documents are often defined as nothing more than a collection of primitive elements that are drawn on a page at defined locations. For example, a PDF (portable document format) file might have no definition of structure and instead is nothing more than instructions to draw glyphs, shapes, and bitmaps at various locations.
A user can view such a document on a standard monitor and deduce the structure. However, because such a file is only a collection of primitive elements, a document viewing application has no knowledge of the intended structure of the document. For example, a table is displayed as a series of lines and/or rectangles with text between the lines, which the human viewer recognizes as a table. However, the application displaying the document has no indication that the text groupings have relationships to each other based on the rows and columns because the document does not include such information. Similarly, the application has no indication of the flow of text through a page (e.g., the flow from one column to the next, or the flow around an embedded image), or various other important qualities that can be determined instantly by a human user.
This lack of knowledge about document structure will not always be a problem when a user is simply viewing the document on a standard monitor. However, it would often be of value to a reader to be able to access the file and edit it as though it were a document produced by a word processor, image-editing application, etc., that has structure and relationships between elements. Therefore, there is a need for methods that can reconstruct an unstructured document. Similarly, there is a need for methods that take advantage of such reconstructed document structure to idealize the display of the document (e.g., for small-screen devices where it is not realistic to display the entire document on the screen at once), or to enable intelligent selection of elements of the document.
In the modem world, more and more computing applications are moving to handheld devices (e.g., cell phones, media players, etc.). Accordingly, document reconstruction techniques must be viable on such devices, which generally have less computing power than a standard personal computer. However, document reconstruction often uses fairly computation and memory intensive procedures, such as cluster analysis, and the use of large chunks of memory. Therefore, there is further a need for techniques that allow for greater efficiency in document reconstruction generally, and cluster analysis specifically.