The amount of information available via computers has increased dramatically with the widespread proliferation of computer networks, the Internet, and digital storage means. With this increase in the amount of information has come the need to transmit information quickly and to store it efficiently.
Textual documents are often kept or transmitted in bitmap form (e.g., faxes, scanned paper, or other bitmap images). In such form, the text cannot be easily extracted, searched, re-flowed, cut and pasted, re-purposed, or compressed, because it is not known a priori which pixels should be interpreted as text, background, image, or just noise.
There is, therefore, a need to recover structure, such as words, lines, paragraphs, and blocks, from the pixel representation of a document. The indiscriminate pixels can be referred to as “dead bits”, and the task is to recover the “live” structure of the document. The recovered structure can be used to locate good candidates for textual characters and, thus, can facilitate optical character recognition (OCR). The textual structure can also help text selection for features such as “cut & paste”. Similarly, the textual structure can give hints about how to insert and re-flow text (e.g., re-flow only the current paragraph while moving the following paragraphs as a whole). Finally, the structure can greatly help compression by predicting the position of characters with respect to their current line or block.
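As a rough illustration of the idea of recovering line structure and then describing character positions relative to their line, the following minimal sketch groups character bounding boxes into lines by vertical overlap and computes each character's offset from an estimated baseline. The box format, the function names `group_into_lines` and `baseline_offsets`, and the greedy overlap heuristic are all assumptions made for illustration; they are not taken from this document and do not represent any particular method described in it.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels, y growing downward.
Box = Tuple[int, int, int, int]


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Greedily group boxes whose vertical extent overlaps a line's running extent."""
    lines = []  # each entry is [top, bottom, list_of_boxes]
    for box in sorted(boxes, key=lambda b: b[1]):  # process from top to bottom
        _, top, _, bottom = box
        for line in lines:
            overlap = min(bottom, line[1]) - max(top, line[0])
            if overlap > 0:
                # Extend the line's vertical extent and attach the box to it.
                line[0] = min(line[0], top)
                line[1] = max(line[1], bottom)
                line[2].append(box)
                break
        else:
            # No existing line overlaps this box, so start a new line.
            lines.append([top, bottom, [box]])
    # Return the boxes of each line in left-to-right reading order.
    return [sorted(line[2], key=lambda b: b[0]) for line in lines]


def baseline_offsets(line: List[Box]) -> List[int]:
    """Offset of each box's bottom edge from the line's median bottom edge."""
    bottoms = sorted(b[3] for b in line)
    baseline = bottoms[len(bottoms) // 2]  # simple baseline estimate (assumption)
    return [b[3] - baseline for b in line]


if __name__ == "__main__":
    boxes = [
        (0, 10, 8, 20), (10, 12, 18, 20), (20, 10, 28, 24),  # first text line
        (0, 40, 8, 50), (10, 41, 18, 50),                    # second text line
    ]
    for line in group_into_lines(boxes):
        print(line, baseline_offsets(line))
```

Under these assumptions, most offsets within a line are zero or small, which is the kind of regularity a compressor could exploit instead of storing absolute positions for every character.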
Additionally, data compression of digital documents should take into account the intended purpose or ultimate use of a document. Some digital documents are employed for filing or for providing hard copies. Other documents may be revised and/or edited. Many conventional data compression methodologies fail to handle re-flowing of text and/or images when viewed, and fail to provide efficient and effective means to enable compression technology to recognize characters and re-flow them to word processors, personal digital assistants (PDAs), cellular phones, and the like. Therefore, if hard copy office documents are scanned into digital form, current compression technology can make it difficult, if not impossible, to update, amend, or in general change the digitized document.