The present invention relates generally to document processing. More particularly, the invention relates to the extraction of a user-enclosed portion of text or non-text regions from bitmap images.
People like to scribble marks and notes on documents while reading. For example, when a person is reading a book or a magazine, he or she might draw circles with a pen over the parts that are of particular interest. When the person underlines a few lines, circles a paragraph, or writes notes on a page, these notes and marks often convey important cues to the content of the document and may serve as keys or references for communications with other people. As more and more paper documents are now being converted and archived in electronic media, it would be useful if these underlines, highlights, circles, and handwritten or Post-It notes on a paper document could be automatically identified and located, and their associated contents extracted and preserved in a document management system.
The present invention describes a technique for locating and extracting one type of user-drawn mark from a scanned document: the user-enclosed region. The invention is based on bi-connected component analysis from graph theory. The invention first represents the content of the input image as run-length segments. The invention then constructs line adjacency graphs from the segments. Finally, the invention detects user-enclosed circles as bi-connected components of the line adjacency graphs.
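The first two steps can be illustrated with a minimal Python sketch. It assumes the bitmap is a list of rows of 0/1 values, and that two runs are adjacent when they lie on consecutive rows and share at least one column; the text does not state the exact adjacency rule. The names `row_runs` and `build_lag` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A run is (row, first_column, last_column), with last_column inclusive.
Run = Tuple[int, int, int]


def row_runs(bitmap: List[List[int]]) -> List[Run]:
    """Encode each row of a binary bitmap as horizontal runs of black pixels."""
    runs: List[Run] = []
    for y, row in enumerate(bitmap):
        x = 0
        width = len(row)
        while x < width:
            if row[x]:
                start = x
                while x < width and row[x]:
                    x += 1
                runs.append((y, start, x - 1))
            else:
                x += 1
    return runs


def build_lag(runs: List[Run]) -> Dict[int, Set[int]]:
    """Build a line adjacency graph: one node per run, one edge per pair of
    runs on consecutive rows whose column ranges overlap (assumed rule)."""
    by_row: Dict[int, List[int]] = {}
    for i, (y, _, _) in enumerate(runs):
        by_row.setdefault(y, []).append(i)
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(runs))}
    for i, (y, s, e) in enumerate(runs):
        for j in by_row.get(y + 1, []):
            _, s2, e2 = runs[j]
            if s2 <= e and s <= e2:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

For a small ring-shaped mark, the four runs form a cycle in the graph, which is the structure the later bi-connected component analysis looks for.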
The present invention is useful in applications such as electronic filing systems and storage management for document databases. Currently, the burden of cutting and pasting a selected region from a page (for example, an article from a newspaper) for archival is on the user, even though a user is sometimes interested only in specific portions of a page. The circled region extraction technique offers a means for a user to simply mark the regions of interest and let the imaging process identify those regions and save them alone. Alternatively, different compression strategies may be applied to user-enclosed regions to preserve the quality of the image in these regions.
The present invention analyzes the image of the scanned document in its native bitmap format using a connected component module. The invention is writing-system independent: it is capable of extracting user-enclosed regions from document images without regard to what character set, alphabet, or even font style has been used. The connected component module stores the components of the image that are connected. The connected component data is stored in a data structure in the form of a line adjacency graph to expedite further processing of the connected component data.
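As a rough illustration of how connected components might be read off a line adjacency graph, the sketch below groups graph nodes with a breadth-first traversal. The adjacency-dictionary representation (node id mapped to a set of neighbor ids) and the function name `connected_components` are assumptions made for the example, not details taken from the text.

```python
from collections import deque
from typing import Dict, List, Set


def connected_components(adj: Dict[int, Set[int]]) -> List[List[int]]:
    """Group the nodes of a line adjacency graph into connected components."""
    seen: Set[int] = set()
    components: List[List[int]] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members: List[int] = []
        while queue:
            u = queue.popleft()
            members.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(members))
    return components
```

Because each node stands for a whole run of pixels rather than a single pixel, the traversal touches far fewer elements than a pixel-level labeling would.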
The connected component data is then analyzed by a graph traversal module, which extracts geometric properties of each connected component and stores them in a data structure. Only the geometric features needed for further analysis by the invention are extracted.
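The text does not list which geometric properties are extracted. As a hedged example, the sketch below assumes a bounding box and a black-pixel count, computed from the runs belonging to one component; `ComponentGeometry` and `component_geometry` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A run is (row, first_column, last_column), with last_column inclusive.
Run = Tuple[int, int, int]


@dataclass
class ComponentGeometry:
    top: int
    left: int
    bottom: int
    right: int
    black_pixels: int

    @property
    def width(self) -> int:
        return self.right - self.left + 1

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def component_geometry(runs: List[Run]) -> ComponentGeometry:
    """Compute the bounding box and black-pixel count of one component.

    `runs` must be the non-empty list of runs that make up the component.
    """
    return ComponentGeometry(
        top=min(y for y, _, _ in runs),
        left=min(s for _, s, _ in runs),
        bottom=max(y for y, _, _ in runs),
        right=max(e for _, _, e in runs),
        black_pixels=sum(e - s + 1 for _, s, e in runs),
    )
```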
The invention then separates the largest bi-connected component from the user-enclosed regions of the document image by utilizing a bi-connected component module. The bi-connected component module detects any enclosure regardless of shape. Furthermore, the user-drawn enclosure can cross lines of text or graphics on the document and still be recognized as a bi-connected component. The bi-connected component module utilizes a depth-first search, which allows the largest bi-connected component to be detected efficiently.
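One well-known way to find bi-connected components with a depth-first search is the Hopcroft-Tarjan edge-stack method. The sketch below uses it as a stand-in; the text does not confirm that this exact variant is the one used. The graph is assumed to be a simple undirected graph given as an adjacency dictionary, the function names are hypothetical, and the recursive form is only suitable for small graphs (a large image would call for an iterative traversal).

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple


def biconnected_components(adj: Dict[Hashable, Set[Hashable]]) -> List[Set[Hashable]]:
    """Return the vertex sets of all bi-connected components (DFS based)."""
    index: Dict[Hashable, int] = {}   # discovery order of each vertex
    low: Dict[Hashable, int] = {}     # lowest discovery index reachable
    edge_stack: List[Tuple[Hashable, Hashable]] = []
    components: List[Set[Hashable]] = []
    counter = 0

    def dfs(u: Hashable, parent: Optional[Hashable]) -> None:
        nonlocal counter
        index[u] = low[u] = counter
        counter += 1
        for v in adj[u]:
            if v not in index:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:
                    # u separates the subtree of v: pop one component.
                    comp: Set[Hashable] = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif v != parent and index[v] < index[u]:
                # Back edge to an ancestor.
                edge_stack.append((u, v))
                low[u] = min(low[u], index[v])

    for u in adj:
        if u not in index:
            dfs(u, None)
    return components


def largest_enclosure_candidate(adj: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
    """Pick the largest component that contains a cycle (three or more
    vertices); a two-vertex component is a single bridge edge."""
    cyclic = [c for c in biconnected_components(adj) if len(c) >= 3]
    return max(cyclic, key=len) if cyclic else set()
```

A closed pen stroke produces a cycle in the graph, so it survives as a component of three or more vertices, while strokes that merely touch it hang off as separate components.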
Following the bi-connected component module, a detection analysis filter further refines the extraction process by qualifying each user-enclosed candidate. These additional heuristics eliminate from the possible selection of user-enclosed regions those bi-connected areas that fall below a minimum size or that are photographic images.
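A filter of this kind might look like the following sketch. The text does not say how a photographic image is recognized, so the black-pixel density test used here is purely an assumption, as are the function name and both threshold values.

```python
def is_plausible_enclosure(
    width: int,
    height: int,
    black_pixels: int,
    min_size: int = 20,        # hypothetical minimum side length, in pixels
    max_density: float = 0.3,  # hypothetical cutoff for "photo-like" regions
) -> bool:
    """Qualify a candidate using its bounding box and black-pixel count.

    Assumption: a pen-drawn enclosure is a thin stroke, so black pixels fill
    only a small fraction of its bounding box, whereas a photographic region
    is densely filled.
    """
    if width < min_size or height < min_size:
        return False
    density = black_pixels / (width * height)
    return density <= max_density
```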
After the user-enclosed regions have been selected, the extraction module separates the bitmap portion that lies within each user-enclosed region and stores that extracted portion of the document image in a storage medium for future reference and manipulation by the user. The present invention thus enables the user to save a large amount of disk storage space by extracting only the portions of the document image that the user is interested in.
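In its simplest form, the extraction step could be approximated by cropping the bitmap to the bounding box of the enclosure, as sketched below. This is a simplification: the text speaks of the portion lying within the enclosure and does not specify how the interior is delimited. The name `crop_region` is hypothetical.

```python
from typing import List


def crop_region(
    bitmap: List[List[int]], top: int, left: int, bottom: int, right: int
) -> List[List[int]]:
    """Copy the sub-bitmap inside an inclusive bounding box."""
    return [row[left:right + 1] for row in bitmap[top:bottom + 1]]
```

Only the cropped rows need to be written to storage, which is where the space saving comes from.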
For a more complete understanding of the invention, its objects and advantages, reference may be made to the following specification and to the accompanying drawings.