As online document management systems become more prevalent, more users are storing their documents online. To facilitate browsing and/or searching of these documents, it is desirable to index these documents. For an electronic document (e.g., a document that is created by a word processing program on a computer system), indexing text and/or pictures is straightforward because information about text objects and pictures may be obtained directly from the document structure of the electronic document. However, a scanned document (e.g., a document that is converted into electronic form using a scanner) is an image of the original document that does not include information about the text objects and pictures. Thus, to index a scanned document, text objects and/or pictures must first be identified.
One technique for identifying text objects and/or pictures is to use an optical character recognition (OCR) technique. OCR techniques typically segment a scanned document into zones containing text and non-text objects. However, OCR techniques are not designed to identify pictures, and the non-text zones often do not correspond to pictures. For example, non-text zones may include decorative graphics (e.g., lines) and other symbols that are not pictures. As a more complicated example, a block diagram typically contains sub-regions of text, which makes it difficult for the OCR technique to properly identify the block diagram as a picture object.
Geometry-based techniques for identifying pictures in a scanned document may also be used. For example, geometric features of connected components in the scanned document may be used. Morphology or layout analysis may also be used to identify pictures in the scanned document. However, these techniques may not properly identify pictures. For example, two pictures that are located close to each other in the scanned document may be interpreted as a single picture object. Similarly, a single picture object that includes whitespace in the center may be interpreted as two separate pictures.
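The ambiguity described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: the function names (count_components, dilate, make_page), the 5x10 binary page, and the one-pixel dilation radius are all hypothetical choices made for illustration. The sketch counts 4-connected foreground regions before and after a simple morphological dilation.

```python
from collections import deque


def count_components(grid):
    """Count 4-connected groups of foreground (1) pixels in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def dilate(grid, radius):
    """Grow every foreground pixel into a square of the given radius."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                for ny in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for nx in range(max(0, c - radius), min(cols, c + radius + 1)):
                        out[ny][nx] = 1
    return out


def make_page():
    """Hypothetical page: two 3x3 dark blocks separated by a 2-pixel white gap."""
    page = [[0] * 10 for _ in range(5)]
    for r in range(1, 4):
        for c in list(range(1, 4)) + list(range(6, 9)):
            page[r][c] = 1
    return page


page = make_page()
print(count_components(page))             # regions found on the raw page
print(count_components(dilate(page, 1)))  # regions found after dilation
```

On this toy page the raw count is two regions and the dilated count is one. If the two blocks are separate pictures, the dilated result wrongly merges them; if they are halves of one picture with whitespace in the center, the raw result wrongly splits it. The geometry alone does not indicate which reading is correct.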
Another technique identifies pictures in images of presentation slides by using an OCR technique and Hough transforms. Morphological clustering can also be used to identify regions of interest. However, this technique uses a sequence of pages of the presentation slides to identify the background. Thus, it cannot be used on individual pages.
Hence, a system and a method for identifying pictures in a document without the aforementioned problems are highly desirable.