The present invention relates generally to computerized information access. More particularly, the invention relates to a computerized system for extracting title text or photographs (including captions) or other text or nontext regions from bitmap images, such as from scanned documents. The extracted title text or caption text may be used in a number of ways, including keyword searching or indexing of bitmap image databases, while the extracted photographs may be used for graphical browsing.
The world is rapidly becoming an information society. Digital technology has enabled the creation of vast databases containing a wealth of information. The recent explosion in popularity of image-based systems is expected to lead to the creation of very large databases that will present serious access challenges. In this regard, the explosion in popularity of the World Wide Web is but one example of how information technology is rapidly evolving towards an image-based paradigm.
Image-based systems present a major challenge to information retrieval. Whereas information retrieval technology is fairly well advanced in coded character-based systems, these retrieval techniques do not work in image-based systems. That is because image-based systems store information as bitmap data that correspond to the appearance of the printed page and not the information content of that page. Traditional techniques require the conversion of bitmap data into text data, through optical character recognition (OCR) software, before information retrieval systems can go to work.
Unfortunately, optical character recognition software is computationally expensive, and the recognition process is rather slow. Also, photographs without text typically cannot be meaningfully processed with OCR technology. When dealing with large quantities of image-based data, it is not practical to perform optical character recognition on the entire database. Furthermore, even where time and computational resources permit the wholesale OCR conversion of image data into text data, the result is still a large, unstructured database, without a short list of useful keywords that might allow a document of interest to be retrieved and reviewed. Searching through the entire database for selected keywords may not be the optimal answer, as full text keyword searches often generate far too many hits to be useful.
The present invention takes a fresh approach to the problem. The invention recognizes that there will be vast amounts of data that are in bitmap or image format, and that users will want to search this information, just as they now search text-based systems. Instead of converting the entire document from image format to text format, the present invention analyzes the bitmap data in its native format, to extract regions within the image data that correspond to the most likely candidates for document titles, captions or other identifiers, or to extract regions that correspond to photographs. The system extracts these document titles, captions or other identifiers and photographs from the bitmap image data, allowing the extracted regions to be further manipulated in a variety of ways. The extracted titles, captions or photographs can be displayed serially in a list that the user can access to select a document of interest. If desired, the extracted titles or captions can be converted through optical character recognition into text data that then can be further accessed or manipulated using coded character-based information retrieval systems.
Alternatively, even if the entire page is converted using optical character recognition, it may still be useful to locate various titles and other text or nontext regions using the scanned image. The invention will perform this function as well.
The invention is multilingual. Thus it can extract titles or captions from bitmap data, such as scanned documents, written in a variety of different languages. The title extraction technology of the invention is also writing-system-independent. It is capable of extracting titles from document images without regard to what character set or alphabet or even font style has been used.
Moreover, the system does not require any prior knowledge about the orientation of the text. It is able to cope with document layouts that have mixed orientations, including both vertical orientation and horizontal orientation. The invention is based on certain reasonable "rules" that hold for many, if not all, languages. These rules account for the observation that title text or caption text is usually printed in a way that distinguishes it from other text (e.g., bigger font, bold face, centered at the top of a column). These rules also account for the observation that intercharacter spacing on a text line is generally smaller than interline spacing and that text lines are typically either horizontal or vertical.
The invention extracts titles, captions and photographs from document images using document analysis and computational geometry techniques. The image is stored in a bitmap buffer that is then analyzed using connected-component analysis to extract certain geometric data related to the connected components or blobs of ink that appear on the image page. This geometric data or connected component data is stored in a data structure that is then analyzed by a classification process that labels or sorts the data based on whether each connected component has the geometric properties of a character, or the geometric properties of a portion of an image, such as a bitmap rendition of a photograph.
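By way of illustration only, the connected-component analysis and classification described above might be sketched as follows. The sketch assumes a bitmap held as a list of rows of 0/1 pixels, 8-connectivity between ink pixels, and a purely hypothetical size threshold (`max_char_size`) for separating character-like components from image-like ones; none of these particulars is taken from the invention itself.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Component:
    """Geometric data for one connected component (blob of ink)."""
    left: int
    top: int
    right: int    # inclusive
    bottom: int   # inclusive
    pixels: int = 0

    @property
    def width(self):
        return self.right - self.left + 1

    @property
    def height(self):
        return self.bottom - self.top + 1


def connected_components(bitmap):
    """Return one Component per 8-connected group of 1-pixels."""
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            comp = Component(left=c, top=r, right=c, bottom=r)
            while queue:
                y, x = queue.popleft()
                comp.pixels += 1
                # Grow the bounding box to cover this pixel.
                comp.left, comp.right = min(comp.left, x), max(comp.right, x)
                comp.top, comp.bottom = min(comp.top, y), max(comp.bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(comp)
    return components


def classify(comp, max_char_size=4):
    """Label a component 'text' or 'image' by size (hypothetical rule)."""
    if comp.width <= max_char_size and comp.height <= max_char_size:
        return "text"
    return "image"
```

A practical classifier would likely weigh further geometric properties, such as aspect ratio or ink density; the single size test here merely shows where such a decision fits in the data flow.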
Following classification, for text components the system then invokes a nearest-neighbor analysis of the connected component data to generate nearest-neighbor graphs. These are stored in a nearest-neighbor graphs data structure that represents a list of linked lists corresponding to the nearest neighbors of each connected component. The nearest-neighbor graphs define bounding boxes around those connected components of data that correspond to, for example, a line of text in a caption. The nearest-neighbor graphs are then classified as horizontal or vertical, depending on whether the links joining the bounding box centers of nearest neighbors are predominately horizontal or vertical.
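The nearest-neighbor grouping and orientation test just described can be sketched, under simplifying assumptions, as shown below. Here each bounding box is an assumed `(left, top, right, bottom)` tuple, each component is linked only to its single nearest neighbor by center distance, linked components are grouped with a union-find structure (standing in for the list of linked lists), and `max_distance` is a hypothetical cutoff that keeps isolated components from joining distant strings.

```python
from math import hypot


def center(box):
    """Center point of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def build_graphs(boxes, max_distance=float("inf")):
    """Link each box to its nearest neighbor and group linked boxes."""
    centers = [center(b) for b in boxes]
    links = []
    for i, (cx, cy) in enumerate(centers):
        best = None
        for j, (ox, oy) in enumerate(centers):
            if i == j:
                continue
            d = hypot(ox - cx, oy - cy)
            if best is None or d < best[0]:
                best = (d, j)
        if best is not None and best[0] <= max_distance:
            links.append((i, best[1]))

    # Union-find: every link merges the two groups it touches.
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in links:
        parent[find(i)] = find(j)

    graphs = {}
    for i in range(len(boxes)):
        graphs.setdefault(find(i), {"members": [], "links": []})
        graphs[find(i)]["members"].append(i)
    for i, j in links:
        graphs[find(i)]["links"].append((i, j))
    return list(graphs.values())


def orientation(graph, boxes):
    """'horizontal' if most links run left-right, else 'vertical'.

    A graph with no links defaults to 'horizontal' in this sketch.
    """
    horizontal = 0
    for i, j in graph["links"]:
        (x1, y1), (x2, y2) = center(boxes[i]), center(boxes[j])
        if abs(x2 - x1) >= abs(y2 - y1):
            horizontal += 1
    vertical = len(graph["links"]) - horizontal
    return "horizontal" if horizontal >= vertical else "vertical"
```

The quadratic neighbor search is adequate for a sketch; a spatial index would be a natural substitute on pages with many components.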
Next, a filter module analyzes the data to determine the average font height of all horizontal data, and a separate average font height for all vertical data. Each string of horizontal data is then compared with the horizontal average, and each string of vertical data with the vertical average, to select those strings that are above the average height or those strings whose height exceeds a predetermined threshold. These are selected as title candidates to be extracted. If desired, further refinement of the analysis can be performed using other geometric features, such as whether the fonts are bold-face, or by identifying which data represent strings that are centered on the page.
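As a minimal sketch of the filtering step, the following assumes each text string has already been reduced to a record carrying its orientation and font height. The `TextString` record and the optional `threshold` argument are illustrative names only; when no threshold is supplied, the average height for that orientation serves as the cutoff.

```python
from statistics import mean
from typing import NamedTuple, Optional


class TextString(NamedTuple):
    """Hypothetical record for one line of text found on the page."""
    label: str
    orientation: str   # "horizontal" or "vertical"
    height: float      # font height in pixels


def select_title_candidates(strings, threshold: Optional[float] = None):
    """Keep strings taller than their orientation's average (or threshold)."""
    candidates = []
    for orient in ("horizontal", "vertical"):
        group = [s for s in strings if s.orientation == orient]
        if not group:
            continue
        # Horizontal and vertical text are averaged separately.
        cutoff = threshold if threshold is not None else mean(s.height for s in group)
        candidates.extend(s for s in group if s.height > cutoff)
    return candidates
```

For example, given horizontal strings of heights 10, 10, 10 and 30, the horizontal average is 15 and only the string of height 30 is retained as a title candidate.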
After having selected the title candidates, the candidates are referenced back to the original bitmap data. Essentially, the bounding boxes of the connected components are merged into a single bounding box associated with the extracted title and that single bounding box is then referenced back to the bitmap data, so that any bitmap data appearing in the bounding box can be selected as an extracted title. If desired, the extracted title can be further processed using optical character recognition software, to convert the title image into title text.
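The merging and back-referencing step might look like the following sketch, which assumes bounding boxes expressed as `(left, top, right, bottom)` tuples with inclusive pixel coordinates and a bitmap held as a list of pixel rows. Both function names are illustrative.

```python
def merge_boxes(boxes):
    """Smallest single box enclosing all (left, top, right, bottom) boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def crop(bitmap, box):
    """Copy the bitmap pixels that fall inside an inclusive bounding box."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in bitmap[top:bottom + 1]]
```

The cropped region is the extracted title image, which could then be passed to optical character recognition software if title text is wanted.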
Similarly, after having selected the photo candidates, the candidates are again referenced back to the original bitmap data. The bounding boxes of photo candidates that overlap with each other are merged into a single bounding box so that bitmaps appearing within the bounding box can be selected and extracted as part of the photo. If desired, caption text associated with a photo region can be identified and processed using optical character recognition software. The caption text can then be used as a tag to help identify the content of the photo, or for later searching.
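One possible way to merge overlapping photo candidates is sketched below. It assumes `(left, top, right, bottom)` boxes with inclusive coordinates and repeats the merge until no two remaining boxes overlap, so that chains of overlapping fragments collapse into one region. The function names are illustrative only.

```python
def overlaps(a, b):
    """True if two inclusive (left, top, right, bottom) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_overlapping(boxes):
    """Merge boxes until no two of the returned boxes overlap."""
    merged = []
    for box in boxes:
        box = tuple(box)
        absorbed = True
        while absorbed:
            absorbed = False
            for other in merged:
                if overlaps(box, other):
                    # Absorb the overlapping box, then rescan: the enlarged
                    # box may now touch boxes it previously missed.
                    merged.remove(other)
                    box = union(box, other)
                    absorbed = True
                    break
        merged.append(box)
    return merged
```

Each box returned can then be referenced back to the bitmap in the same manner as an extracted title region.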