The present invention relates generally to accessing computerized information. More particularly, the invention relates to a method for extracting title text or other textual regions from bitmap images, such as are generated when documents are scanned. The extracted title text may be used in a number of ways, including keyword searching or indexing of bitmap image databases. In certain cases, such as newspaper headlines, the extracted title text may provide all the information needed.
The world is rapidly becoming an information society. Digital technology has enabled the creation of vast databases containing a wealth of information. The recent explosion in popularity of image-based systems is expected to lead to the creation of enormous databases that will present correspondingly large access challenges. In this regard, the explosion in popularity of the World Wide Web is but one example of how information technology is rapidly evolving towards an image-based paradigm.
Image-based systems present a major challenge to information retrieval. Whereas information retrieval technology is fairly well advanced in coded character-based systems, these retrieval techniques do not work in image-based systems. That is because image-based systems store information as bitmap data that correspond to the appearance of the printed page and not the information content of that page. Traditional techniques require the conversion of bitmap data into text data, through optical character recognition (OCR) software, before information retrieval systems can go to work.
Unfortunately, optical character recognition software is computationally expensive, and the recognition process is rather slow. When dealing with large quantities of image-based data, it is not practical to perform optical character recognition on the entire database. Furthermore, even where time and computational resources permit the wholesale OCR conversion of image data into text data, the result is still a large, unstructured database, without a short list of useful keywords that might allow a document of interest to be retrieved and reviewed. Searching through the entire database for selected keywords may not be the optimal answer, as full-text keyword searches often generate far too many hits to be useful. In addition, OCR conversion has difficulty providing adequate results when a document includes noise (such as that caused by transmission by facsimile), variable fonts or grayscale images. Therefore, there is a need for a method of extracting only those key portions of a document, such as titles, whose text describes the remaining portion of the document.
U.S. Pat. No. 5,818,978 ("the '978 Patent") discloses a character recognition system which is used to enhance prior art optical character recognition methods by thresholding a grayscale image and then locating characters using prior art segmentation techniques. However, because the system of the '978 Patent uses traditional segmentation methods on binary images thresholded at only a single level, unreasonable results will be produced for documents with complex graphics or documents having multiple regions with similar grayscale intensities.
U.S. Pat. No. 5,892,843 ("the '843 Patent"), assigned to the same assignee as the present invention, discloses a system for extracting titles, captions and photos from scanned images. However, since the '843 Patent, like the '978 Patent, uses traditional segmentation methods on a binary image (which may have been generated by thresholding a grayscale image), documents with complex graphics or documents having multiple regions with similar grayscale intensities will be processed by the system of the '843 Patent with less than ideal results. In addition, the method of the '843 Patent requires that the language of any document being processed be identified prior to processing. Other drawbacks of the method of the '843 Patent include an inability to locate titles positioned on more than one line, titles located within a photographic region, or titles having reverse video (i.e., white letters on a black background).
It is therefore an object of the present invention to provide a method for extracting titles and headlines from a scanned image which provides more reliable results than prior art methods of title extraction when the image has complex graphics or regions of similar grayscale intensities.
It is an additional object of the invention to provide a method for extracting titles and headlines from a scanned image which extracts titles positioned on more than one line.
It is yet a further object of the invention to provide a method for extracting titles and headlines from a scanned image which extracts titles within photographic regions.
It is another object of the invention to provide a method for extracting titles and headlines from a scanned image which extracts titles consisting of reverse video.
It is yet another object of the invention to provide a method for extracting titles and headlines from a scanned image which does not require that the language of the document being processed be identified prior to such processing.
Various other objects, advantages and features of the present invention will become readily apparent from the ensuing detailed description and the novel features will be particularly pointed out in the appended claims.
The present invention is directed to a method of delineating titles within a grayscale image, in which a grayscale image is received, e.g., by scanning a document, and then is subjected to thresholding, preferably of a multi-level nature, to obtain, preferably, a plurality of binary images representing the original grayscale image. Each of the binary images is preferably pre-processed to filter any noise components therein, and then all connected components within each of the binary images are identified and clustered to identify possible title regions therein. Next, each binary image is preferably post-processed to merge possible title regions comprising strokes and to remove non-title regions, e.g., photographic regions, from the previously identified possible title regions in each of the images by comparing characteristics of the previously identified possible title regions to pre-determined criteria. Further, certain of the previously identified possible title regions within each of the binary images which satisfy pre-determined criteria are preferably merged together. Still further, the previously identified possible title regions remaining in each of the binary images after the optional post-processing and merging steps are combined. Finally, certain of the previously identified possible title regions from separate binary images are preferably merged to produce identified title regions for further processing, e.g., for use as a mask to extract the title text from the original image.
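By way of illustration only, the identification of connected components within a single binary image may be sketched as follows. The sketch assumes that foreground pixels have the value 1, that 8-connectivity is used, and that each component is summarized by its bounding box; the function name `connected_components` and these choices are hypothetical and are not specified by the method described above. The subsequent clustering of components into possible title regions is not shown.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of the
    8-connected regions of foreground (value 1) pixels.

    `image` is a list of equal-length rows of 0/1 values.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

In such a sketch, the bounding boxes would be the inputs to whatever clustering and merging criteria are applied to form possible title regions.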
Preferably, the multi-level thresholding step is performed by generating a histogram of the number of runs that would result from thresholding the grayscale image at each intensity level within the grayscale image. Then a sliding profile of the histogram is generated and the peaks within the sliding profile are identified. The value at each of the peaks of the sliding profile is identified and those values are used for thresholding the grayscale image to produce a plurality of binary images.
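The multi-level thresholding step just described may be sketched, under stated assumptions, as follows. The sketch assumes 8-bit intensities, counts horizontal runs of pixels at or below each candidate level, uses a simple moving sum as the sliding profile, and reads "the value at each of the peaks" as the intensity level at which each peak occurs. The function names, the window width and the peak test are hypothetical choices and are not taken from the method itself.

```python
def count_runs(image, threshold):
    """Count horizontal runs of foreground pixels, where a pixel is
    foreground if its intensity is <= threshold (an assumption)."""
    runs = 0
    for row in image:
        previous = False
        for pixel in row:
            current = pixel <= threshold
            if current and not previous:
                runs += 1  # a new run starts here
            previous = current
    return runs


def run_histogram(image, levels=256):
    """Number of runs produced by thresholding at each intensity level."""
    return [count_runs(image, t) for t in range(levels)]


def sliding_profile(hist, width=3):
    """Moving sum of the histogram over a window of `width` bins."""
    half = width // 2
    n = len(hist)
    return [sum(hist[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def find_peaks(profile):
    """Indices of local maxima (left edge of a plateau counts once)."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]]


def multilevel_threshold(image, levels=256, width=3):
    """Return the peak levels and one binary image per level."""
    profile = sliding_profile(run_histogram(image, levels), width)
    thresholds = find_peaks(profile)
    binaries = [[[1 if p <= t else 0 for p in row] for row in image]
                for t in thresholds]
    return thresholds, binaries
```

Recomputing the run count for every level is simple but slow for large images; an actual implementation would likely accumulate the histogram in a single pass.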
Three methods are presented for performing the preprocessing step. The first method comprises an adaptive morphological method consisting of a recursive series of erosion steps and dilation steps. The second method consists of a hole-filling method comprising the steps of first generating a sliding window having a center region and four outer regions, and then moving the sliding window across the binary image. At each point within the binary image, the number of zero value pixels in each outer region is calculated. When each outer region contains at least one zero value pixel and the sum of the numbers of zero value pixels in the outer regions is greater than a first predetermined value, the pixel corresponding to the center region is set to 0; when that sum is less than or equal to a second predetermined value, the pixel corresponding to the center region is set to 1. Finally, the third method, which is the preferred method, comprises a morphological method by which a simple opening operation is performed using a predetermined structural element.
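The third preprocessing method, a simple opening, may be sketched as an erosion followed by a dilation. The sketch assumes that foreground pixels have the value 1, that the structural element is a square of odd side length (3 by default), and that pixels outside the image are treated as background; the actual predetermined structural element is not specified above, and the function names are hypothetical.

```python
def _window(image, y, x, radius):
    """Yield the pixels under a square element centered at (y, x);
    positions outside the image are treated as background (0)."""
    height, width = len(image), len(image[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < height and 0 <= nx < width
            yield image[ny][nx] if inside else 0


def erode(image, size=3):
    """Keep a pixel only if every pixel under the element is foreground."""
    r = size // 2
    return [[1 if all(_window(image, y, x, r)) else 0
             for x in range(len(image[0]))]
            for y in range(len(image))]


def dilate(image, size=3):
    """Set a pixel if any pixel under the element is foreground."""
    r = size // 2
    return [[1 if any(_window(image, y, x, r)) else 0
             for x in range(len(image[0]))]
            for y in range(len(image))]


def opening(image, size=3):
    """Morphological opening: erosion followed by dilation."""
    return dilate(erode(image, size), size)
```

With a 3 by 3 element, an isolated foreground pixel is removed while a solid 3 by 3 block is preserved, which is the noise-filtering behavior the preprocessing step is intended to provide.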