1. Field of the Invention
This invention relates to a layout analysis program, a layout analysis apparatus, a layout analysis method and a medium for extracting a text block or the like from an image.
2. Description of the Related Art
An OCR (optical character reader) recognizes the layout of a document image, typically read by means of a scanner, and the characters in one or more character regions of that image. In recent years, OCR applications and document management systems for storing, retrieving and/or reusing ordinary documents and other documents have been attracting attention. Most recently, OCRs have been required to scan not only black and white documents but also color documents, typically in order to meet the requirements of the e-document law.
In the field of the OCR technology for color images, the following processes are executed in the sequence shown below.
1. Layout analysis process
2. Binarization process
3. Character recognition process in a character region
Of the three processes listed above, the layout analysis process tends to be less accurate than the other two. This tendency is particularly marked when the layout analysis process is executed on a color image.
Now, the configuration of a known layout analysis apparatus for analyzing the layout of a color image will be discussed below as an example. FIG. 17 is a schematic block diagram of the known layout analysis apparatus for analyzing the layout of a color image, showing its configuration. The layout analysis apparatus comprises an image acquiring section 101, a NiblackDeltaGNoiseRemoveFast binarizing section 102, a binary image layout analyzing section 103, a text block dividing section 104, a text block reconfiguring section 105 and a layout information generating section 106.
Now, the operation of the known layout analysis apparatus for analyzing the layout of a color image will be described below. Firstly, the image acquiring section 101 acquires a color image. Then, the NiblackDeltaGNoiseRemoveFast binarizing section 102 executes a NiblackDeltaGNoiseRemoveFast binarization process, which is based on the Niblack binarization process, on the acquired color image. Thereafter, the binary image layout analyzing section 103 executes a binary image layout analysis process, which is a layout analysis process for binary images. The technique described in Patent Document 1 [Jpn. Pat. Appln. Laid-Open Publication No. 11-219407] is used here for the binary image layout analysis process. As a result, text blocks, which contain character elements, and graphic separator blocks (picture regions, table regions, separators, frame regions), which contain non-character elements, are extracted.
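For illustration only, the plain Niblack binarization on which the above binarization process is based can be sketched as follows. This is a minimal sketch, not the NiblackDeltaGNoiseRemoveFast process itself, whose details are not given here. The function name niblack_binarize, the window size and the weight k = -0.2 are assumptions chosen for the example, and the input is assumed to be a grayscale image already derived from the color image.

```python
import math

def niblack_binarize(gray, window=3, k=-0.2):
    """Binarize a grayscale image (list of rows of 0-255 values).

    Each pixel is compared with a local threshold T = mean + k * std,
    computed over a window centred on the pixel (clipped at the borders).
    Returns 1 for black (foreground) pixels and 0 for white pixels.
    The window size and k are illustrative assumptions.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            threshold = mean + k * math.sqrt(var)
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out

# Hypothetical 3x3 image: one dark pixel on a light background.
sample = [[200, 200, 200],
          [200, 10, 200],
          [200, 200, 200]]
binary = niblack_binarize(sample)
```

Because the threshold follows the local mean and standard deviation, the same code adapts to regions of differing brightness, which is the property that makes this family of methods attractive for preserving thin structures such as ruled lines.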
The text block dividing section 104 then divides each of the text blocks. This process is executed because the columns in a page of a newspaper may not be extracted properly and two or more columns may be extracted collectively as a single column. In this dividing process, the black pixels in a text block are projected vertically and horizontally, a histogram of the periodicity of the projected black pixels is generated, and the positions to be used for the division are determined on the basis of the histogram.
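The projection step of this dividing process can be sketched as follows. This is a simplified illustration under stated assumptions: it treats a run of empty pixel columns as a candidate division position and does not reproduce the periodicity analysis of the known apparatus. The function names and the min_gap parameter are hypothetical.

```python
def vertical_projection(binary):
    """Count black pixels (value 1) in each pixel column of a text block."""
    if not binary:
        return []
    return [sum(row[x] for row in binary) for x in range(len(binary[0]))]

def find_split_positions(profile, min_gap=2):
    """Return the centre x of each interior run of empty columns.

    A run qualifies when it is at least min_gap columns wide and does not
    touch the left or right border of the block (min_gap is an assumption).
    """
    splits = []
    run_start = None
    for x, count in enumerate(profile):
        if count == 0:
            if run_start is None:
                run_start = x
        else:
            if run_start is not None:
                if run_start > 0 and x - run_start >= min_gap:
                    splits.append((run_start + x - 1) // 2)
                run_start = None
    return splits

# Hypothetical block: two columns of text separated by three empty columns.
block = [[1, 1, 0, 0, 0, 1, 1],
         [1, 1, 0, 0, 0, 1, 1]]
profile = vertical_projection(block)
splits = find_split_positions(profile)
```

The horizontal projection is obtained in the same way by summing each row instead of each column.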
Thereafter, the text block reconfiguring section 105 reconfigures the text blocks by combining two adjacent text blocks when the top and bottom coordinates and the left end and right end coordinates of the adjacent text blocks are located close to each other. Subsequently, the layout information generating section 106 outputs the obtained text blocks and the graphic separator blocks as layout information to end the layout analysis.
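One possible reading of this reconfiguring step is sketched below. The exact closeness criterion of the known apparatus is not stated, so the rule used here (matching edges within a tolerance, with the facing edges also close) and the names should_merge, reconfigure and tol are assumptions made for illustration. Blocks are represented as (left, top, right, bottom) tuples.

```python
def close(a, b, tol):
    """True when two coordinates differ by at most tol pixels."""
    return abs(a - b) <= tol

def should_merge(a, b, tol=5):
    """Decide whether two blocks (left, top, right, bottom) are adjacent."""
    same_rows = close(a[1], b[1], tol) and close(a[3], b[3], tol)
    side_by_side = same_rows and (close(a[2], b[0], tol) or close(b[2], a[0], tol))
    same_cols = close(a[0], b[0], tol) and close(a[2], b[2], tol)
    stacked = same_cols and (close(a[3], b[1], tol) or close(b[3], a[1], tol))
    return side_by_side or stacked

def merge(a, b):
    """Return the bounding box that encloses both blocks."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def reconfigure(blocks, tol=5):
    """Repeatedly merge adjacent blocks until no pair qualifies."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if should_merge(blocks[i], blocks[j], tol):
                    blocks[i] = merge(blocks[i], blocks[j])
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks

# Hypothetical input: two stacked fragments of one column and a distant block.
result = reconfigure([(0, 0, 100, 20), (0, 22, 100, 40), (300, 300, 400, 320)])
```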
Patent Document 2 [Jpn. Pat. Appln. Laid-Open Publication No. 2001-184511] describes an image processing apparatus, an image processing system, an image processing method and a storage medium adapted to acquire a plurality of binary images from a multilevel image that is an original image, extract regions containing aggregates of black pixels from the plurality of binary images, divide the regions according to the crowded condition of starting pixels and ending pixels of each aggregate of black pixels, and identify the attributes (characters, pictures, etc.) of each of the regions produced by the division on the basis of the histogram of the original image in that region.
Patent Document 3 [PCT Republication No. 00/62243] describes an apparatus and a method for extracting a character string according to the basic components of a document image, adapted to extract basic components of a document image, which may be a binary image, a multilevel image, a color image or some other image, and determine whether each component is a character component or not by using the relation of inclusion among the basic components. Then, a set of character components is extracted according to the outcome of the determination and strings of characters are extracted from the set of character components. Thereafter, the binary image generating section of the character string extracting apparatus binarizes the lightness component of each pixel according to a predetermined threshold value and generates a binary image that is constituted by pixels having either a value that corresponds to a drawn region or a value that corresponds to a background region. Additionally, the binary image generating section extracts character patterns highly accurately, although it cannot reliably extract picture patterns and table patterns. Each character part of white characters on a black background is reversed and extracted as a character part of black characters on a white background.
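The lightness binarization and the reversal of white characters on a black background described above can be sketched as follows. This is an illustrative sketch only: the threshold of 0.5, the use of HLS lightness, and the majority rule used to decide when a part is reversed are assumptions, and Patent Document 3 may use different criteria.

```python
import colorsys

def lightness(rgb):
    """HLS lightness (0.0 to 1.0) of an (R, G, B) pixel with 0-255 channels."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hls(r, g, b)[1]

def binarize_lightness(image, threshold=0.5):
    """Mark pixels darker than the threshold as drawn (1), others as background (0)."""
    return [[1 if lightness(p) < threshold else 0 for p in row] for row in image]

def normalize_polarity(binary):
    """Reverse a part whose drawn pixels outnumber its background pixels.

    This treats a mostly dark part as white characters on a black background
    (an assumed heuristic) and returns it as black characters on white.
    """
    total = sum(len(row) for row in binary)
    black = sum(sum(row) for row in binary)
    if total and black * 2 > total:
        return [[1 - v for v in row] for row in binary]
    return binary

# Hypothetical 2x2 part: one white pixel on a black background.
part = [[(0, 0, 0), (255, 255, 255)],
        [(0, 0, 0), (0, 0, 0)]]
raw = binarize_lightness(part)
normalized = normalize_polarity(raw)
```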
However, among the above-described known layout analysis techniques, the one adapted to use only a single binarization method cannot extract both characters and graphics highly accurately. Additionally, it cannot cope with a plurality of background colors and white characters in a character region. For example, while the above-described NiblackDeltaGNoiseRemoveFast binarization process can maintain the continuity of ruled lines, it cannot extract white characters on a black background. Additionally, it sometimes cannot extract a text block correctly when characters and pictures are arranged close to each other because they can easily touch each other.
Techniques for extracting a character region by means of a histogram of a multilevel image like the one disclosed in Patent Document 2 cannot provide a high degree of accuracy. Generally, a character region extracted from a binary image is more accurate than a character region extracted from a multilevel (gradated) image. Additionally, the technique of Patent Document 2 detects regions from a plurality of binary images but, when generating a histogram of an original image for the larger region of two regions that show a relation of complete inclusion, it only uses a relation of excluding the smaller region.