1. Field of the Invention
The present invention relates to a document image recognition apparatus and a computer-readable storage medium storing a document image recognition program for recognizing a document image by detecting the tilt of a document image of a document, etc. read by an image scanner or received from a facsimile device, amending the tilt, and extracting character lines and columns.
To read a larger volume of documents through an optical character reader (OCR) engine, it is necessary to provide a function for analyzing the layout of document text containing both vertical and horizontal character lines, such as Japanese newspaper text. As technologies required to analyze the layout of such text, the present invention provides new technologies for detecting the tilt of text so that the tilt of a document image can be correctly amended, and for extracting lines and columns so that document images can be correctly recognized.
2. Description of the Related Art
(1) Detecting the Tilt of a Document Image
To read a common printed document, it is first necessary to obtain a document image using an image input device such as an image scanner. At this time, the original document is normally set on the device with some tilt. To use the document image in electronic filing or document recognition, the tilt of the document image should be detected and amended.
In the conventional tilt detecting technology, it is assumed that characters are regularly arranged in a text area which forms an important part of a document image.
For example, the first system is suggested by the ‘A Fast Algorithm for the Skew Normalization of Document Images’ by Nakano, et al. in the publication D, vol. J69-D, No. 11, pp. 1833-1834 from the Transactions of the Institute of Electronics and Communication Engineers of Japan. That is, the tilt of a character string is estimated by assuming that the reference line of the character string is almost regularly provided, performing the Hough transformation on the coordinate value of the lower end of a character block, and detecting the peak value in the Hough space.
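The idea behind the first system can be illustrated with a minimal Python sketch. It is not the algorithm of the cited paper: the function name `hough_skew`, the angle range, the angle step, and the accumulator cell size `rho_step` are all assumptions made for illustration. The lower-end coordinates of character blocks vote in a quantized Hough space, and the candidate angle with the highest accumulator peak is taken as the tilt.

```python
import math
from collections import Counter

def hough_skew(points, max_angle_deg=10.0, angle_step_deg=0.5, rho_step=2.0):
    """Return the candidate angle (degrees) whose Hough accumulator has the highest peak.

    points: non-empty list of (x, y) lower-end coordinates of character blocks.
    All parameter names and defaults are illustrative assumptions.
    """
    n_steps = int(round(2 * max_angle_deg / angle_step_deg)) + 1
    best_angle, best_votes = 0.0, -1
    for i in range(n_steps):
        angle = -max_angle_deg + i * angle_step_deg
        t = math.radians(angle)
        cos_t, sin_t = math.cos(t), math.sin(t)
        # Points on a line of slope tan(t) share the same offset rho,
        # so they vote for the same quantized accumulator cell.
        votes = Counter(round((y * cos_t - x * sin_t) / rho_step) for x, y in points)
        peak = max(votes.values())
        if peak > best_votes:
            best_angle, best_votes = angle, peak
    return best_angle
```

Every point votes once for every candidate angle, which is the source of the computational cost associated with the Hough transformation.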
The second system is suggested by the ‘Document Image Tilt Detection Apparatus’ by Mizuno, et al. in Tokukaihei 7-192085. That is, the tilt of a character string is estimated by extracting the connected components of characters, generating a provisional character line by combining vicinal connected components, and obtaining a straight line touching the provisional character line.
The third system is suggested by the ‘Document Tilt Amendment Apparatus’ by Saito, et al. in Tokukaihei 2-170280. That is, a document image is provisionally amended by sequentially changing the tilt angle θ, and the angle θ for the smallest area of the enclosing rectangle containing all black pixels in the amended image is obtained.
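The angle sweep of the third system can be sketched as follows. This is an illustrative simplification, not the cited apparatus: for brevity it rotates the coordinates of the black pixels instead of the image itself, and the names `bbox_area_after_rotation` and `min_area_skew`, the angle range, and the step are assumptions.

```python
import math

def bbox_area_after_rotation(points, angle_deg):
    """Area of the axis-aligned enclosing rectangle after undoing a tilt of angle_deg."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    xs = [x * c + y * s for x, y in points]
    ys = [-x * s + y * c for x, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def min_area_skew(points, max_angle_deg=10.0, step_deg=0.5):
    """Return the candidate angle (degrees) giving the smallest enclosing rectangle.

    points: (x, y) coordinates of black pixels. Range and step are assumed values.
    """
    n_steps = int(round(2 * max_angle_deg / step_deg)) + 1
    candidates = [-max_angle_deg + i * step_deg for i in range(n_steps)]
    return min(candidates, key=lambda a: bbox_area_after_rotation(points, a))
```

The only evidence used for each candidate angle is a single area value, and every candidate requires a full pass over the pixels.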
(2) Layout Analysis (Extracting Lines and Columns)
Conventionally, the following methods have been suggested for extracting lines and columns of character strings in a document image containing vertical and horizontal arrangements of characters.
For example, the fourth system is suggested by the ‘Document Image Processing Apparatus’ by Tsujimoto, et al. in Tokukaihei 1-183783. That is, the column of an input document can be automatically determined by projecting a character line of an input document in a specific direction, and generating a projective distribution.
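A generic projection-based column split, given here only to illustrate the kind of processing involved and not as the method of the cited publication, might look like the following Python sketch. The box format `(x0, y0, x1, y1)`, the function names, and the `min_gap` parameter are assumptions.

```python
def projection_profile(boxes, length, axis=0):
    """Count how many character-line boxes cover each position along one axis.

    boxes: (x0, y0, x1, y1) rectangles with exclusive upper bounds.
    axis=0 projects onto the x axis, axis=1 onto the y axis.
    """
    profile = [0] * length
    for x0, y0, x1, y1 in boxes:
        lo, hi = (x0, x1) if axis == 0 else (y0, y1)
        for p in range(max(lo, 0), min(hi, length)):
            profile[p] += 1
    return profile

def split_at_gaps(profile, min_gap=3):
    """Return half-open (start, end) runs of non-zero projection.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs, start = [], None
    for i, v in enumerate(profile + [0]):  # trailing 0 closes the last run
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged
```

Each returned run is a column candidate along the chosen axis. Because the profile is built from character lines that were extracted beforehand, any error in that preliminary extraction carries over into the column boundaries.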
Furthermore, the fifth system is suggested by the ‘Document Image Processing Apparatus’ by Mizutani, et al. in Tokukaihei 5-174179. That is, columns are extracted using an area in which no components are arranged in an input document.
The sixth system is suggested by the ‘Character String Extracting Method and Apparatus’ by Hiramoto, et al. in Tokukaihei 10-31716. That is, character lines arranged in different directions are extracted from a document containing areas whose characters differ in size and pitch.
Many Japanese printed documents, for example, contain both vertical and horizontal arrangements of characters. Therefore, it is necessary to appropriately extract character lines and columns when document text is recognized.
However, there are the following problems with the above described conventional systems.
(1) Problems in Detecting the Tilt of a Document Image
Since the above described first system assumes that the lines are arranged in a fixed direction, the system cannot be applied to a document containing both horizontal and vertical character lines, as in a Japanese newspaper. Furthermore, since not all characters are arranged on a reference line even in a document having character lines in a fixed direction, error cannot be avoided. Additionally, there is another problem that the Hough transformation process requires a large volume of computation.
In the above described second system, there is the possibility that a large error may occur because, as in a Japanese newspaper, a character line can be mistakenly extracted as a horizontal character line from a column having vertical character lines.
Although the above described third system is designed to detect the tilt of a document text containing both horizontal and vertical character lines, the tilt angle is detected from only a small amount of information, namely the area of an enclosing rectangle containing the black pixels of a document image. Therefore, there is the problem that the precision of a detected tilt is unstable. Furthermore, since it is necessary to repeatedly perform the process of extracting a rectangular area by rotating the image itself, a large volume of computation is required.
(2) Problems with Layout Analysis
Since the above described fourth system preliminarily extracts character lines and performs a column extracting process based on the preliminary extraction, a non-uniform column containing a number of small character line portions may be wrongly divided into small portions.
Since the fifth system extracts a column using a blank area, there is the possibility that a column can be mistakenly extracted when a document contains a space between lines larger than a space between columns.
This is a serious problem with a document image of the text formed by closely arranged vertical and horizontal character lines. For example, if a document image contains a small space between the vertically written article and the caption of the photograph as shown by a rectangular box below the photograph area at the upper left corner on the newspaper shown in FIG. 1, then the article and the caption are mistakenly recognized as one column and the characters in each line of the horizontally written caption are mistakenly recognized as the leading two characters of the vertically written article.
Since the column area is extracted in the sixth system as a preprocess performed before a very precise line extracting process, a non-uniform column containing a number of small character line portions may be wrongly divided into small portions, which results in a wrong line extracting process.
That is, in the above described technologies, either (1) (basic element set) → line extracting process → column extracting process → (layout analysis result) or (2) (basic element set) → column extracting process → line extracting process → (layout analysis result) is followed, based on the bottom-up process or the top-down process. In the above described technologies, it is assumed that the line extracting process and the column extracting process are independent of each other, and lines and columns are extracted by performing the processes sequentially, thereby causing the problems with these technologies.
Based on the above described background, the present invention has been developed to provide a document image recognition apparatus capable of detecting the tilt of a document containing both horizontal and vertical character lines at a high speed and with high precision, and extracting lines and columns with high precision even if a document image having a complicated structure with both horizontal and vertical character lines is to be recognized.
One of the embodiments of the present invention is an apparatus for recognizing a document image stored as electronic data by amending the tilt of the document image. The apparatus includes a character element extraction unit for referring to a document image stored as electronic data and extracting a set of elements forming characters from the document image; a line candidate extraction unit for referring to the extracted set of character elements and extracting candidates for horizontal character lines and vertical character lines from the set of character elements; a line reliability estimation unit for estimating the reliability of an extracted candidate; a line extraction unit for extracting a set of probable lines based on the estimated reliability; and a tilt estimation unit for estimating the tilt of the document image based on the arrangement of the character elements contained in the extracted set of probable lines.
According to the present embodiment, using a set of character elements extracted by the character element extraction unit, a line candidate is extracted, the reliability of the line candidate is estimated, a probable line is extracted according to the reliability, and then the tilt of the document image is estimated. That is, according to the present embodiment, the document image itself is not processed (rotated, etc.) to detect its tilt. As a result, the amount of computation can be considerably reduced. Furthermore, according to the present embodiment, the line candidate extraction unit extracts candidates in both the horizontal and the vertical directions, and the line reliability estimation unit and the line extraction unit extract a set of probable horizontal and vertical character lines. Therefore, the tilt of a document containing both horizontal and vertical character lines can be detected and amended. Furthermore, since the tilt estimation unit estimates the tilt using only the character elements forming probable lines, the tilt can be estimated with high precision and resistance to noise.
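The flow described above, from line candidates through a reliability test to a tilt estimate, can be sketched in Python as follows. This is only an illustration under stated assumptions and not the implementation of the embodiment: the reliability test (a minimum number of character elements per candidate), the least-squares fit, the use of the median, and the names `line_angle` and `estimate_tilt` are all hypothetical. Each candidate is assumed to be given as a direction ("h" or "v") and a list of (x, y) positions of its character elements.

```python
import math
from statistics import median

def line_angle(points, direction):
    """Least-squares tilt (degrees) of one line candidate, or None if degenerate."""
    # Regress across the reading direction: y on x for horizontal lines,
    # x on y for vertical lines.
    if direction == "h":
        us, vs = [p[0] for p in points], [p[1] for p in points]
    else:
        us, vs = [p[1] for p in points], [p[0] for p in points]
    n = len(points)
    mu, mv = sum(us) / n, sum(vs) / n
    suu = sum((u - mu) ** 2 for u in us)
    if suu == 0:
        return None
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    angle = math.degrees(math.atan(suv / suu))
    # A page rotated by +a gives slope tan(a) on horizontal lines
    # and -tan(a) on vertical lines, so the vertical result is negated.
    return angle if direction == "h" else -angle

def estimate_tilt(candidates, min_elements=5):
    """candidates: list of (direction, points). Returns the median tilt in degrees, or None."""
    angles = []
    for direction, points in candidates:
        if len(points) < min_elements:  # hypothetical reliability test
            continue
        a = line_angle(points, direction)
        if a is not None:
            angles.append(a)
    return median(angles) if angles else None
```

Only element coordinates are handled, so the image is never rotated, and horizontal and vertical candidates contribute to the same estimate.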
Another embodiment of the present invention is an apparatus for recognizing a document image by analyzing the layout of a document indicated by a document image which is stored as electronic data and is to be recognized. The apparatus includes a basic line extraction unit for extracting a set of lines in a fixed direction from a set of basic elements forming the document image stored as electronic data; and a line/column reciprocal extraction unit for extracting a line and a column by reciprocally extracting a column based on the association between lines and extracting a line based on the restrictions by the extracted column.
The feature of this embodiment of the present invention resides in that the extraction of lines is correlated to the extraction of columns, that is, a line extraction result is reflected in a column extracting process while a column extraction result is reflected in a line extracting process. Based on this feature, lines and columns can be extracted with high precision from a document image having a complicated structure in which horizontal and vertical character lines are contained in a mixed manner, a non-uniform column containing small divided character line portions exists, a space between columns is smaller than a space between lines, etc.