1. Field of the Invention
The present invention relates to an image data recognizing process, and in particular, to a title extracting apparatus for extracting a title region from a document image obtained as image data of a document and a method thereof.
2. Description of the Related Art
Related art references for extracting a partial region, such as the title of a document, from a document image (that is, image data obtained from a conventional document by a photoelectric converting device such as a scanner) include the following:
(1) A title is extracted from a document with fixed regions (as disclosed in Japanese Patent Laid-Open Publication No. 64-46873).
(2) A title portion of a document is marked with a particular marking means such as a color marker or frame lines. The document is then scanned by a scanner and the title portion is extracted (as disclosed in Japanese Patent Laid-Open Publication No. 01-150974).
(3) A physical structure of a document, such as a character string or a photograph, is represented as a tree structure or the like. By matching this tree structure against tree structures that represent logical structures, the physical structures are tagged with "title", "writer name", and so forth (as disclosed in Japanese Patent Laid-Open Publication Nos. 01-183784, 05-342326, and so forth).
(4) A region that is a part of a document image is assigned. The inside of the region is projected and a histogram of black pixels is generated. A continuous range in which the projected black pixel values lie between two predetermined thresholds is obtained. A portion in which the length of this continuous range exceeds another predetermined threshold is extracted as a title (as disclosed in Japanese Patent Laid-Open Publication No. 05-274471).
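The projection approach of method (4) can be sketched as follows. This is only an illustrative sketch and not the implementation of the cited publication: the projection direction (onto rows), the binary image representation (1 for a black pixel), the function names row_projection and find_title_bands, and all threshold values are assumptions chosen for the example.

```python
def row_projection(image):
    """Count black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def find_title_bands(image, low, high, min_length):
    """Return (first_row, last_row) ranges whose projection stays within
    [low, high] for more than min_length consecutive rows.

    low, high, and min_length stand in for the three predetermined
    thresholds; their values are hypothetical.
    """
    hist = row_projection(image)
    bands = []
    start = None
    # A trailing None acts as a sentinel that closes a run reaching the last row.
    for y, count in enumerate(hist + [None]):
        inside = count is not None and low <= count <= high
        if inside and start is None:
            start = y
        elif not inside and start is not None:
            if y - start > min_length:
                bands.append((start, y - 1))
            start = None
    return bands


def make_row(width, black):
    """Build one row with `black` black pixels followed by white pixels."""
    return [1] * black + [0] * (width - black)


# Toy page: a tall text line (rows 1-4), a short low-density line
# (rows 6-7), and a solid ruled line (row 8).
page = [make_row(10, n) for n in [0, 6, 6, 6, 6, 0, 3, 3, 10]]
bands = find_title_bands(page, low=5, high=8, min_length=3)
```

In this toy page only rows 1 to 4 are returned. The result depends entirely on the chosen thresholds and on which region is projected.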
In addition, the following related art references for extracting a partial region such as a title from a document image that includes a table are known.
(5) A title is extracted from a formatted document including a table (as disclosed in Japanese Patent Laid-Open Publication No. 07-093348).
(6) A document image is projected and a histogram of black pixels is generated. Frame lines are extracted from the distribution of the histogram. A character string surrounded by the frame lines is extracted as a title (as disclosed in Japanese Patent Laid-Open Publication No. 05-274367).
(7) Characters of all character regions in a document image are recognized. A knowledge process such as keyword collation and morphological analysis is linguistically and logically performed on the obtained character codes. A character string that is likely to be a title is extracted from a result of the knowledge process (as disclosed in Japanese Patent Laid-Open Publication No. 03-276260).
(8) A region surrounded by a white pixel connected portion in a document image is extracted as a table portion. Ruled lines are extracted from the inside of the table. A region surrounded by the ruled lines is obtained. An image in the obtained region is template-matched with a predetermined character string (template). Thus, a character string that matches the template is extracted as a title (as disclosed in Japanese Patent Laid-Open Publication No. 03-74728).
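The frame-line extraction of method (6) can be illustrated with a minimal sketch, assuming a binary image (1 for a black pixel) and assuming that a ruled line shows up as a row or column whose black pixel count covers most of the image width or height. The function names line_positions and frame_cells and the ratio of 0.8 are hypothetical and are not taken from the cited publication.

```python
def line_positions(profile, length, ratio=0.8):
    """Indices whose black-pixel count covers at least `ratio` of the span."""
    return [i for i, count in enumerate(profile) if count >= ratio * length]


def frame_cells(image, ratio=0.8):
    """Return interior boxes (top, left, bottom, right) enclosed by
    adjacent horizontal and vertical ruled lines found by projection."""
    height, width = len(image), len(image[0])
    row_profile = [sum(row) for row in image]        # projection onto rows
    col_profile = [sum(col) for col in zip(*image)]  # projection onto columns
    h_lines = line_positions(row_profile, width, ratio)
    v_lines = line_positions(col_profile, height, ratio)
    cells = []
    for top, bottom in zip(h_lines, h_lines[1:]):
        for left, right in zip(v_lines, v_lines[1:]):
            # Skip pairs of adjacent indices, which enclose no interior pixels.
            if bottom - top > 1 and right - left > 1:
                cells.append((top + 1, left + 1, bottom - 1, right - 1))
    return cells


# Toy table: a single frame enclosing two black "character" pixels.
table = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
cells = frame_cells(table)
```

In this sketch a ruled line that spans only part of the image falls below the ratio and is not detected, so the approach works only for simple frames.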
However, these related art references have the following problems.
In methods (1) and (5), only formatted documents are processed. When a format is changed, the assignment of a portion to be extracted must also be changed.
In method (2), it is burdensome to mark an original document.
In method (3), a dictionary of logical structures represented with tree structures or the like must be prepared. When the logical structure of a document is not contained in the dictionary, a title cannot be precisely extracted.
In method (4), the method for assigning a region of a document image is not clear. If the method is applied to all regions of the document image, a large black pixel portion such as a table or a chart will be incorrectly extracted as a title. Moreover, in a document that contains only characters, a character string in a large font is not always a title. Thus, a title may not be correctly extracted.
In method (6), a title may be extracted if a table containing the title is surrounded by simple ruled lines. However, when a table contains complicated ruled lines, a title region cannot be precisely distinguished.
In method (7), the currently available character recognizing process takes a long time. Thus, in practice, this method can be used only as a batch process. In addition, since the recognition rate is not 100%, an incorrect portion may be extracted as a title unless information on the title position is used.
In method (8), the template matching process for an image takes time. In addition, the process is adversely affected by the shape and size of a font used in the template. Moreover, in this method, only predetermined character strings can be extracted as titles. Thus, in this method, the types of documents that can be processed are limited.
Thus, the conventional title extracting methods require special preparations or special operations. In addition, the documents and titles that can be processed by such methods are limited.