1. Field of the Invention
The present invention relates to a character string extraction apparatus for extracting a character string from a document image based on the basic components of information, such as characters, graphics, etc., included in a document image and a method thereof.
2. Description of the Related Art
A character string pattern in a document image corresponds to a sequence of one or more character patterns, and a character pattern corresponds to a pattern, such as a character, symbol, etc., of an arbitrary language. A character string extraction apparatus receives a document image as input, extracts a character string pattern from the document image and supplies the extracted character string pattern to a subsequent character encoding process or retrieval process. Character string extraction apparatuses currently available as products take a binary document image as input.
Recently, document management systems for sharing information have attracted attention, and a mechanism for uniformly managing a variety of documents, such as structured electronic documents, unstructured raw image documents and documents recorded on paper, is also in demand.
Therefore, a character string extraction apparatus that extracts text information from a document image for the purpose of information retrieval is highly anticipated as a technology for retrieving unstructured image documents and paper documents. In particular, since gray scale documents and color documents containing photographs have become more common, the need for a technology that accurately extracts character strings from these documents has greatly increased.
To meet this demand, several general-purpose character string extraction technologies that can handle a document in which a variety of pieces of information are mixed have been proposed. These technologies do not require any foreknowledge of the document structure, and they take into consideration a mixture of figures and text, a mixture of horizontally and vertically written sentences, and the extraction of a character string in a figure. Some typical methods are described below.
However, methods using image gradation, methods using the projection distribution of black pixels and methods using image features in a local area are excluded from this description, since these methods are not suitable for the extraction of a complicatedly indented character string or the extraction of a character string in a figure.
A conventional character string extraction technology is based on the following basic idea. First, an aggregate of character components, each of which is an image pattern representing a part of a character or the entire character, is extracted by some method. Then, a character string is extracted as a partial aggregate of the character components using the size homogeneity and spatial closeness between the character components. In this case, the accuracy of character component extraction greatly affects the accuracy of character string extraction. The following methods are used for conventional character component extraction.
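The grouping idea described above can be sketched as follows. This is only an illustrative sketch, not the method of any cited publication: the function names, the box representation `(x0, y0, x1, y1)` and the parameters `ratio` and `gap_factor` are all assumptions introduced here, and only horizontally written strings are handled.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def similar_size(a: Box, b: Box, ratio: float = 1.5) -> bool:
    """Size homogeneity: the two heights differ by at most `ratio` (assumed)."""
    ha, hb = a[3] - a[1], b[3] - b[1]
    return max(ha, hb) <= ratio * min(ha, hb)


def close(a: Box, b: Box, gap_factor: float = 1.0) -> bool:
    """Spatial closeness: boxes share rows and the horizontal gap is small."""
    gap = max(a[0], b[0]) - min(a[2], b[2])
    overlap_y = min(a[3], b[3]) - max(a[1], b[1])
    tallest = max(a[3] - a[1], b[3] - b[1])
    return overlap_y > 0 and gap <= gap_factor * tallest


def group_strings(boxes: List[Box]) -> List[List[Box]]:
    """Merge components that are pairwise similar and close (union-find)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if similar_size(boxes[i], boxes[j]) and close(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return [sorted(g) for g in groups.values()]
```

Because merging is transitive, two components at opposite ends of a line end up in the same string even though they are not close to each other directly.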
In the first method of character component extraction, an aggregate of basic components, each of which is an image pattern representing a part of a character or the entire character, graphic, etc., is extracted by some method, the basic components are classified according to size/shape and only character components are extracted.
For example, in Patent Application Laid-open Nos. 61-072374 (Character Recognition Apparatus) and 61-026150 (Document Image File Registration Retrieval Apparatus) of Japanese Laid-open Patent Gazette, character components are extracted based on the assumption that the sizes of characters in a document image are almost the same.
In Patent Application Laid-open Nos. 62-165284 (Character String Extraction System) and 09-167233 (Image Processing Method and Image Processing Apparatus), the circumscribed rectangle of a black pixel joint component in a binary image is designated as a basic component, and a basic component whose size is a specific value or less is assumed to be a character component and is extracted.
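As a minimal sketch of this kind of size-based extraction, the following code labels black pixel joint components (connected components) in a binary image, takes their circumscribed rectangles, and keeps those no larger than a given size. The 8-connectivity, the image representation (a list of rows with 1 for black) and the names `bounding_boxes`, `small_components` and `max_size` are assumptions made for illustration; the cited publications are not known to use this exact procedure.

```python
from collections import deque
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def bounding_boxes(image: List[List[int]]) -> List[Box]:
    """Circumscribed rectangles of 8-connected black (1) pixel components."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes


def small_components(boxes: List[Box], max_size: int) -> List[Box]:
    """Keep boxes whose width and height are both at most max_size (assumed rule)."""
    return [b for b in boxes
            if b[2] - b[0] <= max_size and b[3] - b[1] <= max_size]
```

A fixed size threshold of this kind cannot separate a character from a graphic element of similar size.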
In Patent Application Laid-open No. 06-111060 (Optical Character Reading Apparatus), a joint component for each color of a color image is designated as a basic component, a basic component of the size of a specific value or less is assumed to be a character component and is extracted, thereby realizing character string extraction from a color image.
In the second method of character component extraction, an aggregate of basic components is classified into character components and non-character components, using a confidence degree obtained by performing character or character string recognition for a character string candidate which is composed of a basic component or an aggregate of basic components.
For example, in Patent Application Laid-open No. 05-028305 (Image Recognition Apparatus and Recognition Method), character string candidates are generated by the spatial closeness between basic components, only candidates which seem to be a character string are selected based on the evaluation value obtained as a result of character recognition, and not only basic components but also character strings are extracted.
A method combining the first and second methods described above has also been proposed. For example, in Patent Application Laid-open No. 07-168911 (Document Recognition Apparatus), the circumscribed rectangles of black pixel joint components are designated as basic components and the basic components are classified into four groups, namely a character candidate, a graphic candidate, a ruled line candidate and an image candidate, based on the size and the ratio of the vertical and horizontal lengths. If the confidence degree obtained by performing character recognition for a character candidate is low, the character candidate is changed to a graphic candidate. If the confidence degree obtained by performing character recognition for a graphic candidate is high, the graphic candidate is changed to a character candidate. The character components are then extracted.
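The two-stage flow of such a combined method might look like the sketch below. The text only states that the four groups are decided from the size and the ratio of the vertical and horizontal lengths, so every rule and numeric value here (`char_max`, `line_ratio`, `image_min`, `low`, `high`) is a hypothetical placeholder, and the confidence degree is assumed to come from an external character recognizer as a value between 0 and 1.

```python
from typing import Tuple

# Hypothetical box format: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def classify_by_shape(box: Box, char_max: int = 40,
                      line_ratio: float = 10.0, image_min: int = 200) -> str:
    """First stage: assign one of four candidate groups from size and ratio.

    All thresholds are illustrative assumptions, not values from the source.
    """
    w, h = box[2] - box[0], box[3] - box[1]
    if max(w, h) >= line_ratio * min(w, h):
        return "ruled_line"   # very elongated rectangle
    if max(w, h) <= char_max:
        return "character"    # small rectangle
    if min(w, h) >= image_min:
        return "image"        # large in both directions
    return "graphic"


def reclassify(label: str, confidence: float,
               low: float = 0.3, high: float = 0.8) -> str:
    """Second stage: swap character/graphic candidates by recognition confidence."""
    if label == "character" and confidence < low:
        return "graphic"
    if label == "graphic" and confidence > high:
        return "character"
    return label
```

For example, `reclassify(classify_by_shape((0, 0, 20, 20)), 0.1)` starts from a character candidate and demotes it to a graphic candidate because the assumed recognizer reported low confidence.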
However, such conventional character string extraction technologies cannot achieve adequate accuracy in character component extraction, and as a result, they also cannot achieve adequate accuracy in extracting the character strings themselves.
According to the first method, if respective basic components corresponding to different kinds of information, such as a character and a graphic, have similar sizes, the extraction of character components fails, and as a result, the extraction of character strings also fails. Therefore, in this case, an adequate accuracy cannot be obtained.
According to the second method, even in such a case, distinction between a character component and a non-character component can be improved by performing character recognition or character string recognition. However, at the current level of the character recognition technology, the reliability of the confidence degree itself of a character recognition result is not sufficient.
Therefore, the threshold value of the confidence degree needed to judge with high reliability that a basic component is a character component must be set to a value greatly different from the threshold value needed to judge with high reliability that a basic component is not a character component. As a result, it becomes difficult to judge a basic component whose confidence degree lies between the two threshold values, and if such a basic component is incorrectly judged to be, or not to be, a character component, adequate accuracy of character component extraction cannot be obtained.
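The difficulty with two widely separated thresholds can be illustrated with a short sketch. The function name `judge` and the threshold values `accept` and `reject` are assumptions chosen only to show the undecided band; they do not come from any cited method.

```python
def judge(confidence: float, accept: float = 0.9, reject: float = 0.2) -> str:
    """Three-way decision on a recognition confidence degree in [0, 1].

    `accept` and `reject` are hypothetical thresholds for a reliable
    "is a character" and a reliable "is not a character" judgment.
    """
    if confidence >= accept:
        return "character"
    if confidence <= reject:
        return "non-character"
    return "undecided"  # intermediate band where recognition alone cannot decide


# Every confidence degree strictly between the thresholds stays undecided.
labels = [judge(c) for c in (0.95, 0.5, 0.1)]
```

The further apart the two thresholds must be set, the more basic components fall into the undecided band, and forcing a decision there is what degrades extraction accuracy.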
It is an object of the present invention to provide a character string extraction apparatus for extracting a character string more accurately using basic components included in a document image and a method thereof.
The character string extraction apparatus according to the present invention comprises a basic component extraction unit, a character component extraction unit and a character string extraction unit.
The basic component extraction unit extracts an aggregate of a plurality of basic components from an input document image. The character component extraction unit judges whether a basic component corresponds to a character component using an inclusion relationship between basic components included in the aggregate of the basic components and extracts an aggregate of character components. The character string extraction unit extracts a character string using the aggregate of the character components.
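An inclusion relationship between basic components can be computed, for example, from their circumscribed rectangles, as in the sketch below. This passage does not state how the character component extraction unit uses the relationship to reach its judgment, so the code only builds the relation itself. The box format `(x0, y0, x1, y1)` and the names `contains` and `inclusion_map` are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` and the two differ."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def inclusion_map(boxes: List[Box]) -> Dict[int, List[int]]:
    """For each basic component index, list the indices of components it includes."""
    return {
        i: [j for j, inner in enumerate(boxes) if j != i and contains(outer, inner)]
        for i, outer in enumerate(boxes)
    }
```

For instance, a large frame that encloses two small rectangles maps to both of their indices, while each small rectangle maps to an empty list. A judgment rule built on this relation would then decide which of the components are character components.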