(1) Field of the Invention
The present invention relates to a method and an apparatus for character recognition used when a document such as a printed document, a hand-written document or the like, which is not converted into text data, is converted into text data.
(2) Related Art
Some character recognizing apparatuses for converting a printed document or a hand-written document into text data introduce post-processing to improve the rate of recognition: if the apparatus cannot accurately recognize a character in the document, it proposes a plurality of candidate characters and determines a correct character from among the plural candidate characters.
FIG. 45 is a block diagram showing a general character recognizing apparatus. Now, an operation of the general character recognizing apparatus will be described with reference to FIG. 45. An image inputting unit 10 captures a paper document, and converts it into image data in a form of bit map. A region dividing unit 31 divides the image data into a character region and a region of picture, graphics or the like other than the character region.
A character extracting unit 32 extracts one character from the divided character region, and supplies it to a character recognizing unit 33. The character recognizing unit 33 recognizes the character to convert it into character data, and makes a plurality of conversion candidate characters. When a process of recognizing all characters in the character region is completed, a post-processing unit 34 morphologically analyzes a sentence configured with a combination of the conversion candidate characters.
Namely, the post-processing unit 34 requests a dictionary searching unit 20 to search for a word as a search condition. The dictionary searching unit 20 searches for the given word in a word dictionary 40, and replies as to whether or not there is the word in the word dictionary 40. The post-processing unit 34 outputs the word as a correct word if the word exists in the word dictionary 40.
The character recognizing apparatus corrects a character improperly recognized by the character recognizing unit 33, using the dictionary, as above.
However, because the above character recognizing apparatus carries out the morphological analysis using the dictionary as post-processing, enormous labor and time are required to make and maintain a dictionary such as the word dictionary.
Further, the morphological analysis requires complex processes and a lot of time to configure and operate a system therefor, and it tends to make a lot of mistakes if there exists an unrecognizable word in the document.
In the light of the above problems, an object of the present invention is to provide a method and an apparatus for character recognition, which can accurately correct misrecognition, and whose system can be configured readily and within a short period of time.
The object of the present invention is achieved by providing a character recognizing method, comprising the steps of:
recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document;
selecting a series of search character images indicating a series of search input characters from the series of input character images;
selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
preparing registered text data indicating one or more registered documents;
searching the registered text data for one particular conversion candidate character string for each of the particular conversion candidate character strings to count an occurrence frequency of the particular conversion candidate character string in the registered text data for each of the particular conversion candidate character strings;
selecting a specific particular conversion candidate character string corresponding to the highest occurrence frequency among those of the particular conversion candidate character strings from the particular conversion candidate character strings; and
determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images.
The object of the present invention is also achieved by providing a character recognizing apparatus, comprising:
character recognizing means for recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, selecting a series of search character images indicating a series of search input characters from the series of input character images and selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
registered text data storing means for storing registered text data indicating one or more registered documents;
full text searching means for searching the registered text data stored by the registered text data storing means for one particular conversion candidate character string for each of the particular conversion candidate character strings recognized by the character recognizing means to count an occurrence frequency of the particular conversion candidate character string in the registered text data for each of the particular conversion candidate character strings;
post-processing means for selecting a specific particular conversion candidate character string corresponding to the highest occurrence frequency among those of the particular conversion candidate character strings counted by the full text searching means from the particular conversion candidate character strings recognized by the character recognizing means and determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images; and
registered text data outputting means for outputting the series of correct characters determined by the post-processing means as the series of search character images.
In the above steps and configuration, under circumstances where a character recognition cannot be correctly performed, an input character image indicating an input character is recognized as one or more conversion candidate characters for each of input character images indicating input characters. The conversion candidate characters denote candidates for the input character. Thereafter, a series of search character images is selected from the input character images, and a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images are produced from the particular conversion candidate characters by repeatedly selecting the series of particular conversion candidate characters corresponding to the series of search character images. Thereafter, the registered text data indicating one or more registered documents is searched for each particular conversion candidate character string. Therefore, an occurrence frequency of each particular conversion candidate character string in the registered text data can be counted. Thereafter, a specific particular conversion candidate character string corresponding to the highest occurrence frequency is selected, and a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string is determined as a series of correct characters. Therefore, the series of search character images can be correctly recognized as the series of correct characters.
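The steps above can be illustrated by the following minimal sketch in Python. It is not the implementation of the invention: it assumes that the registered text data is available as a single string, that every combination of conversion candidate characters is tried, and that overlapping occurrences are counted. The names `count_occurrences` and `best_candidate_string` and the sample data are hypothetical.

```python
from itertools import product


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def best_candidate_string(candidates_per_char, registered_text):
    """Return the candidate string with the highest occurrence frequency.

    candidates_per_char holds one list of conversion candidate characters
    per search character image (an assumed data layout).
    """
    strings = ["".join(chars) for chars in product(*candidates_per_char)]
    freqs = {s: count_occurrences(registered_text, s) for s in strings}
    # On a tie, max() keeps the first string in enumeration order.
    best = max(strings, key=lambda s: freqs[s])
    return best, freqs


# Hypothetical recognition result: three character images, two candidates each.
candidates = [["c", "e"], ["a", "o"], ["t", "l"]]
registered = "the cat sat on the mat; a cat is not a cot"
best, freqs = best_candidate_string(candidates, registered)
print(best, freqs[best])
```

In this sketch the full text search is a plain substring scan, so no word dictionary or morphological analysis is involved; a practical system would likely use an index over the registered documents.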
Accordingly, because the invention searches the registered text data for each particular conversion candidate character string, counts an occurrence frequency of each particular conversion candidate character string in the registered text data, and selects the specific particular conversion candidate character string corresponding to the highest occurrence frequency, it is unnecessary to prepare a dictionary such as a word dictionary, which requires a lot of labor, time and maintenance. It is also unnecessary to perform a morphological analysis, which involves complex processes, requires a lot of time to configure, and tends to make a lot of mistakes. As a result, it is possible to readily configure the character recognizing apparatus, and correctly recognize characters within a short period.
Also, because the invention performs a full text search rather than a keyword search, segmentation of a series of search input characters is not restricted by the concept of a word, a sentence, a clause or the like, and the series of search input characters can be freely set. It is therefore possible to set the speed or the accuracy of the post-processing as needed.
The object of the present invention is further achieved by the provision of a character recognizing method, comprising the steps of:
recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document;
calculating an evaluation value indicating a degree of certainty of one conversion candidate character for each of the conversion candidate characters corresponding to the input character images;
selecting one or more particular conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters from the conversion candidate characters corresponding to one input character image for each of the input character images;
selecting a series of search character images indicating a series of search input characters from the series of input character images;
selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
preparing registered text data indicating one or more registered documents;
searching the registered text data for one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string that frequently occurs in the registered text data from the particular conversion candidate character strings; and
determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images.
The object of the present invention is also achieved by providing a character recognizing apparatus, comprising:
character recognizing means for recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, calculating an evaluation value indicating a degree of certainty of one conversion candidate character for each of the conversion candidate characters corresponding to the input character images, selecting one or more particular conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters from the conversion candidate characters corresponding to one input character image for each of the input character images, selecting a series of search character images indicating a series of search input characters from the series of input character images and selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
registered text data storing means for storing registered text data indicating one or more registered documents;
full text searching means for searching the registered text data stored by the registered text data storing means for one particular conversion candidate character string for each of the particular conversion candidate character strings produced by the character recognizing means to obtain a full text search result;
post-processing means for selecting a specific particular conversion candidate character string that frequently occurs in the registered text data from the particular conversion candidate character strings according to the full text search result obtained by the full text searching means and determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images; and
registered text data outputting means for outputting the series of correct characters determined by the post-processing means as the series of search character images.
In the above steps and configuration, an evaluation value indicating a degree of certainty of one conversion candidate character is calculated for each of the conversion candidate characters, and one or more particular conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters are selected from the conversion candidate characters for each of the input character images. Thereafter, the registered text data is searched for each particular conversion candidate character string, a specific particular conversion candidate character string that frequently occurs in the registered text data is selected from the particular conversion candidate character strings, and a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string is determined as a series of correct characters. Therefore, the series of search character images can be correctly recognized as the series of correct characters.
Accordingly, because the invention calculates an evaluation value of one conversion candidate character for each of the conversion candidate characters to select one or more particular conversion candidate characters corresponding to the higher evaluation values from the conversion candidate characters for each of the input character images, the number of particular conversion candidate character strings can be reduced, and the time required for a full text searching operation can be reduced. Also, because a specific particular conversion candidate character string that frequently occurs in the registered text data is selected from the particular conversion candidate character strings, it is unnecessary to prepare a dictionary such as a word dictionary, which requires a significant amount of labor, time and maintenance. It is also unnecessary to perform a morphological analysis, which involves complex processes, requires a significant amount of time to configure, and tends to make a lot of mistakes. As a result, it is possible to readily configure the character recognizing apparatus, and to correct misrecognized characters within a short period.
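The reduction in the number of particular conversion candidate character strings can be illustrated by the following minimal sketch in Python. The evaluation values, the choice of keeping two candidates per input character image, and the name `select_top_candidates` are assumptions made only for illustration.

```python
from math import prod


def select_top_candidates(scored, keep=2):
    """Keep the `keep` candidates with the highest evaluation values.

    scored is a list of (character, evaluation value) pairs for one
    input character image (an assumed data layout).
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [char for char, _ in ranked[:keep]]


# Hypothetical evaluation values for three input character images.
scored_candidates = [
    [("c", 0.9), ("e", 0.7), ("o", 0.2)],
    [("a", 0.8), ("o", 0.6), ("u", 0.1)],
    [("t", 0.95), ("l", 0.5), ("f", 0.3)],
]

pruned = [select_top_candidates(s) for s in scored_candidates]

# Number of candidate strings to search before and after the selection.
strings_before = prod(len(s) for s in scored_candidates)
strings_after = prod(len(p) for p in pruned)
print(pruned, strings_before, strings_after)
```

Because the number of candidate strings is the product of the candidate counts per character image, dropping even one low-valued candidate per image shrinks the full text search workload considerably.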
Preferably, the step of searching the registered text data comprises the step of searching the registered text data and the input document for one particular conversion candidate character string for each of the particular conversion candidate character strings to count an occurrence frequency of the particular conversion candidate character string in the registered text data and the input document for each of the particular conversion candidate character strings.
In accordance with the invention, it is also preferred that the step of searching the registered text data comprise the step of searching the registered text data and the input document for one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string that frequently occurs in the registered text data and the input document from the particular conversion candidate character strings.
In accordance with the invention, it is further preferred that the step of searching the registered text data comprise the steps of:
searching the registered text data for one particular conversion candidate character string for each of the particular conversion candidate character strings to count a first occurrence frequency of the particular conversion candidate character string in the registered text data for each of the particular conversion candidate character strings;
determining a threshold value lower than the highest first occurrence frequency by a prescribed value;
selecting one or more first selected conversion candidate character strings corresponding to the first occurrence frequencies equal to or higher than the threshold value among those of the particular conversion candidate character strings from the particular conversion candidate character strings;
searching the input document for one first selected conversion candidate character string for each of the first selected conversion candidate character strings to count a second occurrence frequency of the first selected conversion candidate character string in the input document for each of the first selected conversion candidate character strings; and
selecting a specific particular conversion candidate character string corresponding to the highest second occurrence frequency among those of the first selected conversion candidate character strings from the first selected conversion candidate character strings.
In the above steps, because the input document itself is used for a full text searching operation, tendencies of the input document, such as the words and grammar used in it, can be reflected in the character recognition, and an unregistered word not found in any registered document can still be found, since the unregistered word very likely appears in its own document. Therefore, a rate of character recognition can be improved.
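The two-stage selection using a first occurrence frequency, a threshold value, and a second occurrence frequency can be sketched in Python as follows. This is an illustration under stated assumptions: the text of the input document is taken to be a provisional recognition result held as a string, `str.count` (non-overlapping matches) stands in for the full text search, and the name `two_stage_select` and the parameter `margin` (the prescribed value) are hypothetical.

```python
def two_stage_select(strings, registered_text, input_text, margin):
    """Shortlist by frequency in registered text, then decide by the input document."""
    # First occurrence frequency: counted in the registered text data.
    first = {s: registered_text.count(s) for s in strings}
    # Threshold lower than the highest first occurrence frequency by `margin`.
    threshold = max(first.values()) - margin
    shortlisted = [s for s in strings if first[s] >= threshold]
    # Second occurrence frequency: counted in the input document itself.
    second = {s: input_text.count(s) for s in shortlisted}
    chosen = max(shortlisted, key=lambda s: second[s])
    return chosen, shortlisted


# Hypothetical data: "cot" is more frequent in the registered text,
# but the input document itself uses "cat".
strings = ["cat", "cot", "cal"]
registered = "cat cot cat cot cot"
input_document = "the cat and the other cat"
chosen, shortlisted = two_stage_select(strings, registered, input_document, margin=1)
print(chosen, shortlisted)
```

Here the registered text alone would favor "cot", while the second stage lets the vocabulary of the input document decide among the shortlisted strings.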
In accordance with the invention, it is preferred that the step of recognizing an input character image include the steps of:
determining a character image position of each input character image in the input document; and
extracting each input character image from the input document according to the character image position, and that the step of calculating an evaluation value include the steps of:
again determining a second character image position of the input character image in cases where all evaluation values of the conversion candidate characters corresponding to the input character image supposed to be placed at the first character image position are lower than a threshold value and again extracting the input character image supposed to be placed at the second character image position from the input document,
again recognizing the input character image placed at the second character image position as one or more conversion candidate characters; and
again calculating an evaluation value of each conversion candidate character corresponding to the input character image placed at the second character image position.
In the above steps, in cases where all evaluation values of the conversion candidate characters corresponding to the input character image supposed to be placed at the first character image position are lower than a threshold value, it is judged that the input character image was extracted from the input document according to a character image position not correctly indicating a position thereof. Therefore, a second character image position is determined, the input character image is again extracted from the input document according to the second character image position and is recognized as one or more conversion candidate characters, and an evaluation value of each conversion candidate character is again calculated.
Accordingly, even though an input character image is incorrectly extracted from the input document, the input character image can be again extracted from the input document, so that a rate of character recognition can be improved.
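The re-extraction described above can be sketched in Python as follows. The sketch generalizes the first and second character image positions to an ordered list of trial positions, and it uses a toy string in place of image data; the recognizer `fake_recognize`, its lookup table, and the function `recognize_with_retry` are hypothetical stand-ins, not part of the specification.

```python
def recognize_with_retry(image_row, positions, recognize, threshold):
    """Try each character image position until some evaluation value reaches the threshold.

    positions is an ordered list of (start, end) slices to try; recognize
    returns a list of (character, evaluation value) pairs for a segment.
    """
    result = []
    for start, end in positions:
        result = recognize(image_row[start:end])
        if any(score >= threshold for _, score in result):
            return (start, end), result
    # No position gave a confident candidate: return the last attempt.
    return positions[-1], result


def fake_recognize(segment):
    """Hypothetical recognizer: looks up a segment of a toy 'image' string."""
    table = {
        "|": [("l", 0.3), ("1", 0.25)],   # half of a character: low certainty
        "|)": [("D", 0.9), ("b", 0.4)],   # whole character: high certainty
    }
    return table.get(segment, [])


row = "|)"
position, result = recognize_with_retry(row, [(0, 1), (0, 2)], fake_recognize, threshold=0.5)
print(position, result)
```

At the first position only the left stroke is extracted and every evaluation value falls below the threshold, so the segment is extracted again at the second position and recognized with a higher evaluation value.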
In accordance with the invention, it is preferred that:
the step of recognizing an input character image includes the steps of specifying a plurality of character regions existing in the input document; and
extracting each of the input character images from the character regions; that
the step of selecting a series of search character images includes the step of combining one or a series of particular input character images extracted from a final portion of one character region and one or a series of particular input character images extracted from a top portion of another character region into the series of search character images, for each pair of character regions, and that
the step of determining a series of specific particular conversion candidate characters includes the step of coupling a first character region and a second character region together in that order, in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string, for each specific particular conversion candidate character string.
In accordance with the invention, it is also preferred that the character recognizing apparatus further comprise:
character extracting means for specifying a plurality of character regions existing in the input document and extracting each of the input character images from the character regions, wherein one or a series of particular input character images extracted from a final portion of one character region and one or a series of particular input character images extracted from a top portion of another character region are combined into the series of search character images by the character recognizing means, for each pair of character regions; and
region coupling means for coupling a first character region and a second character region extracted by the character extracting means together in that order, in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string by the post-processing means, for each specific particular conversion candidate character string.
In the above steps and configuration, even though a character area of the input document is divided into a plurality of character regions, because a first character region and a second character region are coupled together in that order in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string, the character regions can be coupled together in a correct order.
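The coupling of character regions can be sketched in Python as follows. As a simplification, each region is assumed to hold a single already recognized string rather than a set of conversion candidate characters, the boundary string is built from the last and first `n` characters, and only the single best pair is returned; the names `boundary_score` and `couple_regions` are hypothetical.

```python
def boundary_score(first_region, second_region, registered_text, n=2):
    """Frequency of the string joining the tail of one region to the head of another."""
    probe = first_region[-n:] + second_region[:n]
    return registered_text.count(probe)


def couple_regions(regions, registered_text, n=2):
    """Return (i, j) meaning region i is followed by region j, or None if no match."""
    best_pair, best = None, 0
    for i, a in enumerate(regions):
        for j, b in enumerate(regions):
            if i == j:
                continue
            score = boundary_score(a, b, registered_text, n)
            if score > best:
                best_pair, best = (i, j), score
    return best_pair


# Hypothetical regions extracted in the wrong order from a two-column layout.
regions = ["nition apparatus", "character recog"]
registered = "a character recognition apparatus recognizes characters"
order = couple_regions(regions, registered)
print(order)
```

The boundary string "ogni" formed from the end of the second region and the start of the first occurs in the registered text, so the regions are coupled in that order.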
In accordance with the invention, it is preferred that the step of searching the registered text data comprise the steps of:
selecting a plurality of shortened conversion candidate character strings corresponding to a series of search character images from the particular conversion candidate character strings;
searching the registered text data for one shortened conversion candidate character string for each of the shortened conversion candidate character strings to count an occurrence frequency of the shortened conversion candidate character string in the registered text data for each of the shortened conversion candidate character strings;
selecting a specific shortened conversion candidate character string corresponding to the highest occurrence frequency among those of the shortened conversion candidate character strings from the shortened conversion candidate character strings;
producing a plurality of particular conversion candidate character strings respectively including the specific shortened conversion candidate character string and corresponding to the series of search character images; and
searching the registered text data for each particular conversion candidate character string to count an occurrence frequency of each particular conversion candidate character string in the registered text data.
In the above steps, because a full text searching operation for each shortened conversion candidate character string is performed before a full text searching operation for each particular conversion candidate character string, the time required for the full text searching operation can be shortened, and characters can be correctly recognized in a short time.
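The staged search can be sketched in Python as follows. It assumes that the shortened conversion candidate character strings are built from the first `short_len` character images of the series and that `str.count` stands in for the full text search; the name `staged_search` and the sample data are hypothetical.

```python
from itertools import product


def staged_search(candidates_per_char, registered_text, short_len):
    """Search shortened strings first, then only full strings that contain the winner."""
    head = candidates_per_char[:short_len]
    tail = candidates_per_char[short_len:]

    # Stage 1: shortened conversion candidate character strings.
    shortened = ["".join(c) for c in product(*head)]
    best_short = max(shortened, key=registered_text.count)

    # Stage 2: full strings that include the specific shortened string.
    full = [best_short + "".join(c) for c in product(*tail)]
    best_full = max(full, key=registered_text.count)

    searches = len(shortened) + len(full)
    return best_full, searches


# Hypothetical candidates for four character images, two candidates each.
candidates = [["c", "e"], ["a", "o"], ["t", "l"], ["s", "5"]]
registered = "cats and cots; cats chase rats"
best_full, searches = staged_search(candidates, registered, short_len=2)
print(best_full, searches)
```

With four images and two candidates each, an exhaustive search would issue 16 queries; the staged search issues 4 for the shortened strings and 4 for the extended strings.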
In accordance with the invention, it is preferred that the step of recognizing an input character image include the step of specifying an input attribute of the input document, that
the step of preparing registered text data includes the step of classifying the registered text data into the plurality of registered documents respectively specified by a registered attribute, and that
the step of searching the registered text data comprises the steps of selecting one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document, from the registered documents; and searching the particular registered documents for one particular conversion candidate character string for each of the particular conversion candidate character strings to count an occurrence frequency of the particular conversion candidate character string in the registered text data for each of the particular conversion candidate character strings.
In accordance with the invention, it is also preferred that the step of recognizing an input character image include the step of specifying an input attribute of the input document, that
the step of preparing registered text data includes the step of classifying the registered text data into the plurality of registered documents respectively specified by a registered attribute, and that
the step of searching the registered text data comprises the steps of:
selecting one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document, from the registered documents; and
searching the particular registered documents for one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string that frequently occurs in the registered text data from the particular conversion candidate character strings.
In the above steps, because the invention searches particular registered documents, having the same attribute as that of the input document, for each particular conversion candidate character string, character recognition reflecting the attribute of the input document can be performed, and characters can be recognized more accurately.
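The attribute-restricted search can be sketched in Python as follows. The representation of a registered document as a dictionary with `attribute` and `text` keys, the attribute labels, and the name `search_by_attribute` are assumptions made for illustration.

```python
def search_by_attribute(strings, registered_docs, input_attribute):
    """Count candidate strings only in registered documents sharing the input attribute."""
    selected = [d for d in registered_docs if d["attribute"] == input_attribute]
    freqs = {s: sum(d["text"].count(s) for d in selected) for s in strings}
    best = max(strings, key=lambda s: freqs[s])
    return best, freqs


# Hypothetical registered documents classified by attribute.
registered_docs = [
    {"attribute": "medical", "text": "the cell wall and the cell membrane"},
    {"attribute": "sports", "text": "call the ball, call the foul, call time"},
]

best, freqs = search_by_attribute(["cell", "call"], registered_docs, "medical")
print(best, freqs)
```

Searched without the attribute restriction, "call" would be more frequent overall; restricting the search to documents with the same attribute as the input document selects "cell".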
It is preferred that the step of preparing registered text data include the step of preparing pieces of misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, and that
the step of searching the registered text data comprise the steps of searching the misrecognized character strings of the pieces of misrecognition data for one particular conversion candidate character string for each of the particular conversion candidate character strings;
recognizing the series of search character images as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the pieces of misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings; and
searching the registered text data for one particular conversion candidate character string for each of the particular conversion candidate character strings, in cases where any particular conversion candidate character string does not exist in the misrecognized character strings, to count an occurrence frequency of the particular conversion candidate character string in the registered text data for each of the particular conversion candidate character strings.
In accordance with the invention, it is also preferred that the step of preparing registered text data include the step of preparing pieces of misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, and that
the step of searching the registered text data comprise the steps of searching the misrecognized character strings of the pieces of misrecognition data for one particular conversion candidate character string for each of the particular conversion candidate character strings;
recognizing the series of search character images as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the pieces of misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings; and
searching the registered text data for one particular conversion candidate character string for each of the particular conversion candidate character strings, in cases where none of the particular conversion candidate character strings exists in the misrecognized character strings, to select a specific particular conversion candidate character string that frequently occurs in the registered text data from the particular conversion candidate character strings.
In the above steps, the misrecognized character strings of the pieces of misrecognition data are searched for each particular conversion candidate character string. In cases where one particular conversion candidate character string exists in the misrecognized character strings, the series of search character images is recognized as a series of correct characters composing a correct character string corresponding to the particular conversion candidate character string in the pieces of misrecognition data.
Accordingly, because the invention does not require a searching operation for the registered text data in cases where one particular conversion candidate character string exists in the misrecognized character strings, character recognition can be performed in a short time.
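The use of misrecognition data ahead of the full text search can be sketched in Python as follows. The pieces of misrecognition data are assumed to be held as a mapping from a misrecognized character string to its correct character string and to be matched exactly; the name `correct_with_misrecognition_data` and the sample entries are hypothetical.

```python
def correct_with_misrecognition_data(strings, misrecognition_data, registered_text):
    """Consult misrecognition data first; search registered text only on a miss."""
    for s in strings:
        if s in misrecognition_data:
            # Known misrecognition: return the stored correct string directly.
            return misrecognition_data[s], "misrecognition data"
    # No candidate string is a known misrecognition: fall back to frequency counting.
    best = max(strings, key=registered_text.count)
    return best, "registered text"


# Hypothetical piece of misrecognition data: "rn" misread in place of "m".
misrecognition_data = {"rnodern": "modern"}
registered = "cat cat cot"

hit = correct_with_misrecognition_data(["modem", "rnodern"], misrecognition_data, registered)
miss = correct_with_misrecognition_data(["cat", "cot"], misrecognition_data, registered)
print(hit, miss)
```

In the first call the registered text is never searched, which corresponds to the time saving described above.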
In accordance with the invention, it is preferred that the character recognizing apparatus further comprise:
layout storing means for storing an input layout of the input character images of the input document recognized by the character recognizing means; and
displaying means for displaying a corrected document, which is obtained by replacing the series of search character images of the input document selected by the character recognizing means with the series of correct characters determined by the post-processing means, in the input layout of the input document stored by the layout storing means.
In the above configuration, input character images of the input document are changed to correct characters to obtain a correct document, and the correct document is displayed in the same layout as that of the input document.
Accordingly, the invention makes it possible to display the correct document in a layout resembling that of the input document, so that the user can easily view the correct document. In addition, because text data is displayed rather than mere image data, it is possible to easily edit the correct document.
In accordance with the invention, it is preferred that the step of selecting a series of search character images comprise the steps of detecting a series of particular input character images sandwiched by a pair of partition symbols from the series of input character images of the input document; and
setting the series of particular input character images as the series of search character images indicating the series of search input characters.
In the above steps, because a series of particular input character images sandwiched by a pair of partition symbols is set as the series of search character images, the series of search input characters indicated by the series of particular input character images has a meaning.
Accordingly, the invention makes it possible to recognize characters according to characteristics of a language used in the input document while avoiding a meaningless character string which is a part of a word.
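The selection of search units delimited by partition symbols can be sketched in Python as follows. The sketch operates on a provisional text string rather than on character images, and the particular set of partition symbols, the constant `PARTITION_SYMBOLS`, and the function `search_units` are assumptions; the appropriate symbols depend on the language of the input document.

```python
import re

# Assumed set of partition symbols (spaces and common punctuation).
PARTITION_SYMBOLS = " ,.;:!?()"


def search_units(text):
    """Split text into character series sandwiched by partition symbols."""
    pattern = "[" + re.escape(PARTITION_SYMBOLS) + "]+"
    return [unit for unit in re.split(pattern, text) if unit]


units = search_units("recog nition, (OCR) works.")
print(units)
```

Each returned unit would then serve as one series of search input characters for the full text search, so that fragments cutting through the middle of a word are avoided.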