1. Field of the Invention
Exemplary embodiments of the present invention relate to a method, system, and computer readable recording medium for correcting an optical character recognition (OCR) result. More specifically, exemplary embodiments relate to a method, system, and computer readable recording medium that, in providing a character recognition result, remove all carriage returns except those indicating the start or end of a paragraph and correct word spacing using a Hidden Markov Model (HMM) or the like, thereby providing a character recognition result that reflects grammatically correct word spacing and contains no unnecessary carriage returns.
2. Discussion of the Background
As the Internet is widely used and various types of information are distributed through information communication networks, people increasingly depend on the Internet, which functions as an information acquiring means.
In particular, efforts are being made to recognize character information included in information that exists in the form of an image or a moving image and to convert that character information into machine-readable information, so that Internet users may use it more easily. For example, various character recognition techniques for analyzing characters in an image and converting them into machine-readable text information have been developed and used. Among these techniques, the optical character recognition (OCR) technique is widely used.
FIG. 1a is a view showing an example of an image of an optical character recognition target, and FIG. 1b is a view showing a result of performing character recognition on the image of FIG. 1a according to a conventional optical character recognition technique.
In conventional optical character recognition, a result as shown in FIG. 1b is output by analyzing an area of the image shown in FIG. 1a that includes characters and recognizing machine-readable characters from that area. Editable or modifiable text information may be obtained using such a character recognition technique.
However, according to the conventional optical character recognition technique, as shown in FIG. 1a and FIG. 1b, character information that appears on different lines in the original image information may be presented on different lines in the corresponding character recognition result. That is, even where a sentence or paragraph is not yet complete in the original image information, it may be split across lines depending on the size of the area including the character information. The conventional optical character recognition technique treats every such split line as ending in a carriage return and outputs a result in which “Enter” characters are inserted between the lines. Accordingly, a sentence or paragraph may be split into many lines because the original text is narrow, as shown in FIG. 1a. For example, the character information of “Istanbul (historically also known as” and the character information of “Byzantium and Constantinople) is the largest” belong to one connected sentence whose length can be expressed in one line of a general word processor document. Nevertheless, as shown in the conventional recognition result of FIG. 1b, they appear on different lines when they are copied and pasted into a word processor document or the like.
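The line-splitting behavior described above can be illustrated with a minimal Python sketch. The function name `join_wrapped_lines` and the rule it applies (a blank line marks a paragraph boundary, and every other line break is treated as layout wrapping) are assumptions made for illustration only; they are not taken from the conventional technique or from the exemplary embodiments.

```python
def join_wrapped_lines(ocr_text: str) -> str:
    """Merge lines that were split only by the width of the source image.

    Assumed heuristic (hypothetical): a blank line separates paragraphs;
    any other line break is a wrap and is replaced by a single space.
    """
    paragraphs = []
    current = []
    for line in ocr_text.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: close the paragraph collected so far, if any.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    # One carriage return per paragraph boundary, none inside a paragraph.
    return "\n".join(paragraphs)


# Conventional result: one "Enter" per image line, as in FIG. 1b.
conventional_result = (
    "Istanbul (historically also known as\n"
    "Byzantium and Constantinople) is the largest\n"
)
print(join_wrapped_lines(conventional_result))
```

With this input, the two fragments quoted above are emitted as a single line, which is what a user pasting the text into a word processor would typically expect.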
Furthermore, if the image information of a character recognition target includes a plurality of pages, although the last information of a previous page and the first information of a next page may be information included in one sentence or paragraph, a carriage return may be recognized between the two pieces of information, and they may be outputted in different lines in a character recognition result.
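The multi-page case can be sketched in the same way. The helper `merge_pages` below is hypothetical, and its rule is a deliberately crude assumption: a page whose text ends in sentence-terminal punctuation is treated as ending a paragraph, and any other page is treated as continuing onto the next page. A practical implementation would likely need stronger cues than punctuation alone.

```python
SENTENCE_END = (".", "!", "?")  # assumed terminal punctuation marks


def merge_pages(pages: list[str]) -> str:
    """Concatenate per-page OCR text without a spurious break at page edges.

    Assumed heuristic (hypothetical): if the text collected so far does not
    end in terminal punctuation, the next page continues the same sentence.
    """
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip pages with no recognized characters
        if not merged:
            merged = text
        elif merged.endswith(SENTENCE_END):
            merged += "\n" + text  # keep the break: previous text looks complete
        else:
            merged += " " + text   # drop the break: sentence spans the page edge
    return merged


pages = ["The city is the", "largest in the country.", "Next paragraph."]
print(merge_pages(pages))
```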
When a user edits or modifies such a character recognition result using a word processor or the like, the user may find it necessary to delete a large number of “Enter” characters (i.e., carriage returns) from the conventional recognition result.
Another problem is that conventional OCR techniques may produce improper word spacing in a character recognition result. Thus, it may be beneficial to develop a technique for outputting a character recognition result in which unnecessary carriage returns are removed and correct word spacing is reflected.