1. Field of the Invention
The present invention relates to a technique for determining, by computer operations, the reading order of a set of characters extracted from image data by character recognition processing. In particular, the present invention relates to a technique for properly determining the reading order of characters even after a modification is made to correct a character recognition error.
2. Description of Related Art
When text data acquired by an optical character reader (OCR) contains a character recognition error, the reading order of characters must be edited in accordance with the modification of the character regions. When the error is a recognition error in units of characters and is corrected by integrating or dividing character regions, a new reading order of characters can be determined by computer operations that utilize the orders assigned to the character regions before modification.
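As an illustration of how the orders assigned before modification can carry over, the following is a minimal sketch that assumes the recognized characters are held in a list whose index is the reading order. The function names `integrate` and `divide` are hypothetical and do not come from any cited publication.

```python
from typing import List

def integrate(chars: List[str], start: int, end: int, merged: str) -> List[str]:
    """Merge the regions chars[start:end] into one region.

    Assumption: the merged region takes the position of the regions it
    replaces, so every other character keeps its relative order.
    """
    return chars[:start] + [merged] + chars[end:]

def divide(chars: List[str], index: int, parts: List[str]) -> List[str]:
    """Split the region at `index` into several regions.

    Assumption: the parts are already given in their own reading order and
    occupy the position of the region they replace.
    """
    return chars[:index] + list(parts) + chars[index + 1:]

# "m" was misrecognized as the two characters "r" and "n": integrate them.
fixed = integrate(["c", "o", "r", "n", "e"], 2, 4, "m")

# "rn" was misrecognized as the single character "m": divide it.
split = divide(["c", "o", "m", "e"], 2, ["r", "n"])
```

In both cases the new order is derived purely from positions that existed before the modification, which is why no position is available for a region that was never recognized.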
Japanese Patent Publication 2008-225964A discloses a technique used as preprocessing for OCR. In this technique, a region to be processed is divided according to a predetermined identification condition so that image regions are set, and a reading order is set for each region. When a user instructs a modification to integrate regions, the regions before modification that overlap with the newly created region are searched for, and the reading order assigned to the detected region having the largest overlap area is inherited as the reading order of the newly integrated region.
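The overlap-based inheritance described above can be sketched as follows. This is only an illustration under stated assumptions: regions are axis-aligned rectangles given as (left, top, right, bottom), and the names `Region`, `overlap_area`, and `inherited_order` are hypothetical rather than taken from the cited publication.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout

class Region(NamedTuple):
    box: Box
    order: int  # reading order assigned before modification

def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two rectangles, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0

def inherited_order(new_box: Box, old_regions: List[Region]) -> Optional[int]:
    """Return the order of the old region with the largest overlap, if any."""
    best = max(old_regions, key=lambda r: overlap_area(new_box, r.box), default=None)
    if best is None or overlap_area(new_box, best.box) == 0:
        return None  # nothing overlaps, so there is no order to inherit
    return best.order

old = [Region((0, 0, 10, 10), 1), Region((10, 0, 30, 10), 2)]
merged_order = inherited_order((5, 0, 30, 10), old)       # overlaps both regions
isolated_order = inherited_order((100, 100, 110, 110), old)  # overlaps neither
```

Here the merged box overlaps the second region more than the first, so it inherits order 2, while the isolated box yields no order at all.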
The technique for automatically correcting the reading order disclosed by Japanese Patent Publication 2008-225964A, however, is based on the premise that a newly created region overlaps with a region that existed before correction. For that reason, even if the above described automatic correction technique is applied to correcting the reading order of characters when an OCR character recognition error is corrected, the text sequence must still be edited manually when the correction newly adds a region that was not recognized at all, as in the case of an omitted character.
The present invention solves the above described problems and aims to provide a technique for determining a reading order of characters by computer operations that is applicable to a modification in which a character region that has not been recognized is added. It is a further object of the present invention to provide a technique for determining the reading order of characters by computer operations that can be applied to all types of modification, including integration, division, and new insertion of a character region.
The present invention prepares in advance a list of line information arranged in the alignment order of lines. Each item of line information is made up of a line box, which surrounds a set of characters continuously aligned in the same direction in the image data, and the alignment direction of the characters in that line box. The reading order of a character in a character region contained in a line box is then determined according to the alignment direction of characters in that line box. Therefore, according to the present invention, the reading order of characters after modification can be determined based on the alignment direction of characters in the line box containing the modified region, whether the modification is an integration, a division, or a new insertion of a character region. Other advantageous effects of the present invention will be understood from the description of each embodiment.
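The line-information approach can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes rectangles given as (left, top, right, bottom), a horizontal line read left to right, a vertical line read top to bottom, and containment decided by the center point of the character region. The names `LineInfo`, `insert_character`, and `reading_order` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout

@dataclass
class LineInfo:
    box: Box        # line box surrounding the characters of one line
    direction: str  # "horizontal" or "vertical" (assumed encoding)
    chars: List[Tuple[str, Box]] = field(default_factory=list)

def contains(outer: Box, inner: Box) -> bool:
    """True if the center of `inner` lies inside `outer` (assumed criterion)."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]

def insert_character(lines: List[LineInfo], text: str, box: Box) -> bool:
    """Add a new character region to the line box that contains it."""
    for line in lines:
        if contains(line.box, box):
            line.chars.append((text, box))
            # Order within the line follows its alignment direction:
            # left edge for horizontal lines, top edge for vertical lines.
            axis = 0 if line.direction == "horizontal" else 1
            line.chars.sort(key=lambda c: c[1][axis])
            return True
    return False  # no line box contains the new region

def reading_order(lines: List[LineInfo]) -> str:
    """Concatenate characters line by line, in the listed order of lines."""
    return "".join(text for line in lines for text, _ in line.chars)

lines = [
    LineInfo((0, 0, 100, 20), "horizontal",
             [("A", (0, 0, 10, 20)), ("C", (40, 0, 50, 20))]),
    LineInfo((0, 30, 20, 130), "vertical",
             [("X", (0, 30, 20, 50)), ("Z", (0, 90, 20, 110))]),
]
added_b = insert_character(lines, "B", (20, 0, 30, 20))  # omitted character, line 1
added_y = insert_character(lines, "Y", (0, 60, 20, 80))  # omitted character, line 2
result = reading_order(lines)
```

Because the position of a new region is resolved against the line box rather than against previously recognized character regions, a region with no overlapping predecessor can still be placed in the sequence.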