Optical Character Recognition (OCR) systems are known. These systems automatically convert a paper document into a searchable text document. OCR systems typically comprise three main steps: line segmentation, feature extraction and character classification. However, as illustrated in FIG. 1, feature extraction is often presented as part of the character classification. Thus, starting from an image of a character string, known optical character recognition systems first apply line segmentation to obtain images of individual characters, and subsequently execute a character classification step to identify the characters. While character classification techniques have become extremely robust in recent years, line segmentation remains a critical step of OCR, in particular in the case of Asian text.
Different approaches to line segmentation (also often called character segmentation) exist. In each of them, the image representing a text line is decomposed into individual sub-images which constitute the character images. A known line segmentation method is the detection of inter-character breaks or word breaks (adapted to Latin characters) as a way to isolate individual characters. This is described for example in WO2011128777 and WO201126755.
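By way of illustration only, break detection can be sketched with a vertical projection profile: columns of the line image that contain no ink are treated as inter-character breaks. This is a minimal sketch and not the method of the cited publications. The function name `segment_by_gaps` and the image representation (a list of rows of 0/1 values, where 1 denotes ink) are assumptions made for the example.

```python
def segment_by_gaps(image):
    """Split a binarized text-line image at empty columns.

    image: list of rows, each a list of 0/1 values (1 = ink). Assumed format.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                      # a run of inked columns begins
        elif ink == 0 and start is not None:
            segments.append((start, x))    # an empty column ends the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


# Toy line image containing three glyphs separated by empty columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_by_gaps(line))
```

Such a sketch only works when every pair of adjacent characters is separated by at least one empty column, which is the assumption that tends to hold for printed Latin text.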
Another known line segmentation method, described for example in WO2011142977, uses chop lines which are processed afterwards to identify the lines that separate characters. Still other methods, such as the one described in EP0138445B1, assume a constant pitch between characters.
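The constant-pitch assumption can be illustrated with a short sketch that cuts the line into cells of equal width. This is not the method of EP0138445B1. The function name `segment_constant_pitch` and its parameters are hypothetical, and the pitch is assumed to be known in advance rather than estimated from the image.

```python
def segment_constant_pitch(line_width, pitch, offset=0):
    """Cut a text line into equally wide character cells.

    line_width: width of the line image in pixels.
    pitch: assumed constant character pitch in pixels (hypothetical input).
    offset: column at which the first character cell starts.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    if pitch <= 0:
        raise ValueError("pitch must be positive")
    cells = []
    x = offset
    while x < line_width:
        # The last cell is clipped to the right edge of the line.
        cells.append((x, min(x + pitch, line_width)))
        x += pitch
    return cells


print(segment_constant_pitch(10, 4))
```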
The line segmentation methods described above are known as dissection methods. This type of method is less efficient for Asian text and for Asian text combined with Latin text, because in such text there is often no clear break or pitch between characters, and Asian characters are mostly made not of a single connected component but of several connected components (e.g. radicals for Chinese characters).
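The point about connected components can be made concrete with a small sketch that counts the 4-connected ink components of a glyph image. The function name `count_components` and the toy glyphs are assumptions made for the example. A glyph drawn as two separate horizontal strokes yields two components, so a dissection method that treats each component as one character would split it.

```python
def count_components(image):
    """Count 4-connected ink components in a binarized glyph image.

    image: list of rows, each a list of 0/1 values (1 = ink). Assumed format.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] and not seen[y][x]:
                count += 1
                # Flood fill from this pixel to mark its whole component.
                stack = [(y, x)]
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


# Toy glyph made of two disconnected horizontal strokes.
two_strokes = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
# Toy glyph made of a single connected stroke.
one_stroke = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
print(count_components(two_strokes), count_components(one_stroke))
```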
A second type of line segmentation method is based on the recognition of components in the image that match classes in a particular alphabet. Such methods, however, require long computation times.
A third type of segmentation technique combines the first two and is known as the “oversegmentation” method. The image is oversegmented with different dissection methods, as illustrated in FIG. 2. Several plausible segmentation solutions are analyzed by the same or different character classification methods, and the best segmentation solution is then chosen. When segmentation becomes difficult, as is the case for example for Asian characters, many possible segmentation solutions have to be evaluated, which leads to extremely long computation times for analyzing the input string image.
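The cost of evaluating many segmentation solutions can be illustrated with a minimal sketch that exhaustively enumerates every way of keeping or discarding the candidate cuts produced by oversegmentation. The function names, the mean-score selection rule and the toy scoring function (which stands in for a character classifier) are all assumptions made for the example and do not describe any particular known system. With n interior candidate cuts there are 2^n segmentation solutions, each requiring classifier evaluations.

```python
from itertools import combinations


def enumerate_segmentations(cuts):
    """Yield every segmentation obtainable from a list of candidate cuts.

    cuts: sorted column positions, including both ends of the line.
    Each segmentation is a list of (start, end) ranges covering the line.
    """
    if len(cuts) < 2:
        raise ValueError("cuts must contain both ends of the line")
    first, last = cuts[0], cuts[-1]
    interior = cuts[1:-1]
    # Each interior cut is either kept or dropped: 2**len(interior) cases.
    for k in range(len(interior) + 1):
        for kept in combinations(interior, k):
            bounds = [first, *kept, last]
            yield list(zip(bounds[:-1], bounds[1:]))


def best_segmentation(cuts, score_segment):
    """Return the segmentation with the highest mean segment score.

    score_segment(start, end) is a hypothetical stand-in for a character
    classifier that returns a confidence for the candidate character.
    """
    best, best_score = None, float("-inf")
    for segmentation in enumerate_segmentations(cuts):
        total = sum(score_segment(s, e) for s, e in segmentation)
        mean = total / len(segmentation)
        if mean > best_score:
            best, best_score = segmentation, mean
    return best


def toy_score(start, end):
    """Toy scorer (assumption): characters are expected to be 4 pixels wide."""
    return -abs((end - start) - 4)


candidate_cuts = [0, 2, 4, 6, 8]
print(len(list(enumerate_segmentations(candidate_cuts))))
print(best_segmentation(candidate_cuts, toy_score))
```

Because the number of solutions doubles with every additional candidate cut, a line with many candidate cuts, as is typical when characters consist of several components, quickly becomes expensive to analyze in this exhaustive manner.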