Text is the representation of written language. Printed text can be converted to machine-readable form using, for example, optical character recognition (OCR). OCR is the electronic conversion of scanned images of text into machine-encoded text. The machine-encoded text may then be electronically searched and/or used in various machine processes, such as text mining, machine translation, etc. When an OCR application is run on a scanned image, boundary information for the text is created. In character recognition, a boundary can be a real or imaginary rectangle that serves as the delimiter between consecutive letters, numbers, and/or symbols, including logographic characters (e.g., Chinese or Japanese characters). The boundary information can include the rectangular coordinates for the lines that make up Chinese or Japanese characters.
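By way of illustration only, boundary information of this kind may be modeled as one axis-aligned rectangle per character. The following is a minimal sketch under stated assumptions: the `BoundingBox` type, its field names, the `horizontal_gap` helper, and the pixel values are all hypothetical and are not taken from any particular OCR application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned rectangle in image pixel coordinates (hypothetical layout)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def horizontal_gap(a: BoundingBox, b: BoundingBox) -> int:
    """Pixels between a's right edge and b's left edge; negative if they overlap."""
    return b.left - a.right


# Illustrative values only: two adjacent characters on one line of text.
boxes = [BoundingBox(10, 5, 40, 35), BoundingBox(44, 5, 74, 35)]
gap = horizontal_gap(boxes[0], boxes[1])
```

In this sketch, the space between the two rectangles acts as the delimiter between consecutive characters; a downstream process that relies on the boundary information would read the four coordinates of each rectangle.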
Typically, when a scanned image is of poor quality or contains logographic characters (e.g., Japanese or Chinese characters), the OCR application may make mistakes in detecting the boundaries, and applications and processes that rely on the boundary information may generate incorrect results. For example, portions of the same character may be treated as separate characters, or separate characters may be treated as portions of the same character, causing typographical and grammatical errors. Editors may spend a significant amount of time trying to detect and correct the errors.
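The two kinds of boundary error described above may be illustrated with a short sketch. It assumes, purely for illustration, that a set of reference rectangles (for example, manually verified ones) is available for comparison; the `Box` layout, the `contains` and `classify_errors` helpers, and all coordinates are hypothetical. The first reference box stands for a character such as 明 whose left and right components (日 and 月) are detected as two separate characters.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; hypothetical layout for illustration.
Box = Tuple[int, int, int, int]


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def classify_errors(expected: List[Box], detected: List[Box]) -> List[str]:
    """Report characters that were split apart or merged together."""
    errors = []
    # Split: one expected character was detected as several smaller boxes.
    for i, exp in enumerate(expected):
        parts = [d for d in detected if d != exp and contains(exp, d)]
        if len(parts) >= 2:
            errors.append(f"split: expected box {i} detected as {len(parts)} boxes")
    # Merge: one detected box swallows several expected characters.
    for j, det in enumerate(detected):
        covered = [e for e in expected if e != det and contains(det, e)]
        if len(covered) >= 2:
            errors.append(f"merge: detected box {j} covers {len(covered)} expected boxes")
    return errors


# Illustrative values only: three reference characters on one line.
expected = [(0, 0, 30, 30), (34, 0, 64, 30), (68, 0, 98, 30)]
# The first character is split in two; the second and third are merged.
detected = [(0, 0, 14, 30), (16, 0, 30, 30), (34, 0, 98, 30)]
errors = classify_errors(expected, detected)
```

In practice, reference rectangles are generally not available for an arbitrary scanned image, which is one reason such errors are costly for editors to find by hand.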