Optical character recognition (OCR) is the computer-based translation of an image of text into machine-editable text, generally in a standard encoding scheme. This process eliminates the need to manually type the document into the computer system. An OCR process typically begins by obtaining an electronic image of a physical document bearing the printed text, for example by scanning the document with a device such as an optical scanner. The resulting image is then supplied to a computer or other processing device, which processes it to differentiate between images and text and to determine which letters are represented by the light and dark areas.
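The step of separating light and dark areas can be illustrated with a minimal sketch. It assumes the scanned page is already available as a grid of 0-255 grayscale values and uses a fixed threshold of 128; both the function names (`binarize`, `column_segments`) and the threshold are illustrative assumptions, and a practical OCR engine would generally use more robust binarization and layout analysis.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map each grayscale pixel to 1 (dark, likely ink) or 0 (light, likely background).

    The fixed threshold is an assumption made for illustration only.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def column_segments(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that contain dark pixels.

    Each range is a candidate glyph region separated from its neighbors
    by at least one fully light column.
    """
    if not binary:
        return []
    width = len(binary[0])
    # A column "has ink" if any row holds a dark pixel at that position.
    ink = [any(row[x] for row in binary) for x in range(width)]

    segments: List[Tuple[int, int]] = []
    start = None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a dark run begins
        elif not has_ink and start is not None:
            segments.append((start, x))    # a dark run ends
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


if __name__ == "__main__":
    page = [
        [255, 0, 0, 255, 255, 10, 255],
        [255, 0, 255, 255, 255, 10, 255],
    ]
    print(column_segments(binarize(page)))
```

Each returned range marks a run of columns containing dark pixels, which a later recognition stage could examine to decide which letter, if any, the region represents.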
As a result of the increasing use of computers and the Internet, coupled with the more frequent use of the English language around the world, it has become common to find textual images that combine Western words with East Asian (e.g., Chinese, Japanese, Korean) text, often in the form of Western words mixed within a selection of East Asian text. Accordingly, an OCR engine that is to be used with East Asian text should ideally be able to recognize a textual line containing a mix of East Asian and Western text.