Optical Character Recognition (OCR) methods convert an image of text into machine-readable code by using a character recognition method to identify the characters represented in the image.
Known optical character recognition methods start with an image that includes a string of characters and, with an OCR engine, provide an ID of the characters present in the string, i.e., an identification of the characters in machine-readable code, to obtain a searchable string of characters.
Many OCR engines exist. They have to work fast, use limited computing resources and recognize characters accurately. Speed, limited resources and accuracy are conflicting requirements and, in practice, a good OCR engine is based on trade-offs between these characteristics.
An OCR engine designed for the recognition of Latin characters (e.g. English) is different from an OCR engine designed for the recognition of Asian characters (Chinese, Japanese and Korean) or Arabic characters. For instance, the identification database is different, even if some characters such as punctuation signs and numerical digits may be present in several databases. The database of Latin characters may contain fewer than 100 characters, while the database of Asian characters may contain about 5000 characters per language. Therefore, an OCR engine designed for Asian characters typically requires more memory than an OCR engine designed for Latin characters. Because of this large discrepancy in the number of characters, algorithms that have to take the diversity of characters into account are optimized differently. The features used for character recognition are also different: the shapes of Latin characters are simpler than the shapes of Asian characters, which can contain many strokes, but the shapes of Latin characters show more variation because of the high number of Latin fonts. Furthermore, the contextual decision algorithms, which make the final decision about the character identification by using linguistic and typographic models, are different. Linguistic models for Latin languages rely mainly on a language dictionary with probabilities of occurrence of words, while linguistic models for Asian languages rely mainly on character n-grams with probabilities of occurrence. (A character n-gram is a sequence of n consecutive characters.) Another reason why OCR engines are different for Latin and Asian characters is that there are no spaces between words in Chinese or Japanese texts.
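The character n-gram model mentioned above can be illustrated with a minimal sketch. It is not the method of any particular OCR engine: the function names (`char_ngrams`, `ngram_probabilities`, `pick_candidate`), the three-line toy corpus and the use of plain relative frequencies are all assumptions made for illustration. The sketch counts character bigrams (n = 2) in a corpus and uses their estimated probabilities to choose between visually similar candidates for the next character.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every sequence of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_probabilities(corpus, n):
    """Estimate the relative frequency of each character n-gram in corpus."""
    counts = Counter()
    for line in corpus:
        counts.update(char_ngrams(line, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def pick_candidate(previous_char, candidates, probs):
    """Pick the candidate whose bigram with previous_char is most probable.

    Bigrams never seen in the corpus get probability 0.0 (no smoothing).
    """
    return max(candidates, key=lambda c: probs.get(previous_char + c, 0.0))


# Hypothetical toy corpus; a real model would be trained on far more text.
corpus = ["東京都", "京都市", "東京駅"]
probs = ngram_probabilities(corpus, 2)

# The recognizer hesitates between two similar shapes after "東".
best = pick_candidate("東", ["束", "京"], probs)
print(best)
```

Because the model works on consecutive characters rather than on whole words, it does not need the text to be split into words first, which fits scripts written without spaces between words.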
Altogether, using a known OCR engine for multiple types of characters, such as Latin and Asian, does not provide the desired outcome of being accurate, fast and undemanding in computing resources. This is why known OCR engines are typically designed for only one type of character, and if a known OCR engine includes the possibility of recognizing another type of character, its accuracy for that other type is typically low. This lack of accuracy is especially problematic because many documents today contain a mix of different types of characters, for example a Japanese invoice or purchase order that contains Japanese text but also English names, English postal addresses, email addresses, amounts in numbers, and so on.