Character recognition for East-Asian languages, such as Japanese, Chinese, or Korean, is a challenging pattern-recognition problem. Several aspects of these characters make the problem difficult for conventional machine learning techniques based on Bayesian approaches, neural networks, support vector machines, template matching, and nearest neighbor algorithms. One aspect that contributes to the difficulty is the total number of characters in each of the languages. The number of common characters in an East-Asian language typically ranges from 8,000 to more than 20,000. Several characters are shared among these languages, but the number of unique characters per language is still very large. Current recognition and classification approaches are effective for languages with fewer than a few hundred classes, such as English, French, and German. However, these approaches do not directly scale to the thousands or even tens of thousands of East-Asian characters.
Another challenge is segmenting an East-Asian character in handwritten form. In handwritten form, strokes may merge into continuous curves. The merging of strokes is so commonplace that common handwritten character forms have evolved that do not look like their printed counterparts but allow for easy reading and writing. This frequent merging of strokes makes stroke segmentation very difficult. Therefore, traditional stroke decomposition approaches that rely on stroke segmentation for character recognition perform poorly on handwritten or cursive characters.