A document to be processed by OCR is usually first presented as an electronic image obtained from a camera or a scanner, or by conversion from one file format to another. In document processing, OCR is typically performed automatically or semi-automatically by a software program or program module executed on a personal, mobile or other computer. However, there are many barriers to successful OCR, that is, OCR processing with a high rate of correct recognition, especially when attempting to recognize CJK and other types of glyphic characters.
Documents written by CJK speakers include characters of one or more glyphic languages and increasingly include non-standard characters (letters, symbols, numerals, punctuation marks) from one or more other languages, including European languages. Such non-CJK languages are generally based on a Latin, Cyrillic or other non-glyphic alphabet. Herein, reference is made to CJK characters, but such reference is shorthand for all varieties of glyphs, characters, tetragraphs, tetragrams, symbols, ideographs, ideograms and the like.
Written or printed text in a European language usually draws on a repertoire of 100-150 standardized characters that are combined repeatedly to form phonetic words. In contrast, texts in CJK languages usually use a subset of 30,000-40,000 available characters. A typical person routinely exposed to CJK characters encounters about 5,000 different CJK characters per day. Because of the size of this character set, it is difficult or impossible to recognize CJK texts by the ordinary methods and techniques used to recognize characters and words written in Roman, Latin or Cyrillic alphabets.
FIG. 1 is an example of an image of a document 100 that includes CJK text 102 (Japanese) in a horizontal direction and CJK text in a vertical direction. The CJK text 102 also includes Roman characters mixed with the CJK characters. The document 100 also includes a region 104 with a portrait or picture and a caption under the picture. FIG. 2 is an English translation 200 of the CJK text of FIG. 1.
While reading CJK characters is a relatively easy task for a person, a machine often has difficulty isolating and recognizing CJK characters. One difficulty arises when alphanumeric and other non-CJK characters are mixed into traditional CJK writing. Another difficulty arises when the direction of writing cannot easily be ascertained: CJK writing often does not include any punctuation, and it may run in different directions on a single page of text. Further difficulties can arise when both traditional and simplified CJK characters are mixed together, as is often the case in formal printed publications.
There are various methods of attempting to overcome the difficulties in recognizing CJK characters. Recognition methods can generally be divided into two types. The first type recognizes each character as it is being written, and is a form of online or active recognition. This type of recognition often involves analyzing strokes as they are entered by a stylus or finger on a touch-sensitive screen.
The second type of recognition involves segmenting individual CJK characters on each page of a document and then recognizing each character by matching it to a character in a database of characters. This type of recognition is termed offline recognition, and can be divided into handwritten character recognition (HCR) and printed character recognition (PCR). In each of these types of offline recognition, feature matching, structural analysis, or both are performed. The techniques described herein apply to both HCR and PCR, and generally to all types of offline and online recognition of CJK characters.
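The matching step of offline recognition can be illustrated with a minimal sketch. It assumes each segmented character has already been binarized and scaled to a fixed-size bitmap, and it uses a plain pixel-difference (Hamming) distance as a stand-in for feature matching; the names `hamming`, `recognize` and `TEMPLATES`, the 5x5 grid and the three-entry database are hypothetical and chosen only for illustration, not taken from the described techniques.

```python
from typing import Dict, List

# A bitmap is a list of equal-length rows; '#' is ink, '.' is background.
Bitmap = List[str]


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count pixel positions at which two same-sized bitmaps differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have identical dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph: Bitmap, templates: Dict[str, Bitmap]) -> str:
    """Return the label of the database entry closest to the glyph."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))


# Toy "database of characters" (hypothetical, 5x5 pixels per entry).
TEMPLATES: Dict[str, Bitmap] = {
    "一": [".....", ".....", "#####", ".....", "....."],
    "十": ["..#..", "..#..", "#####", "..#..", "..#.."],
    "口": ["#####", "#...#", "#...#", "#...#", "#####"],
}

# A segmented glyph with one pixel missing from the horizontal stroke.
noisy = ["..#..", "..#..", "####.", "..#..", "..#.."]
print(recognize(noisy, TEMPLATES))  # prints 十
```

A database covering tens of thousands of characters would make this exhaustive pixel comparison slow and fragile, which is one reason practical systems compare compact features or structural descriptions instead of raw pixels.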
CJK characters generally occupy a square area into which the components or strokes of every character are written to fit. This convention allows CJK characters to maintain a uniform size and shape, especially with small printed characters in either sans-serif or serif style. The uniform size and shape allow dense printing of such CJK characters. However, dense printing can be a source of trouble for segmenting and recognizing CJK characters, lines and paragraphs. There are many ways that segmenting, recognition and processing of CJK characters can be improved.
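The square-area property suggests one simple way to segment a horizontal line of text, sketched below under stated assumptions. The line is assumed to be an already binarized bitmap; columns containing no ink are treated as gaps, and adjacent ink runs are merged greedily as long as the merged width does not exceed the line height (the assumed side of the square cell). The functions `ink_runs` and `merge_to_squares` and the sample `LINE` are hypothetical illustrations, not the segmentation method of the described techniques.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # half-open column range [start, end)


def ink_runs(line: List[str]) -> List[Run]:
    """Return the column ranges that contain at least one ink pixel ('#')."""
    width = len(line[0])
    has_ink = [any(row[x] == "#" for row in line) for x in range(width)]
    runs: List[Run] = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a run of ink columns begins
        elif not ink and start is not None:
            runs.append((start, x))   # the run ends at a blank column
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


def merge_to_squares(runs: List[Run], height: int) -> List[Run]:
    """Merge adjacent runs while the merged width still fits a square cell."""
    cells: List[Run] = []
    for start, end in runs:
        if cells and end - cells[-1][0] <= height:
            cells[-1] = (cells[-1][0], end)
        else:
            cells.append((start, end))
    return cells


# Two 5x5 characters: three separate vertical strokes, then a box.
LINE = [
    "#.#.#.#####",
    "#.#.#.#...#",
    "#.#.#.#...#",
    "#.#.#.#...#",
    "#.#.#.#####",
]

runs = ink_runs(LINE)
print(runs)                               # [(0, 1), (2, 3), (4, 5), (6, 11)]
print(merge_to_squares(runs, len(LINE)))  # [(0, 5), (6, 11)]
```

The blank-column test alone splits the first character into three pieces, and the square-cell merge restores it. With dense printing the gaps between characters can disappear entirely, and narrow non-CJK characters break the square assumption, so this greedy rule fails in exactly the cases the paragraph above describes.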