On-line Chinese character recognition (OLCCR) systems provide a man-computer interface for Chinese-character-based operating systems and/or applications; they can be used alone, but are typically used in conjunction with other application software programs. Unlike the English language, which consists of 26 letters and a few punctuation marks, the Chinese language is a large-vocabulary language comprising literally tens of thousands of "characters" (i.e., words), which are further combined to become meaningful phrases. A Chinese character will not be recognized if it does not exist in the reference database. Therefore, a very large reference database would be required in developing a character recognition system for Chinese characters. This may not pose a problem for off-line character recognition systems. However, the data size (for storing template characters against which the input script can be compared) becomes a very significant issue for on-line character recognition systems, especially when the character recognition systems are to be used in portable devices such as personal digital assistants (PDAs). In order to increase the "vocabulary" of a portable device and thereby increase the usefulness of the character recognition system, it is important to develop a method which can substantially reduce the memory space required for storing a very large number of reference (i.e., template) Chinese characters.
While the written Chinese language comprises such a huge number of characters, essentially all of them can be decomposed into a combination of "components", which are sometimes called "radicals". Thus, although there are essentially an infinite number of possible Chinese characters, there exist only a limited number of components. In most cases, a "component" is also a character itself, but of a much simpler construction. Since all the characters share a significantly smaller number of components, it is, theoretically, possible to store the components rather than the characters to reduce the memory space requirement. However, this is complicated by the fact that, in Chinese characters, the arrangement of the components (i.e., the spatial relationship between or among the components) is not linear as in, for example, the English language. Rather, the components in Chinese characters are arranged in a two-dimensional relationship--the arrangement can be top-down, left-right, outside-inside, or, in many circumstances, combinations thereof. Moreover, the dimension of one component can be substantially larger or smaller, two-dimensionally speaking, than other components of the same character. This can be a result of the intrinsic characteristics of the character itself, or as a result of a personal writing habit. These factors make the development of an on-line component-based character recognition system extremely difficult.
Lu and Suen proposed a hierarchical attributed graph representation for off-line hand-written Chinese characters. In their approach, each component is represented by an attributed graph, in which each vertex stands for a stroke. Each character is also represented by an attributed graph, in which each vertex stands for a component. A hierarchical attributed graph matching is then developed for character recognition. This method does not result in a substantial saving in the memory storage space when the reference data contains a large number of template characters.
Shiau et al proposed an on-line Chinese character recognition system which utilizes a hierarchical representation. Their system involves a fixed stroke number and a fixed stroke order, by which each component is represented by a string of symbols, which contain the information of basic stroke types and the spatial relationship between two contiguous strokes. Each Chinese character is then represented by the constituent components. When a character is retrieved from the reference database for character matching, the string of symbols of the character can be obtained by concatenating the symbol strings of the constituent components.
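The retrieval step described above, in which a character's symbol string is obtained by concatenating the symbol strings of its components, can be sketched as follows. This is only an illustration: the stroke-type symbols, the component names, and the two example characters are assumptions made here and are not taken from Shiau et al.

```python
# Hypothetical symbol strings: one letter per basic stroke type.
# The alphabet (H, V, L, R, T) and the encodings are illustrative only.
COMPONENT_STRINGS = {
    "wood": "HVLR",   # component 木: horizontal, vertical, left-falling, right-falling
    "mouth": "VTH",   # component 口: vertical, turning stroke, horizontal
}

# Each character stores only the names of its constituent components,
# in the fixed writing order that such a system assumes.
CHARACTERS = {
    "forest": ["wood", "wood"],     # 林
    "apricot": ["wood", "mouth"],   # 杏
}


def character_string(name):
    """Build a character's symbol string by concatenating component strings."""
    return "".join(COMPONENT_STRINGS[c] for c in CHARACTERS[name])
```

Because the component strings are stored once and shared, adding a character costs only a short list of component names. The sketch also shows why a fixed stroke number and stroke order are needed: any deviation changes the string.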
Morishita et al proposed a method for Chinese/Japanese character recognition (Japanese characters share many common strokes with Chinese characters), in which the stroke number and the x and y coordinates of the center points of the strokes are utilized to decompose a character into its components, which are assumed to have a left-to-right or top-to-bottom relationship. The method developed by Morishita et al can be applied recursively to correctly decompose the constituent components from characters with complicated structures, but it fails for characters with connected strokes.
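The idea of separating components by stroke center points can be illustrated with a simple largest-gap split. This is not the published procedure of Morishita et al, whose details are not given here; the function name and the splitting criterion are assumptions chosen only to make the idea concrete.

```python
def split_by_centers(centers, axis=0):
    """Split stroke indices into two groups at the largest gap between
    stroke center coordinates along one axis (0 = x, 1 = y).

    Illustrative heuristic only; the splitting criterion is an assumption.
    """
    if len(centers) < 2:
        raise ValueError("need at least two strokes to split")
    # Order the strokes by their center coordinate along the chosen axis.
    order = sorted(range(len(centers)), key=lambda i: centers[i][axis])
    # Distance between each pair of neighbouring centers in that order.
    gaps = [centers[order[k + 1]][axis] - centers[order[k]][axis]
            for k in range(len(order) - 1)]
    # Cut after the stroke that precedes the widest gap.
    cut = gaps.index(max(gaps)) + 1
    return order[:cut], order[cut:]
```

Calling the function with `axis=0` models a left-to-right split and `axis=1` a top-to-bottom split; applying it again to either group gives a recursive decomposition. A stroke that runs through both components has its center between them, which is one plausible reason a center-based split breaks down on connected strokes.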
Correct decomposition of radicals (i.e., components) is a difficult but critical procedure for on-line Chinese character recognition without writing constraints (such as stroke order and stroke number) using a hierarchical reference database. To some extent, the same problem is experienced in off-line Chinese character recognition systems. In the development of off-line Chinese optical character recognition (OCR) systems, the extraction of radicals has been investigated by many researchers. Typically, the background information of a handwritten character is utilized to separate radicals. However, in the conventional approach, radicals that touch each other or are inherently connected cannot be satisfactorily separated. Cheng and Hsu proposed a method for separating radicals according to the heuristics of the stroke connections. About 75% of the most frequently used Chinese characters can be divided into two parts by means of this separating strategy. However, the remaining 25% of the Chinese characters must be processed according to other special rules. Other researchers have proposed off-line Chinese character recognition methods which utilize relaxation and graph matching methods to identify radicals. These methods, however, are computationally intensive.
As discussed above, a hierarchically structured component-based Chinese character recognition system may substantially reduce the memory space requirement, thus facilitating the development of an on-line system for portable devices with limited hardware resources. However, because of the complexity of Chinese characters, which involve two-dimensional spatial relationships among the constituent components and variable component sizes, no such system has been successfully developed.
The primary object of the present invention is to develop an improved on-line Chinese character recognition system which can be used for portable computing devices with limited hardware resources such as personal digital assistants. More specifically, the primary object of the present invention is to develop an improved on-line Chinese character recognition system, which includes a hierarchically structured reference database and a method of using the same, for use in an on-line handwritten Chinese character recognition process which allows the required data storage space to be substantially reduced while providing excellent character recognition efficiency. The system disclosed in the present invention can also be advantageously utilized to generate an essentially infinite number of Chinese-character-based print and/or screen fonts.
The present invention discloses an improved on-line Chinese character recognition system which includes a novel hierarchically structured database for storing template Chinese characters, against which an input handwritten Chinese script will be compared and recognized. The improved on-line Chinese character recognition system is a rule-based system and the method of the hierarchical representation of the template Chinese characters involves storing (i.e., describing) the template characters as comprising three major parts: character patterns, spatial relationships between strokes, and stroke correspondence rules, all according to a hierarchical structure.
With the method disclosed in the present invention, instead of storing the stroke correspondence rules of each and every character, the stroke correspondence rules of components are stored in the reference (i.e., template) database. Each template character (of the vocabulary of characters) is described by: (1) its constituent component codes and (2) the character composition structure. The character composition structure is also called the "character structure", which describes the spatial relationship of its constituent components. In the present invention, which involves a rule-based system, a handwritten character is compared against a template character according to whether the handwritten character conforms to a set of rules defining the template character. In retrieving the rules of a character, the rules of its constituent components are fetched according to a predefined sequence associated with that template character. The use of such a predefined sequence is necessary for Chinese characters because of the complex spatial relationship between the components. In most written languages, for example the English language, the constituent letters of a word are arranged only linearly in one direction, i.e., from left to right. For Chinese characters, the constituent components are arranged in various two-dimensional manners, from left-to-right, top-to-bottom, outside-to-inside, or even combinations thereof. Thus one or more predefined sequences are necessary in order for the components (or, more specifically, the rules of the constituent components) to be extracted (or retrieved) from a character. The provision of the series of predefined sequences for extracting components from a character is an important element of the system disclosed in the present invention.
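A minimal sketch of this description scheme is given below, assuming a template character is stored as a character structure plus a tuple of component codes, and that each structure carries one predefined fetch sequence. The code values, rule strings, and the particular sequences (including fetching the inner component first for an outside-inside structure) are arbitrary assumptions for illustration and are not taken from the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateCharacter:
    structure: str          # character structure, e.g. "left-right"
    component_codes: tuple  # codes of the constituent components


# Hypothetical predefined sequences: the order in which component slots
# are visited when fetching rules. The orderings are arbitrary examples.
FETCH_SEQUENCE = {
    "left-right": (0, 1),
    "top-bottom": (0, 1),
    "outside-inside": (1, 0),
}

# Stroke correspondence rules are stored once per component, not per character.
COMPONENT_RULES = {
    "C_WOOD": ["wood-r1", "wood-r2"],
    "C_MOUTH": ["mouth-r1"],
    "C_ENCLOSURE": ["enclosure-r1"],
    "C_JADE": ["jade-r1"],
}

CHARACTER_DESCRIPTION = {
    "apricot": TemplateCharacter("top-bottom", ("C_WOOD", "C_MOUTH")),         # 杏
    "country": TemplateCharacter("outside-inside", ("C_ENCLOSURE", "C_JADE")), # 国
}


def fetch_rules(template):
    """Fetch component rules in the predefined sequence of the structure."""
    sequence = FETCH_SEQUENCE[template.structure]
    return [COMPONENT_RULES[template.component_codes[i]] for i in sequence]
```

For example, `fetch_rules(CHARACTER_DESCRIPTION["apricot"])` returns the rules of the upper component followed by those of the lower one, without any per-character copy of the rules being stored.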
In the present invention, the three major parts (i.e., character patterns, spatial relationships between strokes, and stroke correspondence rules) which collectively define a character are themselves defined as comprising the following five items, or databases:
(1) the database of character description;
(2) the database of stroke correspondence rules of components;
(3) the database of character structures;
(4) the database of standard component patterns; and
(5) the database of spatial relationships between strokes of components.
The database of character description stores the character structure and component codes for each template character. The database of stroke correspondence rules of components stores the stroke correspondence rules for the strokes contained in every component. The database of character structures stores several kinds of information for constructing a character from, or decomposing it into, its constituent components, including the synthesis rules for the character patterns (i.e., the synthesis rules at the character level from the constituent components), the decomposition rules of components for each character structure, and the spatial relationships between the components for each character structure. The database of standard component patterns stores the normalized standard component patterns. Finally, the database of spatial relationships between strokes of components stores the spatial relationships between strokes of each component.
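The five databases can be pictured as five lookup tables held together in one container. The sketch below is an assumed in-memory layout, not the storage format of the invention; the field names, key choices, and sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceDatabase:
    # (1) character name -> (character structure, component codes)
    character_description: dict = field(default_factory=dict)
    # (2) component code -> stroke correspondence rules of that component
    stroke_rules: dict = field(default_factory=dict)
    # (3) structure name -> synthesis rules, decomposition rules, spatial relations
    character_structures: dict = field(default_factory=dict)
    # (4) pattern code -> normalized standard component pattern
    component_patterns: dict = field(default_factory=dict)
    # (5) component code -> spatial relationships between its strokes
    stroke_relations: dict = field(default_factory=dict)

    def add_character(self, name, structure, component_codes):
        """Describe a character by its structure and component codes only."""
        self.character_description[name] = (structure, tuple(component_codes))

    def components_in_use(self):
        """Set of component codes referenced by at least one character."""
        return {code
                for _, codes in self.character_description.values()
                for code in codes}


db = ReferenceDatabase()
db.stroke_rules["C_WOOD"] = ["hypothetical rule 1", "hypothetical rule 2"]
db.stroke_rules["C_MOUTH"] = ["hypothetical rule 3"]
db.add_character("forest", "left-right", ["C_WOOD", "C_WOOD"])    # 林
db.add_character("apricot", "top-bottom", ["C_WOOD", "C_MOUTH"])  # 杏
```

In this layout the two characters together reference the wood component three times, yet its rules occupy a single entry, which is the source of the storage saving.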
In the method disclosed in the present invention for recognizing handwritten Chinese characters, a template Chinese character is first retrieved from the hierarchically structured reference database. In the database of character description, each template character is represented by the rule code(s) of its constituent component(s) and the character structure associated therewith. The database of stroke correspondence rules of components contains stroke correspondence rules for all the components, which are denoted by the rule codes. The rule codes (of components) provide the connection to form a hierarchical relationship between the character (as in the database of character description) and the strokes (as in the stroke correspondence rules of the components) for saving the required memory storage space. However, the actual task is much more complicated. As described before, because of the uniquely complicated two-dimensional relationships between the constituent components in Chinese characters, two other links, in the form of two separate databases, are provided in the present invention to ensure the workability of the hierarchical relationship between the characters and the stroke correspondence rules of components. One is the database of character structures, which contains the synthesis rules of the character patterns, the decomposition rules of character structures and the spatial relationships between the components in a character. The other is the database of standard component patterns (pattern codes), which stores the coordinates of the extreme points of the line segments that constitute the standard patterns of components. A component will have only one pattern code, but may have more than one rule code. A mapping table is provided to store the mapping relationships between the rule code(s) and a pattern code.
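Since a component has exactly one pattern code but possibly several rule codes, the mapping table is a many-to-one map from rule codes to pattern codes. The following sketch assumes string codes and a pattern stored as a list of line segments given by their end points; the code values and coordinates are invented for illustration.

```python
# Hypothetical mapping table: several rule codes (e.g. alternative ways of
# writing one component) map to the single pattern code of that component.
RULE_TO_PATTERN = {
    "R017a": "P017",
    "R017b": "P017",
    "R042": "P042",
}

# Hypothetical standard patterns: each is a list of line segments,
# stored as the coordinates of the segment end points.
STANDARD_PATTERNS = {
    "P017": [((0, 50), (100, 50)), ((50, 0), (50, 100))],
    "P042": [((0, 0), (0, 100)), ((0, 0), (100, 0)), ((100, 0), (100, 100))],
}


def pattern_for_rule(rule_code):
    """Follow rule code -> pattern code -> standard component pattern."""
    return STANDARD_PATTERNS[RULE_TO_PATTERN[rule_code]]


def rule_codes_for_pattern(pattern_code):
    """All rule codes that share one pattern code, in sorted order."""
    return sorted(r for r, p in RULE_TO_PATTERN.items() if p == pattern_code)
```

Keeping the map in the rule-to-pattern direction makes the common lookup (best-matching rule code to displayable pattern) a single dictionary access, while the reverse lookup remains available when needed.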
When an input script is matched against a template character, the input script is decomposed into one or more components (described as component codes), in accordance with the decomposition rule of that character (via the database of character description and the database of character structure). A number of decomposition rules are stored in the database of character structure to suit the various writing styles. Once the component codes are extracted, the stroke correspondence rules are retrieved, via the database of stroke correspondence rules of components. From the rule code(s), which allow the best match of strokes to be obtained, the corresponding pattern code(s) can be retrieved via the mapping table. Alternatively, the standard component pattern(s) can also be retrieved via the pattern code(s). By using the standard patterns of the constituent components in conjunction with the synthesis rules of the character structure, the user can also synthesize the standard pattern of the template character. Thus, the database system disclosed in the present invention can also be advantageously utilized to generate Chinese characters on-the-fly for screen display or printing hard copies.
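The synthesis step can be sketched as placing each normalized component pattern into the sub-region that the character structure assigns to it. The example assumes patterns normalized to the unit square and synthesis rules expressed as bounding boxes with equal halves; real synthesis rules would likely be more detailed, and all names and values here are hypothetical.

```python
# Hypothetical synthesis rules: for each character structure, the bounding
# box (x0, y0, x1, y1) within the unit square given to each component.
SYNTHESIS_BOXES = {
    "left-right": [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],
    "top-bottom": [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 1.0, 1.0)],
}


def place(segments, box):
    """Scale and translate a normalized pattern into a bounding box."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [((x0 + ax * w, y0 + ay * h), (x0 + bx * w, y0 + by * h))
            for (ax, ay), (bx, by) in segments]


def synthesize(structure, component_patterns):
    """Combine component patterns into one character pattern."""
    character = []
    for segments, box in zip(component_patterns, SYNTHESIS_BOXES[structure]):
        character.extend(place(segments, box))
    return character


# A normalized component consisting of one horizontal stroke at mid height.
HORIZONTAL = [((0.0, 0.5), (1.0, 0.5))]
```

Calling `synthesize("left-right", [HORIZONTAL, HORIZONTAL])` yields two half-width strokes side by side. The same stored component pattern therefore serves every character that contains it, which is what makes on-the-fly font generation from the recognition database plausible.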
In a preferred embodiment of the present invention, which was based on a collection of 5,401 of the most frequently used Chinese characters, the hierarchical database system disclosed in the present invention has been shown to be able to reduce the required memory storage space by almost 75%, while providing the same or better recognition accuracy. Thus, the present invention presents an important avenue by which portable computing devices with limited hardware resources, such as personal digital assistants, can provide on-line character recognition for users of Chinese characters. While the system developed in the present invention provides the most advantageous utility for Chinese-character-based word processors, it can also be utilized, in conjunction with any software application program, including English and/or other non-Chinese-character-based word processors, to provide the capability of inputting, displaying, and printing Chinese characters without substantially inflating the program size. Furthermore, the method disclosed in the present invention can be applied to other non-Chinese languages which involve non-linear arrangements of character components, such as the Japanese and Korean languages. Additionally, because of the efficient use of memory space, the hierarchical database system structure developed in the present invention can be conveniently embedded into firmware for use with a printer, which can be a dot matrix printer or preferably a laser printer, so as to facilitate Chinese character printing capability without incurring a large memory requirement.