The present invention relates generally to data processing systems and, more particularly, to radical definition and dictionary creation in a handwriting recognition system.
Kanji is a Japanese system of writing that utilizes characters borrowed or adapted from Chinese writing. The elements of grammar in Kanji are known as “Kanji characters.” The phrase “elements of grammar” refers to units of a given natural language that are capable of comprising parts of speech. For example, the elements of grammar in the English language are words. As such, each Kanji character is a higher order linguistic symbol that is analogous to a word in the English language. That is, natural languages tend to have three levels of linguistic elements. The lowest of these levels depends on the specific alphabet used and is associated with the sounds of the spoken language. For example, the first and lowest level of linguistic elements in the English language comprises letters. The third level of linguistic elements is the highest level and contains linguistic elements conveying full creative expression. In the English language, the third level comprises sentences. It is the second level of linguistic elements to which the phrase “elements of grammar” refers. This second level is an intermediate level of linguistic elements and in the English language, the second level comprises words. In Japanese, the second level comprises Kanji characters.
Kanji characters typically comprise radicals. A “radical” is a part of a Kanji character, much like letters are part of a word. Oftentimes, a radical is itself a Kanji character. For example, FIG. 1 depicts a Kanji character 100 that comprises two radicals 102 and 104. Radical 102 is the “day” radical and radical 104 is the “month” radical. When combined, the resulting Kanji character 100 means “open.” There is a well-known, standard set of 214 radicals that are referred to as “traditional radicals.” FIGS. 2A and 2B depict the set of traditional radicals 200. Within the set of traditional radicals 200, each radical is numbered from 1 to 214, with alternative drawings indicated with either parentheses or brackets (e.g., “(32)”).
Some conventional computer systems for recognizing Kanji handwriting have focused on recognizing traditional radicals in order to recognize a Kanji character. This technique is known as “radical recognition.” These conventional systems have attained higher accuracy in recognizing Kanji characters than previous systems, and have reduced the amount of data that must be stored when performing Kanji character recognition. However, the conventional radical recognition approach suffers from a few drawbacks. First, it is difficult to determine which of the traditional radicals should be used. Some of the traditional radicals are individual (“atomic”) radicals and others are combinations of atomic radicals. Hence, a decision must be made whether to use the atomic radicals, the combination radicals, or both. A second drawback is that after the set of radicals is determined, each radical typically must be manually entered into a database and mapped onto the Kanji characters that utilize the radicals. This procedure is time-consuming. The third drawback stems from the conventional approach being nonextensible. That is, the conventional approach cannot be used with non-Kanji-based languages. Also, after the radicals are mapped onto the Kanji characters, if the system is to be extended to recognize new Kanji characters, the set of radicals and the set of Kanji characters that are recognized usually have to be augmented manually, which is a time-consuming task. That is, the additional Kanji characters have to be entered manually into the system and associated with their component radicals. Augmenting the set of Kanji characters that are recognized is a likely possibility since there are over 500,000 Kanji characters and most Kanji handwriting recognition systems only recognize a few thousand. Based upon these drawbacks, it is desirable to improve conventional radical recognition systems.
The system described herein automatically defines a set of radicals to be used in a Kanji character handwriting recognition system and automatically creates a dictionary of the Kanji characters that are recognized by the system. As a result, the system described herein facilitates the development of Kanji handwriting recognition systems and attains a higher accuracy over conventional systems when recognizing Kanji handwriting. Additionally, the system described herein is fully extensible and can therefore be extended with little effort to recognize different languages. Moreover, if the system described herein is used for Kanji character recognition, it can be extended easily to recognize additional radicals and Kanji characters. In performing its functionality, the system described herein first obtains representative handwriting samples for each Kanji character that is to be recognized by the system. The system described herein then evaluates the samples to identify a set of subparts (“radicals”) that are common to at least two of the Kanji characters. These radicals represent component roots (“visual components”) from which the characters are formed. Each Kanji character is formed by one or more of these radicals. The radicals that are identified by the system described herein are not constrained to any preset definition (e.g., the traditional set of radicals). Thus, the radicals utilized by the system described herein may include some of the traditional radicals or may include none of the traditional radicals. After identifying the set of radicals, the system described herein generates a dictionary with a mapping of each Kanji character that is to be recognized by the system to its component radicals. After the set of radicals and the dictionary have been created, these components can be utilized during handwriting recognition.
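By way of illustration only, the two steps just described (identifying subparts shared by at least two characters, then mapping each character to its subparts) might be sketched as follows. The sketch rests on assumptions that are not drawn from this description: each handwriting sample is reduced to a sequence of hypothetical stroke codes, a candidate radical is any contiguous run of at least two strokes, and the dictionary is built by greedy longest-match segmentation. The function names, the stroke codes, and the character labels "A", "B", and "C" are all invented for the example.

```python
from collections import defaultdict

def find_shared_radicals(samples, min_len=2):
    """Return contiguous stroke runs that occur in at least two characters."""
    owners = defaultdict(set)  # stroke run -> characters that contain it
    for char, strokes in samples.items():
        n = len(strokes)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                owners[tuple(strokes[i:j])].add(char)
    return {run for run, chars in owners.items() if len(chars) >= 2}

def build_dictionary(samples, radicals):
    """Map each character to subparts by greedy longest-match segmentation."""
    by_length = sorted(radicals, key=len, reverse=True)
    dictionary = {}
    for char, strokes in samples.items():
        parts, i = [], 0
        while i < len(strokes):
            for run in by_length:
                if tuple(strokes[i:i + len(run)]) == run:
                    parts.append(run)
                    i += len(run)
                    break
            else:
                # Assumption: a stroke not covered by any shared run
                # is kept as a single-stroke unit of its own.
                parts.append((strokes[i],))
                i += 1
        dictionary[char] = parts
    return dictionary

# Hypothetical stroke codes standing in for real handwriting samples.
samples = {
    "A": ["h", "v", "h", "d", "d"],
    "B": ["h", "v", "h", "x"],
    "C": ["x", "d", "d"],
}
radicals = find_shared_radicals(samples)
dictionary = build_dictionary(samples, radicals)
```

Because the candidate runs are derived from the samples themselves, nothing in this sketch depends on the traditional set of radicals, and adding a new character only requires adding its samples and rerunning both functions.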
When performing handwriting recognition, the system described herein identifies the radicals within the handwriting and then uses the mapping to determine which Kanji character the handwriting most closely matches.
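The lookup step can be sketched in the same hedged spirit. The example below assumes that the radicals in the input handwriting have already been identified and labeled, and it scores each dictionary entry by multiset overlap (a Dice coefficient). That scoring rule is chosen here for simplicity and is not stated in this description. The labels `r_day`, `r_month`, `r_tree`, and `char_1` through `char_3` are hypothetical.

```python
from collections import Counter

def closest_character(observed, dictionary):
    """Return the entry whose radical multiset best overlaps `observed`."""
    seen = Counter(observed)

    def score(char):
        expected = Counter(dictionary[char])
        overlap = sum((seen & expected).values())  # shared radicals, with multiplicity
        return 2 * overlap / (sum(seen.values()) + sum(expected.values()))

    return max(dictionary, key=score)

# Hypothetical dictionary mapping characters to their component radicals.
dictionary = {
    "char_1": ["r_day", "r_month"],
    "char_2": ["r_day", "r_tree"],
    "char_3": ["r_tree", "r_tree"],
}
best = closest_character(["r_day", "r_month"], dictionary)
```

A practical recognizer would also need to weigh the confidence of each radical match and the spatial arrangement of the radicals, both of which this sketch ignores.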
In accordance with a first aspect of the present invention, a method for generating radicals of Kanji characters is practiced in a computer system. This method provides for receiving sample handwriting data from at least one user comprising a plurality of Kanji characters with each Kanji character comprising at least one radical that is a common component of at least two Kanji characters. Further, the method provides for examining the sample handwriting data to automatically create a set of radicals from the sample handwriting data.
In accordance with a second aspect of the present invention, a computer system for recognizing Kanji characters is provided. The computer system comprises an analyzer component for receiving sample handwriting data comprising a plurality of Kanji characters and for automatically defining a set of radicals from the sample handwriting data, and a recognizer component for receiving handwriting user input indicating an intended Kanji character and for comparing the received handwriting user input to the set of radicals to determine the intended Kanji character.