The Chinese, Japanese, Korean, and Vietnamese languages have traditionally used writing systems that employ thousands of characters of Chinese origin. In addition, scholars in Japan, Korea, and Vietnam created further characters of native origin that resemble Chinese characters in design. These latter characters are referred to as kokuji (Japanese-origin), gugja (Korean-origin) and chunon (Vietnamese-origin) characters. Because Chinese-origin and Chinese-like (kokuji, gugja, and chunon) characters are so numerous and operate on different principles from Western phonetic alphabets, there has always been a need to classify them systematically. (For conciseness, Chinese-origin and Chinese-like characters will hereinafter collectively be referred to as “Chinese-type characters.”) In languages that still employ such characters, notably Chinese, Japanese, and Korean, that need continues to be felt today.
In one conventional approach, scholars have traditionally classified characters using a standard set of character components known as radicals. Modern dictionaries typically employ 214 radicals. The exact number of radicals employed, however, depends on the script type (simplified Chinese dictionaries sometimes list 227, 187, or 154), the target audience (some modern dictionaries for non-native speakers use fewer), and/or whether alternate radical forms are counted separately. The order in which radicals are listed in dictionary tables is determined by their stroke count, which is the number of pen strokes used to compose them. The order in which radicals having the same stroke count are listed is simply a matter of convention.
Radicals serve as a form of preliminary lookup key, roughly akin to the starting letter of a word in Western language dictionaries. To look up a character in a dictionary using the traditional radical system, the first step is to determine which portion of the character constitutes the radical, and then count the remaining strokes in the character. For instance, to look up , one first recognizes that it will be classified under the 2-stroke man radical . The next step is to count the number of remaining strokes. In this case the residual stroke count is 12. Finally, one searches the dictionary section containing man radical characters containing 12 residual strokes. The result will be a set of characters selected by the radical and residual-stroke-count search criteria. (Selections of characters resulting from a query are hereinafter referred to as a “search result set” or simply “result set.”) For the example just cited, one major dictionary has a search result set comprising 14 characters; the Unicode table of characters would yield a result set of over 40 characters.
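The traditional lookup just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample entries, the names ENTRIES, build_index, and lookup, and the choice of characters are hypothetical and are not drawn from any particular dictionary. The sketch keys each character on the pair (radical, residual stroke count) and returns the corresponding search result set.

```python
from collections import defaultdict

# Hypothetical sample data: (character, radical, residual stroke count).
# A real dictionary would hold thousands of entries.
ENTRIES = [
    ("休", "人", 4),
    ("体", "人", 5),
    ("何", "人", 5),
    ("作", "人", 5),
    ("明", "日", 4),
]


def build_index(entries):
    """Group characters under the key (radical, residual stroke count)."""
    index = defaultdict(list)
    for char, radical, residual in entries:
        index[(radical, residual)].append(char)
    return index


def lookup(index, radical, residual):
    """Return the search result set for one radical and residual count."""
    return list(index.get((radical, residual), []))


INDEX = build_index(ENTRIES)

# The user has identified the man radical and counted 5 remaining strokes.
result_set = lookup(INDEX, "人", 5)
print(result_set)
```

Even in this tiny sample the query returns several candidates that the user must then scan by eye, which is the behavior that becomes burdensome when the result set grows large.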
One of the flaws of the radical system is that the number of characters selected by the combination of radical and residual-stroke-count as search criteria may often be large. (Note that some dictionaries use total-stroke count instead of residual-stroke-count, but for any given domain of characters the result sets are identical.) A second flaw is that many characters are not as straightforward as in the above example. Some guesswork may be needed to determine which portion of the character constitutes the radical: sometimes there is more than one apparent candidate, and sometimes there are no obvious candidates.
A second conventional approach, intended to speed the search for characters under the radical system, classifies the radicals by the position within a character where they are found. Thus, for instance, the New Nelson Japanese Dictionary presents separate charts of radicals for those found on the left, right, top, and bottom of characters, respectively. This enables one to find a radical a little more quickly, though it has no effect on the number of characters referenced by that radical, nor does it help with cases where it is unclear which portion of the character may constitute the radical.
A third conventional approach to classifying Chinese-type characters, embodied in a dictionary by Hadamitzky & Spahn and designed mainly to help non-native speakers, reduces the standard radical set by eliminating some of the less commonly used radicals and then placing characters traditionally classified under the now-eliminated radicals into some other radical group. While this approach may help people who struggle with some of the quirks of the classification system for rare radicals, it still does nothing to reduce the size of the search result set and in fact may increase that size.
In a fourth conventional approach to classifying characters, described in 2001 Kanji by Francis DeRoo, characters are found by looking at a set of approximate shapes for the top or top-left, and another set of shapes for the bottom or bottom-right, and thereby determining a number that corresponds to those shapes. This approach requires some skill to master, as not all of the gestalt shapes are readily apparent when compared with the actual character shapes in question. This approach has also only been developed for a small set of characters (2001 Japanese characters) and so is not readily adapted to larger sets of characters. Moreover, the lack of popularity of this system may attest to its flaws.
A fifth conventional approach, known as the Four Corners Classification, classifies characters according to the basic shapes of their corners, with various shapes being associated with one of the digits from 0 to 9. This method entails a high level of ambiguity in deciding which shape code to apply, and is extremely difficult to master. Its lack of popularity may also attest to these shortcomings.
A sixth conventional approach to classifying characters, also embodied in the New Nelson Japanese Dictionary, is to provide an intermediate table so that if a user guesses the wrong component for the radical, the user will still be redirected to the appropriate character. While this sort of cross-referencing helps alleviate some of the problems with guessing the correct radical where two candidates appear equally good, it does nothing to solve the problem where none of the character's components looks like one of the standard radicals, nor does it do anything to diminish the size of the search result set. It also may create a need for an intermediate stage in the search process, causing the user to expend more time.
A seventh conventional approach, found in many dictionaries, is to provide a list of characters ordered by their pronunciation, so that if a user does not know which radical to use as the key, the user can find it by its pronunciation. When native speakers do know the pronunciation of a character, they frequently use such indexes by reading for the simple reason that the radical system is often inadequate. Unfortunately, because of the large number of homonyms among Chinese-type characters, the number of characters selected by the system is often quite large, and so search time is still slow. Moreover, such an index is of little or no use when the user does not know how to pronounce the character. This can occur with both native and non-native speakers of the language.
An eighth conventional approach, found in software applications like KanjiLite, is to provide a chart of radicals in the form of a table. A user may click on one or more radicals in the table, and the returned selection will consist of characters containing the radicals selected. Unfortunately, this approach has not been applied outside of Japanese, and is of little or no help in cases where there is no apparent radical. Moreover, as in some of the above-mentioned methods, the search result set of characters may be quite large. Finally, many character components do not constitute radicals, so this system is not suitable for every application.
The various East Asian language input methods devised in recent years typically attempt to map Chinese-type characters to a keyboard or a numeric keypad, and so, unlike the present invention, they cannot be used in non-electronic formats or contexts. Moreover, none of the input methods devised to date employs the specific classification techniques provided by the present invention. It should be noted briefly, however, that input methods like CangJie, DaYi, and Boshiami are all based on the shape representation principle, whereby a few dozen shapes are used to represent a large variety of character components (graphemes). Because such systems are unintuitive, they may require much time to master and are rarely used except by professionally trained typists.
To understand how the present invention overcomes the limitations outlined above, one must appreciate that while modern-day radical systems consist of approximately 200 character components (214 is standard), Chinese-type characters contain many more recurring components that have not been included in any version of the radical system. To date, no particularly efficient way has been developed to categorize all of the recurring components found in Chinese-type characters, and lexicographers and linguists have never settled on a standardized set the way they have for radicals. However, because non-radical recurring components are generally less common than radicals, they would lead to a far smaller selection of resulting characters if used as a lookup key. A handful of sources have at times listed non-radical components in a table. For instance, Chinese Characters by L. Wieger classifies many characters around non-radical recurring components. Unfortunately, Wieger provides no convenient way to find characters using that method. Genealogy of Chinese Characters by R. Harbaugh attempts to classify recurring components as deriving from simpler radical forms. Unfortunately, this method suffers from the same irregularities and ambiguities as the radical system itself or the system used by DeRoo cited earlier. One obvious drawback to these approaches is that they are unwieldy, because simply finding a non-radical recurring component entails looking through many hundreds of components instead of only 214 (Japanese, Korean, traditional Chinese) or 224 (simplified Chinese), as is the case with the radical system. As a result, the time saved by reducing the size of the resulting selection is lost in finding the correct lookup key.
Embodiments of the present invention may address one or more of the above limitations by providing a way to find radicals, non-radical recurring components, and characters far more quickly than in any method previously devised. As a result, the invention makes it far easier to find characters in any system that incorporates this classification and lookup feature, and can therefore be used in a wide variety of electronic and non-electronic contexts, including dictionaries (both printed and electronic), lexical databases, and input methods. Further, an embodiment of the present invention allows one to combine multiple lookup keys when searching a character, thereby adding flexibility and ease of use for cases where determining the correct radical might be difficult, and aiding non-native speakers who might be at a loss as to how to find a character.
In accordance with one embodiment of the present invention, recurring components found in Chinese-type characters are identified, classified by stroke count, and then further classified by the number of free endpoints that they contain. Subsequently, Chinese-type characters are linked to a plurality of recurring components in the form of key-ordered pairs, taking into account the possibility of many-to-many relations (or relationships) among the characters and their constituent components. The result is an intuitive and highly efficient method, system, and/or software for classifying Chinese-type characters and their components in both electronic and non-electronic formats and applications that enables a user to easily find a target component and/or its associated characters.
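By way of illustration only, the classification and linking just described might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names COMPONENTS, PAIRS, classify_components, link, and find_characters are hypothetical, the sample components and characters are chosen arbitrarily, and the free-endpoint counts shown are placeholder values rather than counts derived from the definition used by the invention.

```python
from collections import defaultdict

# Hypothetical component table: component -> (stroke count, free endpoints).
# The free-endpoint values are illustrative placeholders only.
COMPONENTS = {
    "十": (2, 4),
    "口": (3, 0),
    "日": (4, 0),
    "月": (4, 2),
}

# Key-ordered pairs (component, character). A character may contain several
# components and a component may occur in several characters (many-to-many).
PAIRS = [
    ("十", "古"), ("口", "古"),
    ("日", "早"), ("十", "早"),
    ("日", "明"), ("月", "明"),
    ("口", "叶"), ("十", "叶"),
]


def classify_components(components):
    """Group components by (stroke count, free endpoints)."""
    table = defaultdict(list)
    for component, key in components.items():
        table[key].append(component)
    return table


def link(pairs):
    """Build both directions of the many-to-many relation."""
    by_component = defaultdict(set)
    by_character = defaultdict(set)
    for component, character in pairs:
        by_component[component].add(character)
        by_character[character].add(component)
    return by_component, by_character


def find_characters(by_component, *components):
    """Return characters containing every one of the given components."""
    if not components:
        return []
    sets = [by_component.get(c, set()) for c in components]
    return sorted(set.intersection(*sets))


CLASS_TABLE = classify_components(COMPONENTS)
BY_COMPONENT, BY_CHARACTER = link(PAIRS)

# Step 1: locate a component by stroke count and free endpoints.
print(CLASS_TABLE[(4, 0)])
# Step 2: combine two lookup keys to narrow the result set.
print(find_characters(BY_COMPONENT, "日", "十"))
```

Storing the relation as pairs and indexing it in both directions allows a search to begin from any recognizable component, and intersecting the per-component sets is one simple way to combine multiple lookup keys so that the result set shrinks with each key added.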