The ideographic character set used by the Chinese, Japanese, and Koreans to represent their spoken languages in written form consists of many thousands of characters. In fact, one characteristic of the character set is that two or more characters can be combined to form a new character. Nevertheless, each character is composed of one or more of five fundamental stroke shapes, which are:
These fundamental strokes may appear in variations ranging from simple to complex and from non-serifed to serifed, such as:
In order to construct Oriental characters on a cathode ray tube or an output device such as a printer, it is necessary to first develop a dictionary of characters, and secondly to encode the characters in a form that is compatible with the output device. One of the earlier attempts at automating a character set is found in the Japanese typewriter, which has upwards of six hundred (600) keys with each key representing a unique character. More recently, various computer manufacturers have placed on the market character generators capable of producing a family of characters. These systems have utilized either a large keyboard similar to the Japanese keyboard, or a number pad wherein the operator utilizes a plurality of number codes, each representing a Japanese character. While these systems are operable, they require a great deal of experience to operate and further, the number of key strokes, particularly in the number pad system, may be excessive. For example, a character set having fifty thousand (50,000) characters would require at least five key strokes on a ten-key keyboard in order to obtain any single character. Reduction of the number of characters under the five figure mark, that is, below ten thousand (10,000) characters, may severely limit the use of the system to rather simple and non-technical applications. It is for this reason that many of the present systems have proved inadequate.
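The key stroke count cited above follows from simple arithmetic: n key strokes on a k-key pad can distinguish at most k to the power n characters. The following is a minimal sketch of that calculation; the function name `min_keystrokes` is illustrative only and does not appear in any of the systems described.

```python
def min_keystrokes(charset_size: int, keys: int = 10) -> int:
    """Return the smallest n such that keys**n >= charset_size.

    Integer arithmetic is used instead of logarithms so that exact
    powers of the key count are not affected by rounding.
    """
    if charset_size < 1 or keys < 2:
        raise ValueError("charset_size must be >= 1 and keys must be >= 2")
    strokes = 0
    capacity = 1  # number of characters addressable with `strokes` key strokes
    while capacity < charset_size:
        capacity *= keys
        strokes += 1
    return strokes


if __name__ == "__main__":
    # Fifty thousand characters on a ten-key pad.
    print(min_keystrokes(50_000))  # prints 5
    # Exactly ten thousand characters fit in four key strokes.
    print(min_keystrokes(10_000))  # prints 4
```

Under this calculation, four key strokes suffice only while the character set stays at or below ten thousand characters, which is the limitation noted above.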
Another problem presented in automating the production of an Oriental character system is the lack of a standard hierarchical order of the characters. There are several schemes presently known to "sort" the character set, thereby facilitating dictionary lookup. One of the more popular systems used by some libraries is the "four corner system" wherein the characters are described by numbers assigned to certain stroke configurations such as set forth above and found at each corner of the generally square character. Even though this system is in use, the system does produce numerous ambiguous codes. A simpler system assigns two digits to each of three corners and is known as the "three corner system." This system permits approximately one hundred thousand (100,000) possible characters to be encoded with six-digit numbers. Nevertheless, it has been found that approximately ten percent (10%) of the numbers turn out to be ambiguous. The ambiguities plus the large size of the input string (six characters) would slow an automated operation excessively.
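An ambiguous code, in the sense used above, is a number shared by more than one character. The following sketch shows one way such collisions might be found in a code table. The character names and six-digit codes are hypothetical placeholders, not actual corner codes, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def ambiguous_codes(code_of: dict[str, str]) -> dict[str, list[str]]:
    """Return every code that is assigned to more than one character."""
    groups: dict[str, list[str]] = defaultdict(list)
    for character, code in code_of.items():
        groups[code].append(character)
    return {code: chars for code, chars in groups.items() if len(chars) > 1}


def ambiguity_rate(code_of: dict[str, str]) -> float:
    """Fraction of the distinct codes in use that are ambiguous."""
    distinct = set(code_of.values())
    if not distinct:
        return 0.0
    return len(ambiguous_codes(code_of)) / len(distinct)


if __name__ == "__main__":
    # Hypothetical table: two placeholder characters share one code.
    table = {
        "char_A": "102030",
        "char_B": "102030",
        "char_C": "405060",
        "char_D": "708090",
    }
    print(ambiguous_codes(table))
    print(ambiguity_rate(table))
```

In an automated system each ambiguous code would require an additional selection step by the operator, which adds to the key stroke count.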
A two corner system utilizes a scheme similar to the aforedescribed four corner numbers plus a phonetic sign derived from the pronunciation of the characters. Nevertheless, some ambiguities remain, although the number of key strokes per character may be reduced to approximately three with two shift keys. The difficulty with this system is that the operator must be familiar with the phoneticized dialect in order to operate the system.
For many years, the Orientals have used the "telegraph code," which is an arbitrary assignment of numbers from zero (0) to 9,999 to various characters in the dictionary. It can be readily seen that the vocabulary is limited to the ten thousand (10,000) codes in that range, and further, the encoding of a message may pose serious problems when technical terms or the like must be transmitted. This code is used, as the name implies, in the telegraph system in the Orient.
Various other systems have been developed based on phonetics using an English keyboard or a larger keyboard controlled with numerous shift keys. Again, these pose problems to one who is not familiar with the pronunciation and the phonetics of the character set.
More recently, a component method has been developed. In this system, the characters are built up from simpler elements. The basic building blocks are the five basic strokes listed above. One problem with this system is the number of key strokes necessary per character in order to develop an adequate vocabulary. In addition, in an automated system the various components tend to be distorted out of proportion when two or more components are combined to make a final character. For example, a combination of three components without compression results in either a horizontally or vertically elongated character or a triangular shaped character. As is well known, most Chinese characters are formed generally in a square shape. Thus, the "component formed" character, without some sort of manipulation of the individual components, is aesthetically inadequate to the user and consequently may be harder to read. Nevertheless, the component system has proved popular in recent years in categorizing or hierarchically sorting the Chinese character set.
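The elongation problem can be illustrated with small dot matrices. The following is a minimal sketch, assuming each component is stored as a square matrix of 0 and 1 values; it compresses each component by simple sampling so that the combined character stays square. The function names are hypothetical, and this is not presented as the compression method of any system described here.

```python
Bitmap = list[list[int]]


def scale(bitmap: Bitmap, new_rows: int, new_cols: int) -> Bitmap:
    """Resample a bitmap to new_rows x new_cols by picking source dots.

    Plain sampling can drop one-dot-wide strokes when shrinking, which
    is one reason naive compression may look poor.
    """
    old_rows, old_cols = len(bitmap), len(bitmap[0])
    return [
        [bitmap[r * old_rows // new_rows][c * old_cols // new_cols]
         for c in range(new_cols)]
        for r in range(new_rows)
    ]


def combine_left_right(left: Bitmap, right: Bitmap, size: int) -> Bitmap:
    """Place two components side by side inside a size x size square."""
    half = size // 2
    left_part = scale(left, size, half)
    right_part = scale(right, size, size - half)
    return [l_row + r_row for l_row, r_row in zip(left_part, right_part)]


def combine_top_bottom(top: Bitmap, bottom: Bitmap, size: int) -> Bitmap:
    """Place one component above another inside a size x size square."""
    half = size // 2
    return scale(top, half, size) + scale(bottom, size - half, size)


if __name__ == "__main__":
    solid = [[1] * 8 for _ in range(8)]   # placeholder component, all dots set
    blank = [[0] * 8 for _ in range(8)]   # placeholder component, no dots set
    combined = combine_left_right(solid, blank, 8)
    for row in combined:
        print("".join("#" if dot else "." for dot in row))
```

Without the scaling step, two 8 by 8 components placed side by side would occupy an 8 by 16 area, which is the horizontally elongated result described above.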
Using the component system, as suggested above, permits the linguist to sort out basic components of the character by one of the hierarchical methods set forth above. In addition, a completely different method was suggested by two Russians, Rosenberg and Kolokolov, and applied in dictionaries by Oshanin. This classification system uses the five basic forms as set forth above as they occur in the lower right hand corner of the character. Thus, it could be likened to a "one corner" system. With a hierarchical system such as this, Chinese character systems may be readily sorted. As noted above, other schemes have been developed which also achieve this end; however, none has been universally accepted.
The actual structure of an ideogram such as a Chinese character is important to any system that develops or generates characters in some mechanical or electronic way. In particular, the Chinese character system is built of pictograms which represent, albeit somewhat fancifully, the item being described. For example, the Chinese character for an urn or tripod consists of a figure having four legs and a table-like top which is representative of the urn itself. A second type of character used by the Chinese is the ideogram, which may be a combination of two pictograms or two other ideograms. The third generalized form of Chinese character is comprised of a radical or root component that indicates the general semantic category of the word, for example, a plant, a tree, or a bird, and a phonogram or phonetic component that indicates the general pronunciation of the word and thereby specifies which member of the semantic category is being represented. Thus, if one looks at various types of birds such as the oriole, the chicken, or the seagull, the root component for a bird is found in each case. Unfortunately, the relative position of the components is unpredictable as is the pronunciation. Nevertheless, such complexities are not necessarily limiting to one generating such characters, particularly if the generation is from textual material presented to the operator. This situation is quite common in a library environment such as the cataloguing of books, or in the transmission of messages from one locale to another. In both instances, the clerical function of transcribing the textual matter to some sort of an automated device does not necessarily require that the operator have a vast store of knowledge about the makeup of the various characters. Consequently, it has been possible to devise and use a scheme or system for categorizing characters such as the "one corner system" suggested here. 
Once one becomes familiar with the hierarchical sort order of a basic character set, other characters can be developed therefrom.
While, on the surface, this appears to be a relatively simple task, when the task is automated it becomes more complex as the characters may be formed in many different ways. For example, one character may be above another, or one character may be to the left or the right of another. Furthermore, the character may be a representation of three or more other separate characters.
In the past, attempts to automate a Chinese character set on a computer have generally utilized a complete matrix of each character available for output. It is readily apparent that such an approach to an automated character set or means for generating the character set is very wasteful as far as the core storage in the computer is concerned. Accordingly, it is appropriate to utilize off-line storage in such systems. Therefore, a scheme which not only reduces the number of stored characters, but also reduces the space utilized to store characters is appropriate. Further, the reduction of dictionary size reduces access time to search and retrieve a particular stored character.
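The storage argument can be made concrete with a rough estimate. In the sketch below, the 32 by 32 dot matrix and the count of 2,000 stored components are purely illustrative assumptions, since no matrix size or component count is given here; only the fifty thousand character figure is taken from the example earlier in this section. The function name `bitmap_bytes` is likewise hypothetical.

```python
def bitmap_bytes(count: int, rows: int, cols: int) -> int:
    """Bytes needed to store `count` dot matrices of rows x cols dots,
    one bit per dot, with each matrix padded to a whole number of bytes."""
    bytes_per_matrix = (rows * cols + 7) // 8
    return count * bytes_per_matrix


if __name__ == "__main__":
    # Assumed 32 x 32 matrix; a complete matrix stored for every character.
    full_set = bitmap_bytes(50_000, 32, 32)
    # Assumed 2,000 stored components from which characters are composed.
    components_only = bitmap_bytes(2_000, 32, 32)
    print(full_set)          # prints 6400000
    print(components_only)   # prints 256000
```

The estimate ignores whatever table would record how components are combined into each character, so it overstates the saving somewhat, but it indicates why storing a complete matrix for every character is costly.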
Finally, present systems fall short in that the compression of characters necessary to combine two or more characters has proved inadequate.
Accordingly, this invention is directed to overcoming one or more of the problems as set forth above.