As the use of computers proliferates around the world, so that peoples representing the vast majority of languages now regularly produce documents and carry on international communication using their computers and work stations, it is increasingly important that the information passed among speakers of different languages be compatible with the printing and display systems used to render those languages. An international standard has developed, which, though not yet comprehensive, already covers most of the written alphabets in the world; this standard is The Unicode Standard/Worldwide Character Encoding (Addison-Wesley, ISBN 0-201-56788-1). (Each of the publications discussed in this application is incorporated herein by reference.) Unicode provides an encoding for each letter, diacritic, tone mark, or other special character for the languages that it covers. Further information on Unicode may be found in G. Adams, Introduction to Unicode, Unicode Implementers Workshop (Aug. 6, 1992) by the Institute for Advanced Professional Studies, Cambridge, Mass. and in the proceedings of the Unicode Consortium/Unicode Implementers Workshops (Unicode, Inc. and Taligent), particularly the following workshop proceedings: Non-spacing Marks, Unicode Implementers Workshop #2 (Merrimack, N. H.; Mar. 12-13, 1993); and M. Davis, Strategies for Handling Non-spacing Marks and T. Yamasaki, Unicode on Print Servers, both from Unicode Implementers Workshop #3 (San Jose, Calif.; Aug. 6-7, 1992).
Another coding system available is UniversalString or 10646String, which includes the Universal Character Set Code ISO/IEC 10646, as described in ISO/IEC International Standard 10646-1 (1993), prepared by the ISO and IEC Joint Technical Committee ISO/IEC JTC1 under the general title Information technology--Universal Multiple-Octet Coded Character Set (UCS) (1993). The 10646 code set is in large part similar to Unicode's, and the shortcomings of the 10646 system are similar to those of Unicode. The discussion in this application pertains to both these and other such encoding systems.
When a computer system interprets a string of Unicode, 10646 or otherwise encoded characters, it performs a rendering process to display or print those characters. Three conventional rendering procedures use a kerning table, a look-up table and a ligature table, either separately or in some combination. The input to the rendering system is a stream of code points (i.e., the binary-coded representations of the characters), and the output is a glyph code for each input character code. A glyph is a representation of a character in a single display or print cell, and may be a combination of several potentially independent characters; for instance, following are seven different glyphs: EQU a a a a
The last of these (ä̲) is represented by three code points: the code point for the "a", the code point for the umlaut, and the code point for underlining. In current systems, these three code points are combined and a single glyph is displayed.
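The three-code-point representation described above can be illustrated with a minimal Python sketch. The specific code points used here (U+0061, U+0308 and U+0332, in that order) are an assumption based on present-day Unicode combining marks; the text above does not identify them. The helper name `describe` is likewise hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode character name) pairs for each code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Assumed encoding of the glyph: the base letter followed by two combining marks.
underlined_a_umlaut = "a\u0308\u0332"

for code_point, name in describe(underlined_a_umlaut):
    print(code_point, name)
```

A rendering system receives all three code points but is expected to display them in a single cell as one glyph.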
When a look-up table is used, the rendering system compares the code point(s) with those in the table; if the particular code point combination is found, then the output is simply the glyph found at that entry of the look-up table.
The rendering system may additionally check a ligature table, to form ligatures of particular combinations of letters. Many languages (such as Arabic) have quite a few ligatures; English has only a few ligatures, such as "ﬁ" for "fi", "ﬃ" for "ffi", and "ﬂ" for "fl". These ligatures in English are optional, while in other alphabets, the ligatures are a required feature of the written language. An analysis of computer treatment of rendering Arabic ligatures and similar problems is found in J. Becker, Multilingual Word Processing, Scientific American, July 1984 and in J. Becker, Arabic Word Processing, Communications of the ACM, July 1987 (vol. 30, number 7).
The rendering system may also check a kerning table, where it determines the separation of particular combinations of glyphs, i.e. the separation between characters as displayed or printed.
The above three systems can be used in combination to accommodate many languages. Latin-based alphabets are particularly simple to handle. However, many languages have complicated rules about combining letters, tone marks and other characters with one another, which are not well suited to these approaches.
Kerning and ligature tables are in most systems rather small, and unable to accommodate the thousands of possible combinations of characters that must be represented for even a single language; for instance, Thai has some 2700 possible character combinations, which would make a look-up table, a ligature table or a kerning table unacceptably large, and would occupy too much processor time to check each combination.
Similarly, Arabic letters can be combined in at least three different ways, having initial, medial and final forms, and some additionally have a fourth (isolated) form. These letters form complicated ligatures in the written language, with each of the different letter forms in general having a different shape. If a ligature table is built to accommodate them all, the table becomes very large, requiring many thousands of entries to store all the possible combinations of the 28-letter alphabet.
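The scale of such a table can be suggested with a rough count. The following Python sketch computes simple upper bounds only, under the assumption that every ordered sequence of two or three letters is a candidate table entry; in practice only some sequences form ligatures, so these figures are illustrative and not a measurement of any actual font. All names are hypothetical.

```python
ALPHABET_SIZE = 28        # letters in the alphabet, as stated above
MAX_FORMS_PER_LETTER = 4  # initial, medial, final and isolated forms


def sequences(length, alphabet_size=ALPHABET_SIZE):
    """Number of ordered letter sequences of the given length."""
    return alphabet_size ** length


positional_forms = ALPHABET_SIZE * MAX_FORMS_PER_LETTER
two_letter = sequences(2)
three_letter = sequences(3)

print(positional_forms, two_letter, three_letter)
```

Even before positional forms are taken into account, three-letter sequences alone number in the tens of thousands under this assumption.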
Other languages, such as Korean and Vietnamese, present similarly numerous and complex combinations of letters. Creating special tables specifically for languages with similar challenges hinders the standardization and size minimization of these tables, occupies too much memory, and requires a great deal of processor time for searching them. Thus, in a system that must process more than just a single written alphabet or character set--namely, virtually any system used for international purposes--it is not practical to use kerning and ligature tables with all possible combinations of letters in Arabic, Korean, Thai, Vietnamese, English, and so on. A workable international rendering system should be able to handle the variations in the display of characters in all of these systems without requiring a table entry for each combination of letters.
A different problem is presented when a user enters a character that is not specifically found in one of the tables. For instance, for some reason a user may wish to enter a ÿ (i.e. a "y" with an umlaut or diaeresis)--which may not be defined for a given system--or create some other character, such as a non-Latin alphabet character with a Latin-style accent. An example of the latter would be ก̂ (the Thai character "ko kai" with a circumflex on it), which is a combination that does not exist in any predefined alphabet. Such ad hoc characters cannot be handled by conventional systems, which, when encountering undefined characters, typically simply substitute a space or a default symbol for the unknown code points.
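The conventional fallback behavior described above can be sketched as follows. This is a minimal Python illustration, not the implementation of any particular system: the table contents, the glyph names and the function `render_conventional` are all hypothetical, and the code points chosen for the "ko kai" with circumflex (U+0E01 followed by U+0302) are an assumption based on present-day Unicode.

```python
# Hypothetical glyph look-up table: code point sequence -> glyph name.
GLYPH_TABLE = {
    (0x0079,): "y",
    (0x0E01,): "ko_kai",
    (0x0061, 0x0308): "a_diaeresis",
}

# Stand-in for the space or default symbol a conventional system substitutes.
DEFAULT_GLYPH = "missing"


def render_conventional(sequence):
    """Return the glyph for a code point sequence, or the default symbol."""
    return GLYPH_TABLE.get(tuple(sequence), DEFAULT_GLYPH)


print(render_conventional([0x0061, 0x0308]))  # predefined combination
print(render_conventional([0x0E01, 0x0302]))  # ad hoc combination
```

Because the ad hoc combination has no table entry, the user's intended glyph is lost even though each constituent code point is individually meaningful.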
A system is needed that provides for user-created glyphs that are not already defined in the system's tables, by analyzing the code points and rendering glyphs that match the intended characters as nearly as possible. This should be done without creating large tables of special characters. In particular, a system is needed that can accommodate such large numbers of character combinations as are found in Thai, Arabic, Korean, etc., while minimizing the sizes of the character tables such as ligature and kerning tables.
FIG. 1 shows a portion of a system for implementing conventional approaches to rendering characters. The rendering system 10 is an application resident in the memory of a central processing unit, and accesses a font resource 40 comprising kerning table 50, glyph look-up table 60 and ligature table 70. These tables are also stored in the memory. Characters encoded as code points 20 are input to the rendering system 10, which generates output glyphs 30 that depend upon how the code points map onto one or more of the tables 50, 60 and 70.
The code points 20, which are binary-coded representations of the characters, are input by a user or received from a file or other source of text. For Unicode, each code point constitutes a 16-bit (2-byte) word. The examples below will be in terms of Unicode, although any character encoding scheme may be used with the present invention.
Each of the three common procedures (corresponding to the tables 50-70) for handling incoming character streams has particular utility for certain languages. The rendering system matches the input code points to entries in the tables. For instance, the word "ﬁnds" may be input, which would be represented in Unicode by the following code points: U+0066 J U+0069 U+006E U+0064 U+0073. The code point for "f" is "U+0066", the "U+" indicating that this is a Unicode code point, "0066" being the hexadecimal representation of the letter. Next comes a "J", followed by the second letter, "i". The "J" here represents a special Unicode-represented character meaning "join", indicating that the two letters should be joined together in a ligature: fi (no ligature) becomes ﬁ (with ligature) for this example. The joining character "J", which is optional, might be generated automatically by the application in which the text element "finds" was originally produced, or it might be entered deliberately by the user. Following the "i" are the code points for the remainder of the word.
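The code point sequence for this word can be reproduced with a short Python sketch. The text above does not give a code point for the joining character "J"; the sketch assumes, purely for illustration, that the present-day ZERO WIDTH JOINER (U+200D) plays that role. The helper name `code_points` is hypothetical.

```python
# Assumption: ZERO WIDTH JOINER stands in for the joining character "J".
JOIN = "\u200d"


def code_points(text):
    """Format each character of text in the U+XXXX notation used above."""
    return [f"U+{ord(ch):04X}" for ch in text]


word = "f" + JOIN + "inds"
print(" ".join(code_points(word)))
# prints: U+0066 U+200D U+0069 U+006E U+0064 U+0073
```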
The rendering system 10 could be configured to handle this word using any of the tables 50, 60 and/or 70. For instance, it can first check the ligature table 70, then the look-up table 60, and finally the kerning table 50. A ligature such as "ﬁ" is likely to be stored only in a ligature table, but in other alphabets it is likely that combinations of letters would be stored in any of the ligature, look-up and kerning tables. This is the case, for instance, for Thai, where letter combinations including vowels or tone marks are numerous.
The ligature table 70 represents a possible set of code points (CP6, CP9, . . . , CPx, CPy, CPz) that have been selected because they represent examples of ligatures for the particular alphabet in question. For instance, CP6 might represent an "f", and CP9 an "i", so that CP6-CP6-CP9 is the code point representation for "ffi". The rendering system locates this sequence in the table 70, and thus, instead of outputting the sequence "ffi", substitutes a replacement glyph "ﬃ". Other specific cases are stored in the ligature table.
In the above example, "ﬁnds" might be analyzed by first looking at the look-up table, and locating the letter "f". Then the system consults the ligature table 70 to see if there are any ligatures beginning with "f", and locates an entry "CP6-CP9" (corresponding to the sequence 0066-0069), representing "fi". The joining character "J" indicates that a ligature is desired, so the glyph "ﬁ" is output.
The next code points, representing the string "nds", are located in the look-up table 60, which for each code point includes a glyph shape and a glyph width. Ultimately, all of the input code points 20 have been output as glyphs 30.
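The walk-through above, in which the ligature table 70 is consulted before the look-up table 60, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions and not the actual procedure of any conventional system: the table contents, glyph names and widths are invented, U+200D is assumed to stand in for the joining character "J", and the function `render` is hypothetical.

```python
JOIN = 0x200D  # assumption: stands in for the joining character "J"

# Hypothetical ligature table 70: joined code point sequence -> ligature glyph.
LIGATURE_TABLE = {
    (0x0066, 0x0069): "fi_ligature",
    (0x0066, 0x0066, 0x0069): "ffi_ligature",
    (0x0066, 0x006C): "fl_ligature",
}

# Hypothetical look-up table 60: code point -> (glyph shape, glyph width).
LOOKUP_TABLE = {
    0x0066: ("f", 6), 0x0069: ("i", 4), 0x006E: ("n", 10),
    0x0064: ("d", 10), 0x0073: ("s", 8),
}


def render(code_points):
    """Map input code points to output glyph names."""
    glyphs = []
    i = 0
    while i < len(code_points):
        if code_points[i] == JOIN:  # a join with no usable ligature is dropped
            i += 1
            continue
        # Collect the run of letters linked by join characters starting at i.
        run = [code_points[i]]
        j = i + 1
        while j + 1 < len(code_points) and code_points[j] == JOIN:
            run.append(code_points[j + 1])
            j += 2
        ligature = LIGATURE_TABLE.get(tuple(run)) if len(run) > 1 else None
        if ligature is not None:
            glyphs.append(ligature)
            i = j
        else:
            shape, _width = LOOKUP_TABLE.get(code_points[i], ("missing", 0))
            glyphs.append(shape)
            i += 1
    return glyphs


print(render([0x0066, JOIN, 0x0069, 0x006E, 0x0064, 0x0073]))
```

In this sketch a ligature is formed only when the joining character is present, which is one possible reading of the example; a system could equally be configured to form ligatures unconditionally.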
Alternatively or in addition, the code points may be found in the kerning table 50, which is designed to handle spacing between predefined sequences of characters. For example, the spacing between the "f" and the "i" would be determined by locating the sequence "CP6-CP9" in the kerning table (ignoring the joining character). The tables may be used in combination, with the look-up table returning the glyph shapes and widths, and the kerning table returning the inter-character spacing.
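The combined use of the look-up table and the kerning table can be sketched as follows. The widths, the kerning adjustment and the function `layout` are hypothetical values and names chosen only to illustrate the technique, and U+200D is again assumed to stand in for the joining character, which is ignored for kerning purposes as described above.

```python
JOIN = 0x200D  # assumption: stands in for the joining character "J"

# Hypothetical look-up table 60: code point -> (glyph shape, glyph width).
LOOKUP_TABLE = {
    0x0066: ("f", 6), 0x0069: ("i", 4), 0x006E: ("n", 10),
    0x0064: ("d", 10), 0x0073: ("s", 8),
}

# Hypothetical kerning table 50: code point pair -> spacing adjustment.
KERNING_TABLE = {(0x0066, 0x0069): -1}


def layout(code_points):
    """Return ((shape, x position) pairs, total width) for the input."""
    letters = [cp for cp in code_points if cp != JOIN]  # ignore join characters
    positions = []
    x = 0
    for index, cp in enumerate(letters):
        if index > 0:
            # Adjust spacing if the previous/current pair is in the kerning table.
            x += KERNING_TABLE.get((letters[index - 1], cp), 0)
        shape, width = LOOKUP_TABLE[cp]
        positions.append((shape, x))
        x += width
    return positions, x


positions, total = layout([0x0066, JOIN, 0x0069, 0x006E, 0x0064, 0x0073])
print(positions, total)
```

Here the look-up table supplies each glyph's shape and width, while the kerning table only moves the "i" closer to the "f"; neither table alters the shapes themselves.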
From the above, it will be seen that the glyphs that the rendering system can return are limited by the sizes of the tables. Moreover, for glyphs formed by combining or joining two or more characters, the width/spacing approach does not optimize the glyph shapes; for instance a capital Ü (U with umlaut) might appear simply as a capital U with an overstruck umlaut. While a ligature table provides shape optimization, as mentioned above none of these tables can accommodate the many thousands of existing possible letter combinations, much less the multitude of possible combinations that are not regularly used, but that a user might want to print for some special purpose (such as the ก̂ combination mentioned above). A system is needed for handling these special cases without unduly increasing the sizes of the tables.