There is a need for collation of distinct sets of information in a processing system. As an example, the computer industry has become increasingly internationalized over the past decades. This expansion outside of the borders of the United States has been driven both by the increasing technological sophistication of foreign countries and by the growth of large-scale computer networks over which information is transferred by private individuals and multinational corporations alike. The increased global use of computer systems, and especially personal computer systems, has led to the widespread sale of U.S.-developed operating systems which were originally developed for users who understand the English language.
For many "Made in the USA" systems, the software may only be able to handle the English letters A-Z and a-z, and may not be able to handle characters in other languages. Additionally, numeric and monetary formats typically use American conventions. Software written for a United States vendor often exhibits behavior that is biased toward the English language because that is what is hard-coded into the program logic. This usually is fine for American users, but it is not acceptable for many computer users around the world.
Americans are not alone in their tendency to produce software that is biased toward a particular culture. A German software package may produce program messages in German only. A Japanese package may handle Japanese text easily, but be unable to process other languages. Throughout the world, programmers write software that addresses local requirements. The problem comes when the users are not local.
While software that is biased toward a particular culture has never been an ideal solution for international users, several trends are making it less and less acceptable. These trends range from economic considerations to changes in system functionality to the ever-increasing use of computers in everyday life. While most computer programmers produce software that is tuned to their local needs, many computer companies sell into much more than local markets. Indeed, they may do business all over the world. Software that is biased toward a particular culture is particularly troublesome when it is desirable to connect sites from various markets. Computer networks are stretching to include different cities and even countries. However, this new functionality cannot work correctly if the different sites have made conflicting changes in software to meet local users' needs.
Code sets have developed in an effort to address part of this problem. The most popular standard sets are the ISO 8859 series. ISO 8859-1 (Latin-1) covers Western European languages; ISO 8859-2 covers Eastern European languages; ISO 8859-3 covers Southeastern European languages; ISO 8859-4 covers Northern European languages; ISO 8859-5 covers English & Cyrillic-Based languages; ISO 8859-6 covers English & Arabic; ISO 8859-7 covers English & Greek; ISO 8859-8 covers English & Hebrew; ISO/IEC 8859-9 covers Western European & Turkish; and ISO/IEC 8859-10 covers Danish, English, Estonian, Faroese, Finnish, German, Greenlandic, Icelandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
Most code sets and encoding methods each support one language or a group of related languages. However, this method is insufficient if the required blend of languages is more exotic. For example, the combination of French and Arabic--a common mix in Northern Africa--is a problem because one requires ISO 8859-1 (Latin-1), while the other requires ISO 8859-6. A partial solution has been an effort to combine all characters into a universal code set. The idea of a universal set is to combine every character for all commonly used scripts and languages, as well as all the symbols one would need, in one large code set called Unicode. Unicode is explained in The Unicode Standard, Worldwide Character Encoding, Version 1.0, Volume 1, the Unicode Consortium, Addison-Wesley Publishing Company, Inc., 1990. For further background information regarding internationalization issues in programming, see Sandra Martin O'Donnell, Programming for the World, A Guide to Internationalization, PTR Prentice Hall, 1994.
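The French/Arabic mismatch described above can be checked directly. The following is a minimal Python sketch, assuming Python's standard codec names `iso8859_1` and `iso8859_6`; the helper `encodable` and the sample words are illustrative assumptions and do not come from the source.

```python
def encodable(text, codec):
    """Return True if every character of `text` can be encoded in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample words (assumptions, not taken from the source text).
french = "caf\u00e9"                  # "cafe" with an acute accent on the e
arabic = "\u0643\u062a\u0627\u0628"   # an Arabic word (kaf, teh, alef, beh)
mixed = french + " " + arabic

for codec in ("iso8859_1", "iso8859_6", "utf-8"):
    print(codec,
          encodable(french, codec),
          encodable(arabic, codec),
          encodable(mixed, codec))
```

Under these assumptions, each ISO 8859 part accepts only its own language's sample, and only the universal encoding accepts the mixed string.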
The need for a system to facilitate various languages causes special problems for collation and comparison. In dealing with the field of collation and comparison, it is perhaps as important to identify the problems and organize an approach to the problems as it is to solve the problems themselves.
To facilitate an understanding of the related art, a brief discussion of the terms used in the field is helpful. In the world of computer standards, "collation" is usually used to refer to language-dependent ordering of strings, while "comparison" is generally used to refer to simple non-language-dependent ordering (e.g. by code order). Collation or comparison can be applied to a "string", which is a sequence of characters. A "script" is a complete repertoire of related characters (usually letters), while "pseudo scripts" include non-letter items such as punctuation, symbols, and digits. A system used for writing will use subsets of combinations of scripts and pseudo scripts. A "repertoire of characters" is a subset of different scripts. "Ignorables" include, but are not limited to, items such as hyphens and spaces which are mostly insignificant for collation.
"Levels of significance" indicate the order of the different levels of inequality that the system checks. For instance, the first level may be an identification that "a" is different from "b"; the second level may be that "a" is different from an accented "a" such as "à"; the third level may be that "a" is different from "A"; the fourth level may be to identify differences between two strings which include ignorables.
"Expansion" refers to single characters which must be sorted as two or more characters; an example is "æ" being sorted as "a", "e" (unless the language treats it as a single letter, as Danish does). "Contraction" refers to multiple letters being treated as a single letter; an example is "ch" or "ll" being treated as a single unit in Spanish. A "text element" is a grouping of characters for a particular text process such as collation. Finally, a "diacritic" is a mark added to a letter that usually provides information about pronunciation or the stress that should be given to a syllable. Examples include accents and diaereses.
At first glance, collation may seem a simple task: given some sorting order for characters, walk through two strings to be compared until non-identical characters are found, then order the strings by a sort order of those characters. In fact, collation is much more complex. Even proper English sorting for a typical 8-bit character set (such as Latin-1) involves three levels of significance, ignorable characters, and expansion of some characters into multiple elements.
To illustrate the complexity of international collation issues, an overview of some collation issues for different languages is helpful.
Latin/Roman script languages
The first column below shows how a dictionary would sort the following words; the second column shows the results of a naive sort based on Latin 1 code order.
______________________________________
Dictionary      Single level ordering (computer)
______________________________________
ça              Cooper
coop            Coors
co-op           co-op
Cooper          co-opt
co-opt          coop
coordinate      coordinate
Coors           o'er
ŒDIPUS          z
o'er            ça
z               ŒDIPUS
______________________________________
The problem is not just that the code values for the characters are not assigned in proper collating order; there is virtually no possible assignment of characters to collating positions that will produce the correct result with a single-level ordering. What is needed is a multi-level ordering with ignorables and expansion: First try to order based on primary differences (c ≠ d); if there are no primary differences, then consider secondary differences (c ≠ ç); if there are no secondary differences either, then consider tertiary differences (c ≠ C). In addition, certain characters should be ignored completely (e.g. `-`, `'`) unless they are the only difference between words. Finally, some characters should be expanded (at the primary level, `Œ` should be treated as `OE`).
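The multi-level ordering just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular system: the names `collation_key`, `IGNORABLES`, and `EXPANSIONS` are hypothetical, accents are detected through Unicode canonical decomposition, lowercase is assumed to sort before uppercase at the tertiary level, and ignorables are used only as a final tie-breaker.

```python
import unicodedata

IGNORABLES = {"-", "'"}                        # skipped at levels 1-3
EXPANSIONS = {"\u0152": "OE", "\u0153": "oe"}  # the OE ligature expands to two letters


def collation_key(word):
    """Build a four-level sort key: letters, accents, case, ignorables."""
    primary, secondary, tertiary, quaternary = [], [], [], []
    for ch in unicodedata.normalize("NFC", word):
        quaternary.append(ch in IGNORABLES)    # level 4: where ignorables occur
        if ch in IGNORABLES:
            continue
        for unit in EXPANSIONS.get(ch, ch):
            # Split a letter into its base and any combining accent marks.
            base, *marks = unicodedata.normalize("NFD", unit)
            primary.append(base.lower())       # level 1: base letter
            secondary.append("".join(marks))   # level 2: accents
            tertiary.append(base.isupper())    # level 3: case (lower first)
    return (primary, secondary, tertiary, quaternary)


WORDS = ["Cooper", "Coors", "co-op", "co-opt", "coop", "coordinate",
         "o'er", "z", "\u00e7a", "\u0152DIPUS"]  # last two: c-cedilla + a, OE-ligature + DIPUS

naive = sorted(WORDS)                          # plain code point order
multilevel = sorted(WORDS, key=collation_key)  # dictionary-style order
print(naive)
print(multilevel)
```

Because Python compares tuples element by element, a later level is consulted only when all earlier levels are equal, which is the behavior the paragraph above calls for.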
French adds an interesting twist. When processing accents as secondary differences, strings are compared from the end to the beginning. This produces the differences shown below:
______________________________________
Incorrect (compare accents from start)    Correct (compare accents from end)
______________________________________
cote                                      cote
cote                                      cote
cote                                      cote
peche                                     peche
peche                                     peche
______________________________________
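The effect of comparing accents from the end can be sketched as follows. This is a minimal Python illustration; the function name `accent_key` and the four sample words are assumptions chosen for the example and are not a reconstruction of the table above. Accents are again detected through Unicode canonical decomposition.

```python
import unicodedata


def accent_key(word, backward):
    """Two-level key: base letters first, then accents (optionally reversed)."""
    letters, accents = [], []
    for ch in unicodedata.normalize("NFC", word):
        base, *marks = unicodedata.normalize("NFD", ch)
        letters.append(base)
        accents.append("".join(marks))
    if backward:
        accents.reverse()   # French rule: the last accent difference decides
    return (letters, accents)


# Illustrative words: "cote" with a circumflex on the o and/or an acute on the e.
WORDS = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]

from_start = sorted(WORDS, key=lambda w: accent_key(w, backward=False))
from_end = sorted(WORDS, key=lambda w: accent_key(w, backward=True))
print(from_start)
print(from_end)
```

With these sample words, the two middle entries swap places between the two orderings, while the unaccented and doubly accented forms stay first and last.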
In other languages, more than one character may be treated as a single unit for collation: traditional Spanish sorting treats `ch` as a single letter that comes after `c`, and treats `ll` as a single letter that comes after `l`; it also treats `ñ` as a unique letter that comes after `n`. This can produce sorting like: cz, ch, da, lz, ll, ma, na, nz, ña. In some languages, letters with diacritics have a sorting position completely different from the letter without diacritics. In Danish, for example, the following are treated as letters that sort after z: æ, ø, å.
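A contraction can be sketched as a tokenizer that groups letters before ranking them. The following Python sketch assumes a simplified traditional Spanish alphabet; the names `TRADITIONAL_ALPHABET`, `RANK`, and `spanish_key` are hypothetical, and the input is assumed to contain only precomposed letters from that alphabet.

```python
# Simplified traditional Spanish alphabet, with "ch", "ll", and n-tilde
# treated as separate letters (an assumption for this sketch).
TRADITIONAL_ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
RANK = {unit: position for position, unit in enumerate(TRADITIONAL_ALPHABET)}


def spanish_key(word):
    """Map a word to a list of ranks, grouping 'ch' and 'll' as single units."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        unit = pair if pair in ("ch", "ll") else word[i]
        key.append(RANK[unit])   # raises KeyError for characters outside the alphabet
        i += len(unit)
    return key


words = ["\u00f1a", "ch", "ll", "cz", "da", "lz", "ma", "na", "nz"]
print(sorted(words, key=spanish_key))
```

Sorting these nine strings with `spanish_key` reproduces the order given in the paragraph above: `cz` precedes `ch`, `lz` precedes `ll`, and the n-tilde entry follows `nz`.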
More sophisticated sorting may treat "St." as Saint or Street depending on context, may treat McConnell as MacConnell, etc. This can require some semantic analysis.
Japanese
Main body text in Japanese typically intermixes Hiragana (phonetic syllable characters) and Kanji (Chinese characters). The pronunciation of the Kanji depends on context--how they are being used. For example: ##STR1##
The Kanji are underlined; the other characters in large print are Hiragana. The Kanji should be sorted as if they were replaced by the Hiragana characters that represented their pronunciation; in this example these Hiragana characters are shown above the corresponding Kanji. A romanized version of the pronunciation is shown below.
One important point is that the Kanji character in the fourth and sixth positions is pronounced differently in the two places it is used in the above example, and sorting should use the correct pronunciation in each case. Most Kanji characters have multiple pronunciations that depend on context. This requires either saving the correct pronunciation of each Kanji when it is first entered, or performing a morphological analysis on the text to determine the correct pronunciation if saved phonetic information is not available.
When two different Kanji have the same pronunciation, a secondary sorting rule is used: the Kanji are sorted by radical and/or stroke (see description of Chinese sorting below).
Japanese also uses Katakana phonetic syllable characters. The Katakana set includes a vowel extender character; different sorting variants may either (1) treat this as an ignorable character or (2) treat this as if it were the Katakana character that represents the vowel of the preceding Katakana character: ##STR2##
Chinese
Unlike Kanji in Japanese, the standard Chinese character sorting only depends on information that can be easily derived from each character. No grammatical analysis is necessary for Chinese for collation purposes.
Korean
Korean is mainly written using a set of alphabetic characters called Jamos. These are grouped into Hangul syllables that consist of a simple or complex leading consonant (choseong), a vowel (jungseong), and optionally a simple or complex trailing consonant (jongseong). Each syllable is usually written as a single block containing its constituent jamos. Korean text may be encoded using only Jamos, or using codes for composed Hangul syllables, or both. Hangul syllables are compared as units according to their constituent jamos. The leading consonant has primary significance; the vowel has secondary significance; and a trailing consonant (if present) has tertiary significance.
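For precomposed Hangul syllables, the constituent jamos can be recovered arithmetically. The sketch below assumes the Unicode Hangul syllable block (starting at U+AC00, with 19 leading consonants, 21 vowels, and 28 trailing positions including "none"); the names `decompose` and `hangul_key` are hypothetical, and the three-level key simply follows the significance order stated above.

```python
S_BASE = 0xAC00      # first precomposed Hangul syllable in Unicode
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT


def decompose(syllable):
    """Return (leading consonant, vowel, trailing consonant) indices."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)   # trail == 0 means no trailing consonant
    return lead, vowel, trail


def hangul_key(text):
    """Key with leading consonants primary, vowels secondary, trailing tertiary."""
    parts = [decompose(ch) for ch in text]
    return (tuple(p[0] for p in parts),
            tuple(p[1] for p in parts),
            tuple(p[2] for p in parts))


print(decompose("\ud55c"))   # the syllable "han"
print(sorted(["\ub098", "\uac01", "\uac00"], key=hangul_key))
```

This handles only precomposed syllables; text encoded as individual jamos, or mixed with Hanja, would need additional handling not shown here.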
Korean text may also include some Chinese characters, called Hanja in Korean. These are typically compared using one of the standard Chinese methods: radical-stroke, stroke-radical, etc.
Arabic
Most Arabic words are derived from three-consonant roots that represent a general concept of an action or state. Various nouns, verbs, and other words related to this general concept are derived by changing vowels, doubling consonants, adding prefixes or suffixes, etc. Short vowels are generally not written; they are inferred from the context (when they are written, as in religious or children's literature, they are written as marks above and below the main text; however, there are also other marks which are not vowels).
The following example shows some words derived from the k-t-b root, which has to do with writing (the short vowels in "tib", "kit", and "kataba" are not written): ##STR3##
The primary level of significance for sorting is the three-consonant root. To find this from source text may require morphological analysis to strip articles, normalize inflected forms, and reduce words to their root. Short vowels are ignored in this phase, except for the information they may contribute to morphological analysis.
If there are no differences at this primary level, additional levels of significance may consider the original text, short vowels (which may be filled in by morphological analysis), etc.
Thai
For purposes of collation, Thai can be considered as a sequence of consonant clusters consisting of a consonant, an optional vowel, and an optional tone mark. Vowels are either leading vowels (which occur before the consonant) or trailing vowels (which occur after the consonant). Thai strings should be compared cluster by cluster. For each comparison, the consonant has primary significance, the vowel is secondary, and the tone mark is tertiary.
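The cluster-by-cluster comparison can be sketched as follows. This is a deliberately simplified Python illustration: the names `clusters` and `thai_key` are hypothetical, the character classes are assumed from the Unicode Thai block (consonants U+0E01 to U+0E2E, leading vowels U+0E40 to U+0E44, tone marks U+0E48 to U+0E4B), and every other character is treated as a trailing vowel. Real Thai collation has further rules that are not modeled here.

```python
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")


def is_consonant(ch):
    return "\u0e01" <= ch <= "\u0e2e"


def clusters(text):
    """Split text into (consonant, vowel, tone) clusters."""
    result = []
    leading = ""
    for ch in text:
        if ch in LEADING_VOWELS:
            leading = ch                     # written before its consonant
        elif is_consonant(ch):
            result.append([ch, leading, ""])
            leading = ""
        elif result and ch in TONE_MARKS:
            result[-1][2] = ch
        elif result:
            result[-1][1] += ch              # anything else: trailing vowel
    return [tuple(c) for c in result]


def thai_key(text):
    """Consonants primary, vowels secondary, tone marks tertiary."""
    parts = clusters(text)
    return (tuple(p[0] for p in parts),
            tuple(p[1] for p in parts),
            tuple(p[2] for p in parts))


word_a = "\u0e44\u0e01\u0e48"   # leading vowel + first consonant + tone mark
word_b = "\u0e02\u0e32"         # second consonant + trailing vowel
print(sorted([word_a, word_b]))                 # code order
print(sorted([word_a, word_b], key=thai_key))   # cluster order
```

In code order the word that begins with a leading vowel sorts last, because leading vowels have higher code values than consonants; in cluster order it sorts first, because its consonant is the first letter of the alphabet.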
Indic scripts
The Indic scripts are also collated using consonant clusters, as with Thai. Tibetan, Burmese, and Khmer (and possibly others) have some additional complexities.
Character encoding issues
Latin letters with diacritics may be encoded in several ways, depending on the character set: as single composed characters (e.g. Latin-1), as a base letter followed by combining diacritical marks (e.g. Unicode), as a base letter preceded by combining diacritical marks (e.g. ISO 6937), or as some combination of these.
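The consequence for comparison can be seen with a single accented letter. The sketch below uses Python's standard `unicodedata.normalize`; the helper name `same_text` is hypothetical, and Unicode normalization is used here only as one illustrative way to reconcile composed and decomposed forms.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as one composed character
decomposed = "e\u0301"   # base letter e followed by a combining acute accent


def same_text(a, b):
    """Compare two strings after converting both to composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw code-by-code comparison
print(same_text(composed, decomposed))  # comparison after normalization
```

The raw comparison reports a difference even though both strings display as the same letter; only the normalized comparison treats them as equal.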
The routines used in the current systems in the art cannot generally compare strings in different character sets in a meaningful way.
Existing collation systems typically support only three levels of significance, but many languages have more. Grouping of at least three characters is common in many languages, and Indic scripts may require larger groupings. On the other hand, a single ligature character may need to be broken into many component characters for comparison. A system capable of performing these various functions is needed.
The current systems in the art do not facilitate collecting various pieces of information from multiple locations to produce a desired collation order. They typically require the collation information of a particular language or region to be located in a single location. This requirement poses a serious problem in modern times as the need for language systems such as Unicode grows. Since Unicode contains information regarding virtually every character used in virtually every language, the current systems in the art would require one huge databank of collation information to access it. However, since no one source could reasonably compile accurate information regarding all collation orders of all languages, this presents a serious limitation.
Accordingly, what is needed is a system and method for accurate and efficient collation for distinct sets of information in a processing system. More particularly, what is needed is a system and method for accurate and efficient collation for a wide variety of languages. The present invention addresses such a need.