One of the greatest challenges in the globalization of computer technologies is properly handling the numerous written languages used in different parts of the world. Languages may differ greatly in the symbols they use and in their grammatical structures, and supporting most, if not all, of them in the various forms of computer data processing can be a daunting task. One critical step taken to facilitate this support is to provide a standardized coding system that assigns a unique number to every symbol in every language. This coding system, called Unicode, has been widely adopted by leaders of the computer industry and is supported in many operating systems, modern browsers, and many other products.
A fundamental operation on textual strings consisting of symbols of a given language is collation, which may be defined as sorting the strings according to an ordering of the symbols that is culturally correct to users of that language. Collation is used any time a user orders linguistic data, or searches for linguistic data in a logical fashion, within the structure of the given language. Collation is a rather complex matter and requires an in-depth understanding of the language. For example, an English speaker expects a word starting with the letter “Q” to sort after all words beginning with the letter “P” and before all words beginning with the letter “R”. As another example, in the Chinese language used in Taiwan, the Chinese block characters are often sorted according to their pronunciations under the “bopomofo” phonetic system as well as by the number of strokes in the characters. Proper sorting of the symbols also has to take into account variations on the symbols. Common examples of such variations include the casing (upper or lower) of the symbols and modifiers (diacritics, Indic matras, vowel marks) applied to the symbols.
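The gap between raw code-point order and a culturally correct order can be sketched in a few lines. The two-level weight scheme below (strip diacritics and fold case for the primary comparison, then break ties on the decomposed original) is a deliberately simplified illustration, not a real locale's collation rules:

```python
import unicodedata

def collation_key(word):
    """Simplified two-level sort key: primary = base letters with
    diacritics stripped and case folded; secondary = the decomposed
    original form, which breaks ties on accents and casing."""
    decomposed = unicodedata.normalize("NFD", word)
    primary = "".join(c for c in decomposed
                      if not unicodedata.combining(c)).casefold()
    return (primary, decomposed)

words = ["Zebra", "apple", "Émile", "eagle"]

# Raw code-point order puts every uppercase letter before every
# lowercase one, and accented capitals after "z".
print(sorted(words))                     # ['Zebra', 'apple', 'eagle', 'Émile']
print(sorted(words, key=collation_key))  # ['apple', 'eagle', 'Émile', 'Zebra']
```

A production collator would of course need many more levels and per-locale tailoring; the point here is only that the culturally expected order cannot be obtained from code-point values alone.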
The operation of collation is further complicated by the existence, in many languages, of special groupings of linguistic symbols that have to be treated as single “sort elements” for purposes of linguistically correct sorting. For instance, in Hungarian, “DZS” is a unique combination that is sorted after “DZ” and before “E.” Such special groupings of symbols into sorting elements are conventionally referred to as “compressions” (not to be confused with the usage of “compression” in the context of data size reduction). They are also sometimes referred to as linguistic “characters.” Within a given language, there may be several levels of compression (i.e., different numbers of symbols per compression). The highest compression level varies from language to language, and compressions as high as 8-to-1 are used in Bengali and Tibetan. The existence of compressions makes linguistic sorting more complicated, because for a given input string the sorting program has to determine whether some of the symbols in the string form a compression in order to sort the string properly. In other words, the sorting program has to recognize the language-dependent sort elements in the string. To further complicate the matter, some languages have large numbers of compressions. For instance, Tibetan and Chinese have about 10,000 and 30,000 compressions, respectively, that represent supplemental characters. Since the compressions have to be checked during a sorting operation to identify the sort elements in a textual string, the existence of a large number of compressions can make the sorting operation very time consuming.
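Recognizing such sort elements can be sketched as a greedy longest-match scan over the input string. The element list and weight table below are an illustrative, incomplete rendering of Hungarian-style primary ordering (c < cs < d < dz < dzs < e, and so on), not actual locale data:

```python
# Illustrative Hungarian-style primary ordering; incomplete and simplified.
ORDER = ["a", "b", "c", "cs", "d", "dz", "dzs", "e", "f", "g", "gy",
         "h", "i", "j", "k", "l", "ly", "m", "n", "ny", "o", "p", "r",
         "s", "sz", "t", "ty", "u", "v", "z", "zs"]
WEIGHT = {elem: i for i, elem in enumerate(ORDER)}

# Multi-symbol sort elements ("compressions"), tried longest first so
# that "dzs" is not mistaken for "dz" followed by "s".
COMPRESSIONS = sorted((e for e in ORDER if len(e) > 1), key=len, reverse=True)

def sort_elements(word):
    """Split a word into its language-dependent sort elements."""
    word = word.casefold()
    out, i = [], 0
    while i < len(word):
        for elem in COMPRESSIONS:
            if word.startswith(elem, i):
                out.append(elem)
                i += len(elem)
                break
        else:               # no compression matched: single symbol
            out.append(word[i])
            i += 1
    return out

def collation_key(word):
    return [WEIGHT[e] for e in sort_elements(word)]

print(sort_elements("dzsem"))                        # ['dzs', 'e', 'm']
print(sorted(["csak", "cukor"]))                     # ['csak', 'cukor']
print(sorted(["csak", "cukor"], key=collation_key))  # ['cukor', 'csak']
```

The last two lines show why recognition matters: by raw symbol comparison “csak” precedes “cukor,” but once “cs” is treated as one sort element weighted after plain “c,” the culturally correct order reverses. A real collator would also need secondary and tertiary weights and a fallback for symbols outside the table.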
The need to properly handle compressions becomes increasingly important as developers of computer software programs try to add support for many new languages that are more complex than those already supported. One significant difficulty encountered by the developers is that the existing framework for collation is unable to accommodate the much more complex compressions, or the large numbers of compressions, used in the new languages. For instance, operating systems have traditionally supported languages with compression levels no greater than 3-to-1, and the number of compressions in a given language has typically been quite small, a few tens at most. The new languages to be supported, however, use compression levels up to 8-to-1, and some of them have tens of thousands of compressions. The existing framework for providing the collation functionality, having been developed to handle much lower compression levels and much smaller numbers of compressions, cannot cope with the compressions presented by the new languages. Moreover, attempts to extend the existing architecture would likely result in unmaintainable code that is complex and difficult to debug. Accordingly, there is a need for a new architecture for providing collation functionality that can effectively support the new languages.
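To make the scale problem concrete: probing a flat list of tens of thousands of compressions at every string position is what makes the naive approach slow. One conventional remedy, shown here purely as an illustrative sketch (it is not the architecture this document goes on to describe), is to hold the compressions in a trie, so that finding the longest match at a position costs time proportional to the match length rather than to the size of the compression table:

```python
class TrieNode:
    """One node of a compression trie; is_element marks the end of a
    complete sort element."""
    def __init__(self):
        self.children = {}
        self.is_element = False

def build_trie(compressions):
    root = TrieNode()
    for elem in compressions:
        node = root
        for symbol in elem:
            node = node.children.setdefault(symbol, TrieNode())
        node.is_element = True
    return root

def longest_element(root, text, start):
    """Longest compression in `text` beginning at `start`; falls back
    to the single symbol there when no compression matches."""
    node, end, i = root, start + 1, start
    while i < len(text) and text[i] in node.children:
        node = node.children[text[i]]
        i += 1
        if node.is_element:
            end = i
    return text[start:end]

# Tiny sample set; a real table could hold tens of thousands of entries
# without changing the per-position lookup cost.
trie = build_trie(["dz", "dzs", "cs", "sz"])
print(longest_element(trie, "dzsem", 0))  # 'dzs'
print(longest_element(trie, "dzeta", 0))  # 'dz'
print(longest_element(trie, "dal", 0))    # 'd'
```

Even with such a lookup structure, the surrounding collation framework still has to represent the higher compression levels and their weights, which is the architectural problem identified above.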