To date, the most widely used code standard for alphanumeric characters has been ASCII (American Standard Code for Information Interchange), a 7-bit binary code standardized by ANSI (American National Standards Institute). Because the only letters ASCII supports are the English letters, its use in information processing and interchange environments has been limited to English. As a result, a large number of computer systems today communicate in the English language only.
In recent years, the computer industry has recognized the need to support the non-English Latin-based languages in order to facilitate communication with non-technical users, who are often familiar only with their native language. Hence, a new 8-bit multilingual character set was defined by ISO (International Organization for Standardization) in 1986. That set has already gained broad support from the industry and from various national standards organizations. The character set is named Latin Alphabet #1 and is documented in the ISO standard ISO 8859/1. It supports 14 Western European and Western Hemisphere languages that are used in 45 countries around the world.
The set of languages and characters supported by the ISO standard ISO 8859/1, "Information Processing--8-bit single-byte coded graphic character sets--Part 1: Latin Alphabet #1", is believed to include most of those that are used in North America, Western Europe and the Western Hemisphere. The languages are listed below:
Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. These languages are believed to be used in at least the following countries:
______________________________________
Argentina       Finland         Panama
Australia       France          Paraguay
Austria         Germany         Peru
Belgium         Guatemala       Portugal
Belize          Guyana          El Salvador
Bolivia         Honduras        Spain
Brazil          Iceland         Surinam
Canada          Ireland         Sweden
Chile           Italy           Switzerland
Colombia        Liechtenstein   The Netherlands
Costa Rica      Luxembourg      UK
Cuba            Mexico          USA
Denmark         New Zealand     Uruguay
Ecuador         Nicaragua       Venezuela
Faroe Islands   Norway
______________________________________
Returning now to the ASCII Character set, the main advantage embodied by the English language with regard to sorting is that the alphabetical order of the letters in the English alphabet corresponds to the internal numerical collating sequence in the ASCII set. This special feature makes the sorting of English language strings relatively simple and in most cases efficient.
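The correspondence between code order and alphabetical order can be illustrated with a minimal Python sketch. It assumes that case is folded before the codes are compared, and it relies on the fact that Python's code points for these characters coincide with the ASCII and Latin Alphabet #1 values. The function name and the sample words are illustrative only.

```python
def caseless_less(a: str, b: str) -> bool:
    """True if character a collates before b once case is folded."""
    return ord(a.upper()) < ord(b.upper())

words = ["apple", "Zebra", "mango"]

# Raw code order: every upper-case letter precedes every lower-case letter.
raw_order = sorted(words)                      # ['Zebra', 'apple', 'mango']

# Folding case first makes code order agree with English alphabetical order.
caseless_order = sorted(words, key=str.upper)  # ['apple', 'mango', 'Zebra']

# The same comparison breaks down for letters outside the ASCII range:
# the code of "É" (201) is larger than the code of "Z" (90).
mixed_order = sorted(["zebra", "école", "eagle"], key=str.upper)
# ['eagle', 'zebra', 'école']
```

The last result places "école" after "zebra", which is not where a French reader would look for it.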
For example, to sort two characters, the following operations are performed:

(1) Convert both characters to the same case (i.e. the characters become caseless).

(2) Compare the codes (ordinal values) of the two characters directly to determine their relative sort order. The character whose ordinal value is smaller is collated first (in ascending order sorting).

Some limitations of this method of sorting, based upon the ASCII character set, include the following:

(1) Most, if not all, sorting algorithms published so far assume that the underlying character set is the 7-bit ASCII set (or in some rare cases the EBCDIC set), which does not support foreign letters. As a result, these algorithms cannot properly sort most non-English Latin-based languages.

(2) The existing sorting methods for English and other languages cannot sort properly when foreign letters are included. This situation does not arise if the computer system uses national character sets, which contain only the letters of their own languages. However, the problem of dealing with foreign letters in sorting does come up when 8-bit character sets are supported, since those sets contain more letters than are used domestically.

(3) The existing algorithms cannot properly handle sorting in a multilingual environment in which information from the same database can be accessed by users using different languages.

To handle multilingual sorting properly, the following issues should be addressed.

(1) The collating sequence of letters in the Latin Alphabet #1 (or any other multilingual set) does not correspond to the alphabetical order of the letters in all the supported languages. This means sorting can no longer rely on the collating sequence imposed by the character set.

(2) The main idea of sorting in a multilingual environment is to have data sorted in the user's own language. The stored data does not necessarily have to be in the user's language and, in fact, can be made up of several different languages. Hence, a sorting operation is needed that is capable of supporting different sorting orders dependent on the users' languages. For example, the letter "Ä" is sorted after "Z" in Swedish whereas it is sorted the same as an "A" in German.

(3) In some languages, there are cases where letters with different internal representations are sorted as if they had the same representation (e.g. "V" and "W" in Swedish are collated the same). This creates a difficulty for any approach that uses the internal representation as a means to tackle the sorting problem.

(4) The sorting software should be able to collate foreign letters correctly among the domestic letters. This kind of transliteration is definitely language dependent.

To give a better picture of the problems when multilingual character sets are supported, the sort orders of four illustrative languages are outlined below. Language 1 might be English, language 2 might be Swedish, language 3 might be German and language 4 might be French. Letters which have the same alphabetical order are enclosed in braces. Note that priority rules apply to those letters which are enclosed in braces and differ only by accent.

(1) Language 1:
Lower case: a b c d e f g h i j k l m n o p q r s t u v w x y z
Upper case: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

(2) Language 2:
Lower case: a b c d {e é} f g h i j k l m n o p q r s t u {v w} x {y ü} z å ä ö
Upper case: A B C D {E É} F G H I J K L M N O P Q R S T U {V W} X {Y Ü} Z Å Ä Ö

(3) Language 3:
Lower case: {a ä} b c d e f g h i j k l m n {o ö} p q r s ß t {u ü} v w x y z
Upper case: {A Ä} B C D E F G H I J K L M N {O Ö} P Q R S ß T {U Ü} V W X Y Z

(4) Language 4:
Lower case: {a à â} b {c ç} d {e é è ê ë} f g h {i î ï} j k l m n {o ô} p q r s t {u ù û ü} v w x y z
Upper case: {A À Â} B {C Ç} D {E É È Ê Ë} F G H {I Î Ï} J K L M N {O Ô} P Q R S T {U Ù Û Ü} V W X Y Z

Multilingual sorting is subject to the following requirements:

(1) Characters that do not appear in a language should be sorted where users of that language might be expected to look for them.

(2) In all cases, all punctuation marks and non-alphanumeric characters except blank are to be ignored if they appear among numerics and alphabetics (e.g. Ada/Bobby Co. is sorted as AdaBobby Co). If the name contains only punctuation marks and/or non-alphanumeric characters, then those characters should be preserved (e.g. ***, [*], /*/, etc.). In this case, these non-alphanumeric characters are ordered before the digits and letters.

(3) Sorting operations must support one-to-two substitutions for some characters. For example, the "ß" from language 3 is sorted as though it were "ss" in language 1.

(4) Sorting operations must support two-to-one substitutions for some characters. For example, in Spanish the letter pairs "ch" and "ll" are sorted as if they were single letters; they are sorted between "cz" and "d" and between "lz" and "m" respectively.

(5) Sorting operations must support accent priority. This means accented and non-accented letters are given different ordering when all the letters in the strings being compared are equal except for the accents (e.g. the unaccented "Ellen" is collated before an accented form such as "Éllen"). For example, in English the "a" vowels (with or without accent) are treated as equal except for priority. Their priority order places the unaccented A first, followed by its accented forms and ending with Å. Note that priority among accents might vary between different languages.

(6) The sort orders among non-alphanumeric characters (punctuation marks and symbols) are not expected to be language dependent. Hence, the language dependency of sorting is determined by letters and accents.
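The requirements discussed in this section can be made concrete with a rough Python sketch of a language-dependent sort key. It is not the method claimed here, only one possible reading under stated assumptions: strings are compared first by letter position and then, as a tie-break, by accent priority; the brace notation above is parsed directly; the Spanish order is a simplified, hypothetical table that adds only "ch" and "ll"; and all names (parse_order, build, element, sort_key) are invented for illustration.

```python
import unicodedata

SYMBOL, DIGIT, LETTER, OTHER_LETTER = 0, 1, 2, 3
FOREIGN = 99  # accent priority given to accented letters foreign to a language

def parse_order(spec):
    """Parse 'a b {e é} ...' into groups of letters that share one position."""
    groups, current = [], None
    for tok in spec.replace("{", " { ").replace("}", " } ").split():
        if tok == "{":
            current = []
        elif tok == "}":
            groups.append(current)
            current = None
        elif current is not None:
            current.append(tok)
        else:
            groups.append([tok])
    return groups

def build(spec, expand=None):
    """Return (letter -> (position, accent priority), one-to-two expansions)."""
    primary = {}
    for pos, group in enumerate(parse_order(spec)):
        for prio, elem in enumerate(group):
            primary[elem] = (pos, prio)
    return primary, expand or {}

def element(ch, primary):
    if ch in primary:
        pos, prio = primary[ch]
        return (LETTER, pos), prio
    base = unicodedata.normalize("NFD", ch)[0]
    if base in primary:  # foreign accented letter: file it under its base letter
        return (LETTER, primary[base][0]), FOREIGN
    if ch.isdigit():
        return (DIGIT, ord(ch)), 0
    if ch.isalpha():
        return (OTHER_LETTER, ord(ch)), 0
    return (SYMBOL, ord(ch)), 0  # blanks and symbols precede digits and letters

def sort_key(text, lang):
    primary, expand = lang
    s = text.lower()
    # Ignore punctuation among letters and digits; keep it if nothing else is left.
    kept = "".join(ch for ch in s if ch.isalnum() or ch == " ")
    if kept.strip():
        s = kept
    s = "".join(expand.get(ch, ch) for ch in s)  # one-to-two, e.g. ß -> ss
    letters, accents, i = [], [], 0
    while i < len(s):
        pair = s[i:i + 2]
        width = 2 if len(pair) == 2 and pair in primary else 1  # two-to-one
        elem, prio = element(s[i:i + width], primary)
        letters.append(elem)
        accents.append(prio)
        i += width
    return tuple(letters), tuple(accents)

AZ = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
ENGLISH = build(AZ, expand={"ß": "ss"})
SWEDISH = build("a b c d {e é} f g h i j k l m n o p q r s t u {v w} x {y ü} z å ä ö")
GERMAN = build("{a ä} b c d e f g h i j k l m n {o ö} p q r s ß t {u ü} v w x y z")
SPANISH = build(AZ.replace("c d", "c ch d").replace("l m", "l ll m"))  # simplified

words = ["zebra", "äpple", "apple"]
swedish_sorted = sorted(words, key=lambda w: sort_key(w, SWEDISH))
# ['apple', 'zebra', 'äpple']
german_sorted = sorted(words, key=lambda w: sort_key(w, GERMAN))
# ['apple', 'äpple', 'zebra']
spanish_sorted = sorted(["mano", "llave", "luz", "dama", "chico", "cuna"],
                        key=lambda w: sort_key(w, SPANISH))
# ['cuna', 'chico', 'dama', 'luz', 'llave', 'mano']
```

The same stored words come out in a different order for each user language, which is the behaviour issue (2) asks for. A production implementation would need complete tables for every supported language and a decision on how accent priority interacts with case, neither of which is attempted here.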