1. Field of the Invention
This invention relates to the technologies of computer displays and the interpretation of files and data for display on a computer. This invention especially relates to the technologies of bi-directional display methods for displaying portions of data which require orientation from left-to-right and from right-to-left to support various international character sets and languages.
2. Description of the Related Art
Prior to the introduction of rich encoding schemes such as Unicode and ISO10646, most text streams consisted of characters originating from a single script. Traditionally an encoding was comprised of one national script plus a subset of the Latin script (ASCII 7) which fit within the confines of an 8 bit character type. In such an environment, presentation of text is a relatively trivial matter.
For the most part, the order in which a program stores its characters (logical order) is equivalent to the order in which they are visually presented (display order). Thus, there is a direct correlation between the logical order and display order. Exceptions to this rule include scripts which are written from right to left, such as Arabic, Hebrew, Farsi, Urdu, and Yiddish.
One existing method to solve this problem is to require computer users, such as computer programmers or web browser users, to enter characters in display order. This is no problem for users of left-to-right languages. However, for users of right-to-left languages, this requires the user to enter the characters and words in “reverse order”. For example, to create a text stream containing Arabic characters, the user must enter them backwards.
This solution is not elegant, and it becomes cumbersome when right-to-left and left-to-right scripts are intermixed, creating bi-directional scripts.
Another solution known in the art is to allow users to enter text in logical order, but to require them to use explicit directional formatting codes within the script, for example, 0x202B (right-to-left embedding) and 0x202A (left-to-right embedding) in Unicode, for segments of text that run contrary to the base text direction. While this is acceptable in some instances, it has problems in practice as well. First, it is undefined what a computer should do with the explicit control codes in tasks other than displaying the script. This may cause problems when these formatting codes are received by searching algorithms, or when they are interchanged between systems.
These explicit formatting codes require specific code points to be set-aside for them, as well. In some encodings, this may be unacceptable due to the fixed number of code points available and the number of code points required to represent the script itself.
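The searching problem described above can be illustrated with a minimal Python sketch. The sample text, the search string, and the helper name strip_explicit_codes are hypothetical and chosen only for illustration; removing the codes before searching is one possible policy, not a behavior defined by Unicode.

```python
import unicodedata

# Explicit directional formatting codes: left-to-right embedding,
# right-to-left embedding, and pop directional formatting.
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

# Hypothetical logical-order text: Latin base text with an embedded
# right-to-left segment (Hebrew alef, bet, gimel) wrapped in RLE ... PDF.
text = "abc " + RLE + "\u05D0\u05D1\u05D2" + PDF + " def"

EXPLICIT_TYPES = {"LRE", "RLE", "LRO", "RLO", "PDF"}

def strip_explicit_codes(s: str) -> str:
    """Drop embedding and override codes (an assumed policy for searching)."""
    return "".join(
        ch for ch in s if unicodedata.bidirectional(ch) not in EXPLICIT_TYPES
    )

# A search string that spans the end of the embedded segment.
needle = "\u05D2 def"
print(needle in text)                        # False: PDF sits between the letters
print(needle in strip_explicit_codes(text))  # True once the codes are removed
```

A plain substring search fails only because an invisible code sits inside the match, which is the kind of undefined interaction the paragraph above refers to.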
Ideally, a system of encoding mixed direction scripts would maintain the flexibility of entering characters in logical order while still achieving the correct visual appearance and display order. Such algorithms do exist, and are called “implicit layout algorithms”.
Implicit layout algorithms require no explicit directional codes nor any higher order protocols. These algorithms can automatically determine the correct visual layout by simply examining the logical text stream. Yet in certain cases correct layout of a text stream may still remain ambiguous. Consider the following example in TABLE 1 in which Arabic letters are represented by upper case Latin characters.
TABLE 1
Ambiguous layout
fred does not believe TAHT YAS SYAWLA I
In the absence of context, such as a base or paragraph direction, there are two possible ways to display the sentence. When displayed from left to right, it appears as “Fred does not believe I always say that”, and when displayed from right to left, it appears as “I always say that Fred does not believe”. As evident from this example, the two interpretations can represent completely different meanings, and may give no clue whatsoever that there has been an error in the display of the script.
The Unicode Bi-directional Algorithm rectifies such problems by providing a mechanism for unambiguously determining the visual representation of all raw streams of Unicode text. The algorithm is based upon existing implicit layout algorithms and is supplemented by the addition of explicit directional control codes.
Generally the Unicode implicit rules are sufficient for the layout of most text streams. However, there are cases in which the Unicode algorithm may give inappropriate or inaccurate display results. For example, consider a telephone number appearing in a stream of Arabic letters: “MY NUMBER IS (321)713-0261.” This should not be rendered as a mathematical expression, as shown in the incorrect display of TABLE 2. As demonstrated, without knowledge of how the numbers are used in this context, the correct display cannot be determined.
TABLE 2
Rendering numbers
Incorrect display:  0261-713 (321) SI REBMUN YM
Correct display:    (321) 713-0261 SI REBMUN YM
Various implementations of the Unicode Bi-directional Algorithm have been proposed in technical reports, such as Unicode Technical Report #9, including “Pretty Good Bidi Algorithm” (PGBA), “Free Implementation of the Bidi Algorithm” (FriBidi), “IBM Classes for Unicode” (ICU), Java 1.2, Unicode Java Reference, and Unicode C Reference.
Currently, there exist two reference implementations of the Unicode Bidirectional algorithm, one in Java and the other in C, as well as printed textual descriptions contained in technical reports such as Unicode Technical Report #9.
Upon our testing of the reference implementations of the Unicode Bidirectional algorithm on a large number of concise and carefully crafted test cases of basic bidirectional text, several problems and ambiguous results were found.
To simulate Arabic and Hebrew input/output, a simple set of rules can be utilized. These rules make use of characters from the Latin-1 character set. The character mappings allow Latin-1 text to be used instead of real Unicode characters for Arabic, Hebrew, and control codes. This is an enormous convenience in writing, reading, running and printing the test cases. This form is the same as the one used by the Unicode Bidirectional Reference Java Implementation, as shown in TABLE 3.
Unfortunately not all the implementations adhere to these rules in their test cases. To compensate for this, changes were made to some of the implementations.
TABLE 3
Bidirectional character mappings
Type   Arabic   Hebrew   Mixed   English
L      a–z      a–z      a–z     a–z
AL     A–Z               A–M
R               A–Z      N–Z
AN     0–9               5–9
EN              0–9      0–4     0–9
LRE    [        [        [       [
LRO    {        {        {       {
RLE    ]        ]        ]       ]
RLO    }        }        }       }
PDF    ^        ^        ^       ^
NSM    ~        ~        ~       ~
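The character mappings of TABLE 3 can be sketched as a small Python lookup. The function name bidi_type and the fallback to type ON for any character not listed in the table (including upper case letters under the English map) are assumptions made for this illustration; they are not taken from any of the tested implementations.

```python
# Control-code stand-ins shared by all four character maps in TABLE 3.
CONTROLS = {"[": "LRE", "{": "LRO", "]": "RLE", "}": "RLO", "^": "PDF", "~": "NSM"}

def bidi_type(ch: str, charmap: str = "Mixed") -> str:
    """Return the simulated bidirectional type of one Latin-1 test character."""
    if ch in CONTROLS:
        return CONTROLS[ch]
    if "a" <= ch <= "z":
        return "L"
    if "A" <= ch <= "Z":
        if charmap == "Arabic":
            return "AL"
        if charmap == "Hebrew":
            return "R"
        if charmap == "Mixed":
            return "AL" if ch <= "M" else "R"
    if "0" <= ch <= "9":
        if charmap == "Arabic":
            return "AN"
        if charmap == "Mixed":
            return "EN" if ch <= "4" else "AN"
        return "EN"  # Hebrew and English maps
    # Assumption: anything TABLE 3 does not list is reported as a neutral.
    return "ON"

print([bidi_type(c, "Mixed") for c in "A~a~"])  # ['AL', 'NSM', 'L', 'NSM']
```

With such a lookup, a test case written in plain Latin-1 text can be turned into a sequence of bidirectional types without any real Arabic or Hebrew characters.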
In the Unicode C reference implementation, additional character mapping tables were added to match those of the Unicode Java Reference implementation. Also the bidirectional control codes were remapped from the control range 0x00–0x1F to the printable range 0x20–0x7E. This remapping allowed test results to be compared more easily.
In PGBA and FriBidi, the character attribute tables were modified to match the character mappings outlined in TABLE 3. However, the strategy we used for evaluation of ICU and Java was slightly different. In the ICU and Java test cases, the character types are used rather than a character mapping. So, in places where our test cases required a specific type, that type was simply used rather than a character mapping.
The test cases employed are presented in TABLES 4 through 7. The “source” column of each table shows the test case script input and a test case number, and the “expected” column sets forth what the correct display order output should have been.
TABLE 4
Arabic Charmap Tests
No. Source                                       Expected
1   car is THE CAR in arabic                     car is RAC EHT in arabic
2   CAR IS the car IN ENGLISH                    HSILGNE NI the car SI RAC
3   he said “IT IS 123, 456, OK”                 he said “KO ,456 ,123 SI TI”
4   he said “IT IS (123, 456), OK”               he said “KO ,(456 ,123) SI TI”
5   he said “IT IS 123,456, OK”                  he said “KO ,123,456 SI TI”
6   he said “IT IS (123,456), OK”                he said “KO ,(123,456) SI TI”
7   HE SAID “it is 123, 456, ok”                 “it is 123, 456, ok” DIAS EH
8   <H123>shalom</H123>                          <123H/>shalom<123H>
9   HE SAID “it is a car!” AND RAN               NAR DNA “!it is a car” DIAS EH
10  HE SAID “it is a car!x” AND RAN              NAR DNA “it is a car!x” DIAS EH
11  −2 CELSIUS IS COLD                           DLOC SI SUISLEC −2
12  SOLVE 1*5 1 − 5 1/5 1 + 5                    5 + 1 5/1 5 − 1 5*1 EVLOS
13  THE RANGE IS 2.5..5                          5..2.5 SI EGNAR EHT
14  IOU $10                                      10$ UOI
15  CHANGE - 10%                                 %10- EGNAHC
16  - 10% CHANGE                                 EGNAHC %10-
17  he said “IT IS A CAR!”                       he said “RAC A SI TI!”
18  he said “IT IS A CAR!X”                      he said “X!RAC A SI TI”
19  (TEST) abc                                   abc (TSET)
20  abc (TEST)                                   abc (TSET)
21  #@$ TEST                                     TSET $@#
22  TEST 23 ONCE abc                             abc ECNO 23 TSET
23  he said “THE VALUES ARE 123, 456, 789, OK”   he said “KO ,789 ,456 ,123 ERA SEULAV EHT”.
24  he said “IT IS A bmw 500, OK.”               he said “A SI TI bmw KO ,500.”
TABLE 5
Hebrew Charmap Tests
No. Source                           Expected
1   HE SAID “it is 123, 456, ok”.    “it is 123, 456, ok” DIAS EH
2   <H123>shalom</H123>              <123H/>shalom<123H>
3   <h123>SAALAM</h123>              <h123>MALAAS</h123>
4   −2 CELSIUS IS COLD               DLOC SI SUISLEC −2
5   -10% CHANGE                      EGNAHC -10%
6   TEST ~~~23%%% ONCE abc           abc ECNO 23%%%~~~ TSET
7   TEST abc ~~~23%%% ONCE abc       abc ECNO abc ~~~23%%% TSET
8   TEST abc@23@cde ONCE             ECNO abc@23@cde TSET
9   TEST abc 23 cde ONCE             ECNO abc 23 cde TSET
10  TEST abc 23 ONCE cde             cde ECNO abc 23 TSET
11  Xa 2 Z                           Z a 2X
TABLE 6
Mixed Charmap Tests
No. Source      Expected
1   A~~         ~~A
2   A~a~        a~~A
3   A1          1A
4   A 1         1 A
5   A~1         1~A
6   1           1
7   a1          a1
8   N 1         1 N
9   A~~ 1       1 ~~A
10  A~a1        a1~A
11  N1          1N
12  a1          a1
13  A~N1        1N~A
14  NOa1        a1ON
15  1/2         1/2
16  1, 2        1, 2
17  5, 6        5, 6
18  A1/2        2/1A
19  A1, 5       1, 5A
20  A1, 2       1, 2A
21  1, .2       1, .2
22  1, A2       2A, 1
23  A5, 1       5, 1A
24  +$1         +$1
25  1+$         1 + $
26  5 + 1       5 + 1
27  A + $1      1$ + A
28  A1 + $      $ + 1A
29  1 + /2      1 + /2
30  5+          5+
31  +$          +$
32  N + $1      +$1N
33  +12$        +12$
34  a/1         a/1
35  1, 5        1, 5
36  +5          +5
TABLE 7
Explicit Override Tests
No. Source        Expected
1   a}}}def       afed
2   a}}}DEF       aFED
3   a}}}defDEF    aFEDfed
4   a}}}DEFdef    afedFED
5   a{{{def       adef
6   a{{{DEF       aDEF
7   a{{{defDEF    adefDEF
8   a{{{DEFdef    aDEFdef
9   A}}}def       fedA
10  A}}}DEF       FEDA
11  A}}}defDEF    FEDfedA
12  A}}}DEFdef    fedFEDA
13  A{{{def       defA
14  A{{{DEF       DEFA
15  A{{{defDEF    defDEFA
16  A{{{DEFdef    DEFdefA
17  ^^abc         abc
18  ^^}abc        cba
19  }^abc         abc
20  ^}^abc        abc
21  }^}abc        cba
22  }^{abc        abc
23  }^^}abc       cba
24  }}abcDEF      FEDcba
All implementations were tested by using the test cases from TABLES 4 through 6. The implementations that support the Unicode directional control codes (LRO, LRE, RLO, RLE, and PDF) were further tested using the test cases from TABLE 7. At this time, the directional control codes are only supported by ICU, Java 1.2, Unicode Java reference, and Unicode C reference.
When the results of the test cases were compared, the placement of directional control codes and choice of mirrors was ignored. This is permitted as the final placement of control codes is arbitrary and mirroring may optionally be handled by a higher order protocol.
TABLES 8–10 detail the test result differences among the implementations with respect to the expected results. Only PGBA, FriBidi and the Unicode C implementations returned results that were different from the expected results; the Unicode Java reference, Java 1.2, and ICU passed all test cases.
TABLE 8a
Arabic Test Differences for PGBA 2.4
4   he said “KO ,)456 ,123( SI TI”
6   he said “KO ,)123,456( SI TI”
12  1 + 5 1/5 1 − 5 5*1 EVLOS
14  $10 UOI
15  %-10 EGNAHC
16  EGNAHC %-10
19  abc )TSET(
24  he said “A SI TI bmw 500, KO.”
TABLE 8b
Arabic Test Differences for FriBidi 1.12
2   SI RAC the car NI ENGLISH
7   “ok ,456 ,123 it is” DIAS EH
8   <123H>shalom</123H>
9   DIAS EH “it is a car!” DNA RAN
10  DIAS EH “it is a car!x” DNA RAN
11  -SI SUISLEC 2 COLD
15  10- EGNAHC%
16  -10% CHANGE
19  (TSET) abc
21  #@$ TEST
22  ECNO 23 TSET abc
TABLE 8c
Arabic Test Differences for Unicode C Reference
7   “ok ,456 ,123 it is” DIAS EH
11  DLOC SI SUISLEC 2–
12
TABLE 9Hebrew Test DifferencesPGBA 2.4FriBidi 1.125EGNAHC %- 106abc ECNO %%%23~~~TSET7abc ECON %%%23~~~abc TSET11Z 2 aXa 2X
TABLE 10Mixed test differencesPGBAFriBidi 1.121A~~2~a~A~Aa~101a~A~Aa1141a~A181/2A1/2A195.1A212, 1231, 5A27+$1A281 + $A3215N355, 1
In the PGBA reference implementation, types AL and R are treated as being equivalent. This in itself does not present a problem as long as the data stream is free of AL and EN (European number). However, a problem arises when an AL is followed by an EN, as in test case 18 from TABLE 6. In this situation, the ENs should be treated as ANs (Arabic number) and not left as ENs.
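The expected behavior can be sketched in Python, following rule W2 of the Unicode Bidirectional Algorithm: for each EN, the nearest preceding strong type (L, R, AL, or the start-of-run type) is found, and if it is AL, the EN becomes AN. The function name apply_w2 and the list-of-type-names representation are assumptions of this sketch, which is not code from any of the tested implementations.

```python
def apply_w2(types, sor="L"):
    """Change EN to AN wherever the nearest preceding strong type is AL."""
    out = list(types)
    last_strong = sor  # start-of-run type, used until a strong type is seen
    for i, t in enumerate(out):
        if t in ("L", "R", "AL"):
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            out[i] = "AN"
    return out

# Test case 18 of TABLE 6, "A1/2", under the Mixed map: AL EN ES EN.
print(apply_w2(["AL", "EN", "ES", "EN"]))  # ['AL', 'AN', 'ES', 'AN']
# If AL is first folded into R, the numbers stay EN.
print(apply_w2(["R", "EN", "ES", "EN"]))   # ['R', 'EN', 'ES', 'EN']
```

The second call shows why treating AL and R as equivalent before this step loses the information the rule depends on.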
The handling of NSM is also different in PGBA. PGBA treats NSM as being equal to ON (other neutral). This delays the handling of NSM until the neutral type resolution phase rather than in the weak type resolution phase. By delaying their handling, the wrong set of rules is used to resolve the NSM type. For example, in test case 2 from TABLE 6 the last NSM should be treated as type L instead of type R.
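A minimal Python sketch of the weak-phase handling, following rule W1 of the Unicode Bidirectional Algorithm, in which each NSM takes the type of the character before it (or the start-of-run type when it comes first). The function name apply_w1 and the type-name lists are assumptions of this sketch.

```python
def apply_w1(types, sor):
    """Give each NSM the type of the preceding character (sor at the start)."""
    out = []
    prev = sor
    for t in types:
        if t == "NSM":
            t = prev  # inherit the already-resolved previous type
        out.append(t)
        prev = t
    return out

# Test case 2 of TABLE 6, "A~a~", under the Mixed map: AL NSM L NSM.
print(apply_w1(["AL", "NSM", "L", "NSM"], sor="R"))  # ['AL', 'AL', 'L', 'L']
```

Resolved this way, the final NSM follows the L before it, whereas resolving it later as a neutral would make it depend on the surrounding direction instead.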
There are a few problems with the FriBidi implementation, as well. Specifically, when an AL is followed by an EN, the EN is not being changed to type AN. See test case 18 in TABLE 6. This is the same symptom as was found in PGBA, but the root cause is different. In FriBidi's implementation of step W2 (weak processing phase, rule two), the wrong type is being examined: it should be type EN instead of type N. Additionally, there is a problem in determining the first strong directional character. The only types that are recognized as having a strong direction are types R and L. Type AL should also be recognized as a strong directional character. For example, when test case 1 from TABLE 6 is examined, FriBidi incorrectly determines that there are no strong directional characters present. It then proceeds to default the base direction to type L when it should actually be of type R. This problem also causes test cases 2, 9, and 11 from TABLE 4 to fail.
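The base direction problem can be illustrated with a short Python sketch of first-strong detection. The function name paragraph_level and its rtl_types parameter are hypothetical; the second call merely reproduces the symptom described above and is not FriBidi's actual code.

```python
def paragraph_level(types, rtl_types=("R", "AL")):
    """Return 0 (left-to-right) or 1 (right-to-left) from the first strong type."""
    for t in types:
        if t == "L":
            return 0
        if t in rtl_types:
            return 1
    return 0  # no strong character found: default to left-to-right

# Test case 1 of TABLE 6, "A~~", under the Mixed map: AL NSM NSM.
case = ["AL", "NSM", "NSM"]
print(paragraph_level(case))                    # 1: AL counts as strong
print(paragraph_level(case, rtl_types=("R",)))  # 0: AL overlooked, defaults to L
```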
The greatest hindrance to the creation of a method for converting logical data streams to display streams lies in the problem description. The problem of bidirectional layout is ill defined with respect to the input(s) and output(s).
Certainly the most obvious input is the data stream itself. Several situations require additional input in order to correctly determine the output stream. For example, in Farsi, mathematical expressions are written left to right, while in Arabic they are written right to left. This may require a special sub-input (a directional control code) to appear within the stream for proper handling to occur. If it becomes necessary to use control codes to obtain the desired results, the purpose of an algorithm becomes unclear.
The situation becomes even cloudier when one considers other possible inputs (paragraph levels, line breaks, shaping, directional overrides, numeric overrides, etc.). Are these to be treated as separate inputs? If they are treated as being distinct, when, where and how should they be used? Determining the output(s) is not simple either. The correct output(s) is largely based on the context in which an algorithm is used. If an algorithm is used to render text, then appropriate outputs might be a glyph vector and a set of screen positions. On the other hand, if an algorithm is simply being used to determine character reordering, then an acceptable output might just be a reordered character stream.
The Unicode Bidirectional algorithm has gone through several iterations over the years. The current textual reference has been greatly refined. Nevertheless, we believe that there is room for improvement. Implementing a bidirectional layout algorithm is not a trivial matter even when one restricts an implementation to just reordering. Part of the difficulty can be attributed to the textual description of the algorithm. Additionally, there are areas that require further clarification.
As an example, consider step L2 of the Unicode Bidirectional Reference Algorithm. It states the following: “From the highest level found in the text to the lowest odd level on each line, reverse any contiguous sequence of characters that are at that level or higher.” This has more than one possible interpretation. It could mean that once the highest level has been found and processed, the next level for processing should be one less than the current level. It could also be interpreted as meaning that the next level to be processed is the next lowest level actually present in the text, which may be more than one below the current level. It was only through an examination of Unicode's Java implementation that we were able to determine the answer.
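The two readings of step L2 can be compared with a short Python sketch. The function name reorder_line, the every_level flag, and the hand-assigned embedding levels are assumptions made for illustration. The default follows the decrement-by-one reading, which later revisions of Unicode Standard Annex #9 state explicitly by including intermediate levels not actually present in the text; the passage above does not itself say which reading was confirmed.

```python
def reorder_line(text, levels, every_level=True):
    """Reverse runs at each level from the highest down to the lowest odd level."""
    chars, lv = list(text), list(levels)
    odd = [l for l in lv if l % 2 == 1]
    if not odd:
        return text  # only even levels: nothing is right-to-left
    highest, lowest_odd = max(lv), min(odd)
    if every_level:
        steps = range(highest, lowest_odd - 1, -1)  # step down by one
    else:
        # Alternative reading: visit only levels that occur in the text.
        steps = sorted({l for l in lv if l >= lowest_odd}, reverse=True)
    for level in steps:
        i = 0
        while i < len(chars):
            if lv[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and lv[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]  # reverse the run in place
            lv[i:j] = lv[i:j][::-1]        # keep levels aligned with characters
            i = j
    return "".join(chars)

# Test case 1 of TABLE 4, with levels assigned by hand (0 = LTR, 1 = RTL).
src = "car is THE CAR in arabic"
print(reorder_line(src, [0] * 7 + [1] * 7 + [0] * 10))  # car is RAC EHT in arabic

# Levels 1 and 3 occur but level 2 does not, so the two readings disagree.
print(reorder_line("abcde", [1, 1, 3, 3, 1]))                     # edcba
print(reorder_line("abcde", [1, 1, 3, 3, 1], every_level=False))  # ecdba
```

The last two lines show that the choice of interpretation changes the visible result whenever a level is skipped, which is why the wording of the step matters.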
There are also problems concerning the bounds of the Unicode Bidirectional Algorithm. In the absence of higher order protocols, it is not always possible to perform all the steps of the Unicode Bidirectional Algorithm. In particular, step L4 requires mirrored characters to be depicted by mirrored glyphs if their resolved directionality is R. However, glyph selection requires knowledge of fonts and glyph substitution tables. One possible mechanism for avoiding glyph substitutions is to perform mirroring via character substitutions. In this approach, mirrored characters are replaced by their corresponding character mirrors. In most situations this approach yields the same results. The only drawback occurs when a mirrored character does not have its corresponding mirror encoded in Unicode. For example, the square root character (U+221A) does not have its corresponding mirror encoded.
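Mirroring by character substitution, together with its limitation, can be sketched in Python. The MIRROR table is a deliberately small, hand-written subset of mirror pairs and the function name mirror_rtl is hypothetical; a complete implementation would presumably load the full pair list from the Unicode data files. The standard-library call unicodedata.mirrored reports whether a character has the mirrored property.

```python
import unicodedata

# Assumption: a tiny illustrative subset of mirror pairs, not the full list.
MIRROR = {"(": ")", ")": "(", "<": ">", ">": "<",
          "[": "]", "]": "[", "{": "}", "}": "{"}

def mirror_rtl(text, levels):
    """Substitute mirror characters at odd (right-to-left) levels.

    Returns the substituted text and the mirrored characters for which
    no substitute was available in MIRROR.
    """
    out, unresolved = [], []
    for ch, level in zip(text, levels):
        if level % 2 == 1 and unicodedata.mirrored(ch):
            if ch in MIRROR:
                ch = MIRROR[ch]
            else:
                unresolved.append(ch)  # would need a mirrored glyph instead
        out.append(ch)
    return "".join(out), unresolved

# Parentheses around a square root sign, all resolved right-to-left.
result, missing = mirror_rtl("(\u221A)", [1, 1, 1])
print(result == ")\u221A(", missing == ["\u221A"])  # True True
```

The square root sign is reported as unresolved because it carries the mirrored property but has no encoded counterpart to substitute, so a font-level mirrored glyph would still be required.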
Such situations have placed developers in a quandary. One solution is to use the implementations (Java and C) as a reference. But these implementations don't agree in every case. Furthermore the implementations have different goals. The Java implementation follows the textual reference closely while the C implementation offers performance improvements.
However, if computer source code is to be used as a reference design, then source code that is more attuned to describing these types of methods and algorithms is required. The limited flexibility, extensibility, and understandability of the imperative-language references make them inadequate for this purpose.
For example, using the imperative language reference, it matters what character encoding one uses (UCS4, UCS2, or UTF8). In “C”, the sizes of types are not guaranteed to be portable, making C unsuitable as a reference. In the Java reference implementation, the ramifications of moving to UCS4 are unclear.
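The dependence on character encoding can be illustrated with a brief Python sketch. The sample text and both function names are hypothetical. Reversing a run by code points behaves the same whatever the storage encoding, while reversing the same run by 16-bit code units splits a surrogate pair.

```python
def reverse_code_points(s: str) -> str:
    """Reverse a run one whole character (code point) at a time."""
    return s[::-1]

def reverse_utf16_units(s: str) -> bytes:
    """Reverse the same run one 16-bit code unit at a time."""
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units))

# Hebrew alef followed by a supplementary-plane letter (Phoenician block),
# which needs two 16-bit code units.
text = "\u05D0\U00010900"

# Code points, UTF-8 bytes, UTF-16 code units.
print(len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le")) // 2)  # 2 6 3

print(reverse_code_points(text) == "\U00010900\u05D0")  # True

try:
    reverse_utf16_units(text).decode("utf-16-le")
except UnicodeDecodeError:
    print("code-unit reversal produced ill-formed UTF-16")
```

A reordering method defined over abstract characters avoids this class of error, because the storage width never enters the description.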
Therefore, there is a need in the art for a new reference method for bidirectional text script interpretation for display, which avoids the errors in interpretation of the existing references, as well as provides a framework upon which future, improved models may be realized. Preferably, the new method should separate details that are not directly related to the method such that text and character reordering is completely independent from character encoding.