As computer systems are deployed into more environments around the world, it becomes imperative that they support a wider range of interface languages. Many natural languages have complex sets of rules for displaying correctly formatted written text. Furthermore, these rules (and their corresponding languages) may be unfamiliar to the developers who produce text-handling software, and to the testers who confirm that the software is operating correctly.
When the written text is largely composed of familiar letter shapes and obeys rules that are not exceedingly complicated, development and testing can proceed relatively smoothly. For example, a German speaker may not be able to read Norwegian, but the alphabets share most characters, and written texts use common left-to-right character placement. Therefore, the German speaker could easily examine a sample of Norwegian text and pick out characters that seem to be oddly shaped or out of place.
As the letter shapes become less familiar, and as writing rules become more complicated, it becomes increasingly difficult for someone who cannot read the language to work on systems that process it. For example, the Cyrillic alphabet contains many characters unfamiliar to readers of Latin-based alphabets, although most of the typesetting rules are unexceptional. East Asian languages contain even more unfamiliar characters, and may also be written in vertical columns arranged from right to left.
Even further from left-to-right Latin alphabets along the spectrum of written text, Arabic and Indic languages use unfamiliar characters, often written right-to-left, and require that many characters be displayed differently depending on their position within a word (e.g., at the beginning, middle or end). Character shapes may also change according to neighboring characters. In typesetting parlance, these position- and neighbor-dependent shapes are called “contextual forms.” In a simple example familiar to readers of Latin-based alphabets, when the two characters ‘f’ and ‘i’ occur next to each other, they may be replaced with the single combined form ‘ﬁ,’ as shown in the word “ﬁnger.” (Note that the cross-bar of the ‘f’ is connected to the ‘i,’ and the dot of the ‘i’ has disappeared.) Similarly, in German, when two ‘s’ characters appear in succession, they are sometimes replaced by the single glyph ‘ß:’ “grüßen” (to greet). However, in Arabic and Indic languages, contextual forms may depend on five, six or more surrounding characters, and the replacement form may bear little resemblance to the succession of glyphs corresponding to the individual characters standing alone. Languages with complex typographical rules include (without limitation): Assamese, Bengali, Bodo, Dogri, Gujarati, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu, Urdu and Hindi.
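As a small illustration of contextual forms, the following Python sketch uses the standard `unicodedata` module to inspect how Unicode encodes the ‘f’ + ‘i’ ligature and the positional forms of one Arabic letter (BEH, U+0628). It is only an illustration under stated assumptions: the helper name `contextual_variants` is hypothetical, and the compatibility code points it inspects are a legacy encoding of these shapes. A real rendering engine would normally select glyphs through font shaping rules rather than through these code points.

```python
import unicodedata

# The single code point U+FB01 encodes the combined 'f' + 'i' shape.
LIGATURE_FI = "\ufb01"
ligature_name = unicodedata.name(LIGATURE_FI)
# Compatibility normalization splits the ligature back into two letters.
ligature_expanded = unicodedata.normalize("NFKC", LIGATURE_FI)


def contextual_variants(base_char, candidates):
    """Hypothetical helper: map a positional tag (e.g. 'initial') to the
    presentation-form character that decomposes to exactly base_char."""
    variants = {}
    for ch in candidates:
        # Example decomposition string: "<initial> 0628"
        parts = unicodedata.decomposition(ch).split()
        if (
            len(parts) == 2
            and parts[0].startswith("<")
            and chr(int(parts[1], 16)) == base_char
        ):
            variants[parts[0].strip("<>")] = ch
    return variants


BEH = "\u0628"  # ARABIC LETTER BEH
# Scan the Arabic Presentation Forms-B block (U+FE70 to U+FEFF).
beh_forms = contextual_variants(BEH, (chr(cp) for cp in range(0xFE70, 0xFF00)))

for tag, ch in sorted(beh_forms.items()):
    print(f"{tag:9s} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same letter thus maps to four distinct shapes depending on where it falls in a word, which is the kind of variation a reader unfamiliar with the script would struggle to check by eye.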
It can be very difficult for someone who has not learned to read one of these languages to examine text written in the language and pick out even the most egregious typesetting errors. Therefore, development and testing of complex text layout systems may be time-consuming and expensive. A system to assist or automate text rendering engine testing may be of considerable value.