The present invention generally relates to text processing applications in computer systems, and more particularly to detecting layout of bidirectional (BIDI) text.
Most written languages, such as those using the Latin, Cyrillic, or Greek scripts, are written from left to right (LTR). However, some languages, such as Arabic, Hebrew, Urdu, and Farsi (Persian), are written from right to left (RTL). When a text includes both LTR segments and RTL segments, each segment should be written in its own direction, thus forming bidirectional text, also known as “BIDI” text. A computer system with BIDI support can display texts of different languages on the same page and on the same line, even if the languages have different text directionalities.
However, BIDI rules are very complex, and different software products usually do not implement the same rules. Indeed, the same text can contain two or more kinds of text having different writing directions, and texts having different writing directions can be embedded in one another, even at multiple levels. A BIDI document can also contain special text such as dates, numbers, and formulae.
Historically, BIDI data stored on legacy systems (e.g., mainframe systems) was in what is called a “visual” layout: that is, the data was stored in memory in the same order in which it is shown on displays (usually terminals or printers). This had the advantage that no special processing was needed to format the data for presentation, since it was already in presentation form. Since the data only existed on a single platform, it did not matter what form was used. As processing power moved closer to end users, the new personal computer systems turned to storing BIDI data in what is called a “logical” layout. This means that the data is stored in memory in the order in which it is typed, not the order in which it is displayed. This has the advantage that BIDI data can be processed like non-BIDI data (i.e., searching, sorting, and parsing can be done using the same modules used with non-BIDI data). In order to display BIDI data stored in logical layout, the system renders the data for presentation, which is usually done using BIDI Layout Engines (for text environments) or BIDI Layout Engines embedded in fonts (for graphical environments). Since the data only exists on the personal computer, it does not matter what form is used.
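The difference between the two layouts can be illustrated with a minimal Python sketch. It assumes the simplest possible case: a run made up only of RTL letters, with no digits, Latin text, combining marks, or mirrored punctuation, displayed on a left-to-right device. Under that assumption the visual layout is simply the reverse of the logical layout. The function name `pure_rtl_logical_to_visual` is hypothetical, and the sketch is not a substitute for a BIDI Layout Engine.

```python
# Logical layout: code points in the order they are typed and read.
# Hebrew "shalom": shin, lamed, vav, final mem.
logical = "\u05e9\u05dc\u05d5\u05dd"


def pure_rtl_logical_to_visual(text: str) -> str:
    """Reverse a run of RTL letters to obtain its left-to-right display order.

    Assumption: `text` contains only RTL letters. Mixed-direction text needs
    a full BIDI layout engine and is outside the scope of this sketch.
    """
    return text[::-1]


# Visual layout: code points in the order they appear on a left-to-right
# display. Reversing again recovers the logical layout.
visual = pure_rtl_logical_to_visual(logical)

# Searching for the typed prefix (shin, lamed) works on the logical layout
# but fails on the visual layout, where the letters are stored in reverse.
prefix = "\u05e9\u05dc"
print(prefix in logical)
print(prefix in visual)
```

The two membership checks show why logical layout lets ordinary search modules be reused, while data in visual layout would need layout-aware handling.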
However, data in visual layout are still predominant on certain computer systems, such as legacy systems (IBM zSeries® mainframes, IBM iSeries®), while in other systems like Windows® systems, most data are created and processed in logical layout. There also exist systems that can handle data in either logical or visual layout (e.g., IBM AIX®). Some graphical user interface (GUI) components, such as Java® GUI components, expect BIDI textual data to be in logical layout. BIDI text within HTML may be in either logical or visual layout, but it is generally more convenient to format the data in logical layout, and browser support for data in logical layout is also preferred. (As they may be cited herein, zSeries, iSeries, and AIX are registered trademarks of International Business Machines Corporation in the United States, other countries, or both; Java is a registered trademark of Sun Microsystems, Inc., in the United States, other countries, or both; and Windows is a registered trademark of Microsoft Corporation in the United States, other countries, or both.)
BIDI text stored in the specific BIDI layout of one system cannot be displayed and processed properly on other systems that use a different BIDI layout. In order to display such text properly on other systems, a process of BIDI transformation needs to be applied to transform the text from its BIDI layout format (source BIDI format) to another BIDI format (target BIDI format). There exist some BIDI transformation tools that allow for transformation of a BIDI text from one BIDI layout to another. These BIDI transformation tools have four prerequisites:
1) The source text BIDI layout should be known;
2) The target text BIDI layout should be known;
3) A manual configuration should be performed in order to associate the BIDI layout for the source with the source text; and
4) A manual configuration should be performed in order to provide the desired BIDI layout of the target text (i.e., output text), or a default BIDI layout format is assumed for the target text.
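The four prerequisites can be pictured as a small configuration object, sketched below in Python. All names here (`BidiLayout`, `TransformConfig`, `resolve_layouts`, `DEFAULT_TARGET`) are hypothetical and do not correspond to any particular transformation tool, and the choice of logical LTR as the default target is only an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BidiLayout:
    ordering: str        # "logical" or "visual"
    base_direction: str  # "LTR" or "RTL"


@dataclass
class TransformConfig:
    # Prerequisites 1 and 3: the source layout is known and manually configured.
    source: Optional[BidiLayout] = None
    # Prerequisites 2 and 4: the target layout is configured, or a default is used.
    target: Optional[BidiLayout] = None


# Assumed default for illustration only.
DEFAULT_TARGET = BidiLayout(ordering="logical", base_direction="LTR")


def resolve_layouts(config: TransformConfig) -> Tuple[BidiLayout, BidiLayout]:
    """Return (source, target), failing when the source layout is unknown."""
    if config.source is None:
        raise ValueError("source BIDI layout is not configured")
    return config.source, config.target or DEFAULT_TARGET
```

With this shape, a text source whose layout nobody configured cannot be transformed at all, because the function has nothing from which to infer the source layout.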
However, in certain situations, the user is not aware of the source text BIDI layout. Even when the user is aware of the source BIDI layout format, manual configuration may not be possible (e.g., there are many sources and it is difficult to configure the BIDI layout for each of them, or all source text is received from a specific queue). In such situations, the text may be displayed in a BIDI layout other than its own, so that it appears corrupted and is not readable.
Further, to fulfill the above requirements, the user has to use a configuration tool or a user interface (UI) to supply the proper BIDI layout format for the input text and output text. Accordingly, the system has to provide a proper GUI for configuration of the BIDI layout format per text source and text target. This burdens the end user and consumes time and effort from either the system's end users or its developers. Another aspect of the problem is reduced usability and consumability, due to the need for the user to manually perform the configuration. In some applications (for example, an application that deals with many sources of unknown BIDI layout format), the configuration is not possible due to the nature of the application.
A known solution to this problem is the approach taken in the Unicode Bidirectional Algorithm, which is published as Unicode Standard Annex #9. The Unicode Standard defines a basis for complete BIDI support. The standard specifies detailed rules on how to encode and display mixed LTR and RTL text. In Unicode, all characters are stored in writing order, while software determines the direction in which the text is to be displayed on a page or screen. Thus, all computer systems complying with the Unicode Standard can display texts from different languages correctly in the same document, regardless of whether the writing directions of the texts are identical or not. The Unicode BIDI Algorithm defines optional steps that depend on setting the base direction attribute according to the first strong character. However, using the first strong character is risky, since the user may enter English text as the first word of an RTL text. Further, the first-strong-character approach assumes that an RTL language user always writes an RTL letter at the beginning of the text, but this is not always the case.
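The first-strong-character heuristic and its weakness can be shown with a minimal Python sketch. It uses the standard library function `unicodedata.bidirectional`, which returns the bidirectional class of a character (`L`, `R`, and `AL` being the strong classes). The function name `first_strong_direction` and the LTR fallback are assumptions made for illustration, and the sketch ignores the handling of directional isolates that the full algorithm specifies.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "LTR") -> str:
    """Guess the base direction from the first strongly directional character.

    Simplified sketch: characters inside directional isolates are not
    skipped, and `default` is returned when no strong character is found.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right letter
            return "LTR"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "RTL"
    return default


# A Hebrew sentence that starts with a Hebrew word is detected as RTL.
rtl_first = "\u05e9\u05dc\u05d5\u05dd world"
# The same kind of sentence starting with a Latin product name is
# detected as LTR, although the writer intended an RTL paragraph.
latin_first = "IBM \u05e9\u05dc\u05d5\u05dd"

print(first_strong_direction(rtl_first))
print(first_strong_direction(latin_first))
```

The second input is the risky case described above: the leading Latin word decides the base direction even though the rest of the text is RTL.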