The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Computer systems and software developed for enterprises should be able to handle differences that arise due to the different languages that are used. While text, numerical and other characters in languages written in Latin script, such as English, French and Spanish, are always read from left to right (LTR), other languages such as Hebrew or Arabic are not. In these other languages, text is read from right to left (RTL) and numerical characters are read LTR. In other words, these languages are bidirectional since they include a mixture of RTL and LTR.
Data is stored by computers as encoded data. When displaying or printing characters of a script, the computer identifies characters in the encoded data. The characters or glyphs are associated with the scripts. The order of the characters in memory (logical) may not be the same as the order in which they are displayed (visual). Bidirectional standards such as Unicode® were developed to handle display ordering issues that arise when entering, exporting, displaying and/or printing bidirectional data.
In operation, the computer may generate user interfaces, screens or web pages including forms with one or more input controls such as text boxes, drop-down lists, etc. that prompt a user to enter data. The input controls are usually designed to ensure that the user enters the data correctly and in the proper format. The computer system stores the data that is input either locally or remotely. A user may subsequently request access to the data and the computer retrieves the data and outputs the data to a screen or printer or exports the data to another program or file.
Some data values that need to be entered include multi-segment data values. The multi-segment data values typically include two or more segments that may be separated by a delimiter. Each segment may include text characters, numeric characters and/or other characters. In some situations, the multi-segment data values may need to be structured in the sense that the relative ordering of the adjacent segments should be maintained when using LTR or RTL embedding directions.
In bidirectional data codes, different types of characters are assigned strong, weak or neutral directional behavior. For example, numeric characters are always entered and consumed in an LTR manner. This is true whether they are consumed as numbers representing a numeric value or as a string of numeric characters representing a concept such as an ID number.
Latin characters are also always entered and consumed in an LTR manner. Arabic characters are always entered and consumed in an RTL manner. Some special characters are considered neutral and without direction such as, but not limited to, hyphen (-), bar (|) and period (.). These neutral type characters are commonly used as segment delimiters and take on the directional characteristics of the characters surrounding them.
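By way of illustration only, the directional classes discussed above may be inspected with the Python standard library function unicodedata.bidirectional, which returns the class recorded in the Unicode Character Database. The sample characters and the helper name bidi_classes are assumptions chosen for this sketch. It may be noted that the Unicode Character Database records the hyphen-minus and the period as weak separator classes (ES and CS) and the vertical bar as other neutral (ON); in each case the displayed direction depends on the surrounding characters.

```python
import unicodedata

# Sample characters chosen for illustration (not taken from the disclosure).
samples = {
    "digit 1": "1",
    "Latin letter A": "A",
    "Arabic letter alef": "\u0627",
    "hyphen-minus": "-",
    "vertical bar": "|",
    "period": ".",
}


def bidi_classes(chars):
    """Map each sample name to its Unicode bidirectional class."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}


for name, cls in bidi_classes(samples).items():
    # L = strong LTR, AL = strong RTL (Arabic letter), EN = European number,
    # ES / CS = weak separators, ON = other neutral.
    print(f"{name:20} {cls}")
```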
For example, the multi-segment data value may include two or more segments and each segment may include English, Arabic or numbers. In some cases, an application implementing the bidirectional data standard for display of bidirectional text may add directional formatting characters (that are not displayed) to re-orient the input data into “natural reading order” and subsequently modify the ordering of the multi-segment data value.
For example, a multi-segment data value includes a first segment that names the movie and a second segment that describes the status. In LTR natural reading order, the presentation would be seg1-seg2, while in RTL natural reading order, the segments would be seg2-seg1. For example, a multi-segment data value that naturally reads “Episode 1-Now showing”, when using mixed language with the second segment in Arabic, is correctly displayed as “[Arabic text: Now showing]-Episode 1”. That is, the RTL reader will begin on the right part of the string and read “Episode 1” and then read the second part of the string (in Arabic) “Now Showing”.
In another example demonstrating undesired display of multi-segment data values, the application is using an RTL language and the user types the following segments sequentially using a single neutral type character separator between each segment value: seg1=“123”, seg2=“456” and seg3=“[Arabic text]”. In this example, a combination of segment values is being entered in a single string. The user would like to enter three segment values seg1-seg2-seg3. In an RTL language, the user would like to see the segments oriented from the right: seg3-seg2-seg1. In this example, the first two segments are numeric. According to the bidirectional code, numbers always flow from left to right. The input control has no intrinsic insight into the fact that these values are not a single run of characters, but rather two distinct sets of values separated by a hyphen. Therefore, the first two segments are treated as a single run of characters in LTR orientation: 123-456. The introduction of an Arabic text segment begins a new run of characters. Since Arabic is read RTL and the first run of numbers is LTR, the Arabic run of characters comes AFTER the LTR run of characters and is then placed on the left side of the string. Therefore, seg3-seg1-seg2 is displayed instead of the desired seg3-seg2-seg1.
The neutral character “-” is taking on the characteristics of the data that is surrounding it. In RTL languages, numerals are read LTR. Non-numeric characters are read RTL. With the neutral delimiter “-” between the numeric characters 3 and 4, the bidirectional standard follows the LTR pattern of numeric characters 3 and 4 surrounding the delimiter and outputs “[Arabic text]-123-456” instead of “[Arabic text]-456-123”.
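The absorption of the delimiter into the numeric run may be sketched as follows. This is a minimal illustration of a single rule of the Unicode Bidirectional Algorithm (rule W4, under which a single European separator between two European numbers is treated as a European number) and is not a full implementation of that algorithm. The function names and the sample Arabic string, written with escape sequences, are assumptions introduced for the sketch.

```python
import unicodedata
from itertools import groupby


def classes_after_w4(text):
    """Return per-character bidi classes after applying only rule W4
    to European separators (ES) that sit between two European numbers (EN)."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    resolved = list(classes)
    for i in range(1, len(classes) - 1):
        if classes[i] == "ES" and classes[i - 1] == "EN" and classes[i + 1] == "EN":
            resolved[i] = "EN"
    return resolved


def runs(text):
    """Group consecutive characters that share a resolved class."""
    out, pos = [], 0
    for cls, group in groupby(classes_after_w4(text)):
        length = len(list(group))
        out.append((cls, text[pos:pos + length]))
        pos += length
    return out


# Logical (typed) order: seg1 "123", seg2 "456", seg3 a hypothetical Arabic word.
sample = "123-456-\u0639\u0631\u0628\u064a"
for cls, chunk in runs(sample):
    print(cls, ascii(chunk))
```

In this sketch the first hyphen joins “123” and “456” into one numeric run, whereas the second hyphen, which lies between a digit and an Arabic letter, remains a separator. The two numeric segments are therefore positioned as a unit.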
Bidirectional data standards may use a set of directional formatting characters that influence the display ordering of text and that are not actually displayed. To maintain the relative order of the segments in structured, multi-segment data values, the bidirectional data standard requires two directional formatting characters per segment. The formatting characters are used to influence the display ordering of text and are intended to be ignored for text comparisons, numerical analysis or other situations. However, the formatting characters may not be ignored and may cause failures when comparing otherwise identical text. The additional formatting characters require additional code, increase bandwidth when sending structured, multi-segment data values and require additional storage when storing structured, multi-segment data values.
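The comparison and storage costs described above may be illustrated with a short sketch. The wrapping scheme shown, in which each segment is enclosed between LEFT-TO-RIGHT EMBEDDING (U+202A) and POP DIRECTIONAL FORMATTING (U+202C), is merely one assumed way of adding two formatting characters per segment; the function name wrap_segments is hypothetical.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING (not displayed)
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (not displayed)


def wrap_segments(value, delimiter="-"):
    """Enclose every segment in an embedding pair: two formatting characters per segment."""
    return delimiter.join(f"{LRE}{seg}{PDF}" for seg in value.split(delimiter))


plain = "123-456"
wrapped = wrap_segments(plain)

# The two strings may render alike yet fail a direct comparison.
print(plain == wrapped)

# Each formatting character occupies three bytes in UTF-8.
print(len(plain.encode("utf-8")), len(wrapped.encode("utf-8")))
```

For this two-segment value the UTF-8 encoding grows from 7 bytes to 19 bytes, and the wrapped string no longer compares equal to the plain string.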
One approach for correcting this issue involves padding each segment with directional formatting characters to control the direction of the text and segments. There are often two or more ways to implement the directional formatting characters specified by the bidirectional code. In other words, there are multiple ways to implement the directional formatting in the bidirectional code that would result in the same desired formatting and that may not be identical byte-wise. The differences may tend to cause text comparison issues.
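As a hedged illustration of the byte-wise differences noted above, the sketch below marks up the same value in two assumed ways, once with embedding characters and once with isolate characters. The two results differ when compared directly and compare equal only after the formatting characters are removed. The helper strip_bidi_controls and its explicit list of code points are assumptions made for this sketch and are not part of any standard library.

```python
LRE, PDF = "\u202a", "\u202c"  # embedding start / pop directional formatting
LRI, PDI = "\u2066", "\u2069"  # isolate start / pop directional isolate

# Explicit and implicit directional formatting characters:
# LRM, RLM, ALM, LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = set(
    "\u200e\u200f\u061c\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def strip_bidi_controls(text):
    """Remove directional formatting characters before comparing text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


embedded = f"{LRE}123{PDF}-{LRE}456{PDF}"
isolated = f"{LRI}123{PDI}-{LRI}456{PDI}"

print(embedded == isolated)  # direct comparison of the two markups
print(strip_bidi_controls(embedded) == strip_bidi_controls(isolated))
```

A system that compares such values must therefore either normalize them in this manner or agree on a single markup in advance.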
Another approach is to separate each segment into a separate field. This approach, however, will eliminate certain functional capabilities such as combining and displaying the individual segments as a single display string (e.g. for reporting, email or other text-only communication). This approach requires additional effort to store, retrieve and manage the data using database systems. While this approach allows control of the local visual ordering, when the segments are shared with other systems there is no meta information that describes the correct way to display each segment. Therefore, two viewers looking at the same data using the same software display in different languages may have different viewing experiences as there is no intrinsic guidance on how to display the segments.