1. Field of the Invention
This invention relates to the technologies of computer displays and interpretation of files and data for display on a computer. This invention especially relates to the technologies of universal text encoding, markup languages, and data-to-display methods.
2. Description of the Related Art
The many competing motivations for selecting codepoints within a text encoding standard, such as the Unicode standard, threaten the fundamental purpose of a character encoding: data. Digital data is immensely convenient because the advantages of its great simplicity outweigh the losses incurred by representing knowledge imperfectly.
Often, in pursuit of all the benefits of such a standard, we set our sights on recovering what has been left out. For many years, numerical analysts have been systematically improving the fidelity of computer models of the apparently continuous world around us. They are helped by the mathematical properties of real numbers. A more difficult challenge is text which represents language.
In fact, we contend that the ability to interpret raw text has become more difficult. A text stream is no longer just a sequence of agreed upon codepoints. Text manipulation processes require additional information for proper interpretation, such as displaying the encoded text on a computer display or mobile telephone display.
There has been substantial interest in introducing an architecture for describing language and other semantic information within raw Unicode streams.
The need for expressing metadata, i.e. information describing data, has existed ever since humans started communicating with each other. Prior to written communication, metadata was expressed through our verbal speech. The tone, volume, and speed with which something was spoken often signaled its importance or underlying emotion. Often, the metadata may be as significant as, or even more significant than, the data itself, and often much more difficult to codify.
Writing and printing systems also have a need for metadata. This was conveyed through the use of color, style, and size of glyphs. Initially, this metadata was used as a mechanism for circumventing the limitations of early encoding schemes. As our communication mechanisms advanced, so did our need for expressing metadata.
FIG. 1 presents the Unicode character/control/metadata model, including an application layer (10), a control layer (11), a character layer (12), a codepoint layer (13), and a transmission layer (14). Unicode is well known in the art, and many alternate representations can be found in widely available literature.
A primary need for metadata in Unicode occurs in the control layer (11), as one may anticipate. In FIG. 1, a dotted line is used to separate the character layer (12) from the control layer (11) to illustrate the sometimes difficult to define boundary separating characters from control. This inability to provide a clean separation has made applications (10) that are based on Unicode more difficult to implement.
For greater understanding of the present invention, a historical summary is first presented which demonstrates the need for metadata within character encodings. Second, an examination of the presently available paradigms for expressing metadata is provided. In particular, attention is given to both extensible markup language (XML) and Unicode's character/control/metadata model.
Baudot's 5-bit teleprinter represents one of the earliest uses of metadata. Baudot divided his character set into two distinct planes, named Letters and Figures. The Letters plane contained all the uppercase Latin letters, while the Figures plane contained the Arabic numerals and punctuation characters. These two planes shared a single set of code values.
To distinguish their meaning, Baudot introduced two special meta-characters, letter shift “LTRS” and figure shift “FIGS”. When a sequence of codepoints was transmitted, it was preceded by either the FIGS or LTRS character. This permitted the characters to be interpreted unambiguously. This is similar to the shift lock mechanism in typewriters. For example, line 1 in FIG. 2 spells out “BAUDOT” while line 2 spells out “?-7$95”, as shown in TABLE 1.
TABLE 1
Using LTRS and FIGS in Baudot code
1    0x1F 0x19 0x03 0x07 0x09 0x18 0x10    BAUDOT
2    0x1B 0x19 0x03 0x07 0x09 0x18 0x10    ?-7$95
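By way of illustration only, the modal interpretation described above may be sketched as follows. The sketch assumes a partial lookup table restricted to the six code values of TABLE 1, with the Figures plane values taken from the “?-7$95” example in the text; the names PLANES and decode_baudot are hypothetical, and Letters is assumed as the initial shift state.

```python
# Illustrative sketch only: a partial table covering the six codes of TABLE 1.
LTRS = 0x1F  # letter shift meta-character
FIGS = 0x1B  # figure shift meta-character

# code value -> (Letters plane character, Figures plane character)
PLANES = {
    0x19: ("B", "?"),
    0x03: ("A", "-"),
    0x07: ("U", "7"),
    0x09: ("D", "$"),
    0x18: ("O", "9"),
    0x10: ("T", "5"),
}


def decode_baudot(codes):
    """Decode 5-bit code values, tracking the current shift state."""
    plane = 0  # 0 = Letters, 1 = Figures (Letters assumed initially)
    out = []
    for code in codes:
        if code == LTRS:
            plane = 0
        elif code == FIGS:
            plane = 1
        else:
            out.append(PLANES[code][plane])
    return "".join(out)


line1 = decode_baudot([0x1F, 0x19, 0x03, 0x07, 0x09, 0x18, 0x10])
line2 = decode_baudot([0x1B, 0x19, 0x03, 0x07, 0x09, 0x18, 0x10])
print(line1)  # BAUDOT
print(line2)  # ?-7$95
```

The same six data code values yield different text solely because of the shift character that precedes them, which is the sense in which the shift characters act as metadata.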
However, this method still left the problem of how to transmit a special signal to a teleprinter operator. Baudot once again set aside a special codepoint, named bell “BEL”. This codepoint would not result in anything being printed, but rather it would be recognized by the physical teleprinter. The teleprinter, having recognized the BEL character, would perform some action, such as ringing a bell.
About 1900, metadata characters began to be used as format effectors, as can be seen in Murray's code. Murray's code introduced two additional characters: (a) column (COL), which corresponds to carriage return in International Telegraph Alphabet Number 2 (ITA2), and (b) line page (LINE PAGE), which corresponds to line feed in ITA2. These two codes were used to control the positioning of the print wheel, and to control the advancement of paper. This encoding scheme was used for nearly fifty years with little modification. It also served as the foundation for future encoding techniques.
During the late 1950s and early 1960s, telecommunication hardware rapidly became much more complex. This complexity, however, resulted in the need for more sophisticated protocols, and for greater amounts of metadata. For this purpose, the US Army introduced a 6-bit character code called “FIELDATA.” FIELDATA introduced the concept of “supervisor codes”, known today as “control codes.” These codepoints were used to signal communications hardware.
The hardware manufacturers were certainly not the only users of metadata, however. It did not take long for the data processing community to realize that they also had uses for metadata. This unfortunately taxed the existing 5-bit and 6-bit encoding schemes to the point of rendering them unusable, because the codes needed to address all of the user needs could not be represented in such a small code space.
This drove the creation of a richer and more flexible encoding scheme. These issues were directly addressed by the American Standard Code for Information Interchange (ASCII).
The ASCII code, a 7-bit encoding, served not only as a mechanism for data interchange, but also as an architecture for describing metadata. This metadata could be used for communicating higher order protocols in hardware as well as software. The architecture is based upon ASCII's escape character (ESC) at hex value 0x1B.
Initially, the ESC was used for shifting to one or more character sets. This was of particular importance to ALGOL programmers. As ASCII was adopted internationally, the ESC became useful for signaling the swapping in and out of international character sets. This concept was later expanded in the 1980s in the International Organization for Standardization (ISO) ISO-2022 standard.
ISO-2022 is an architecture and registration scheme for allowing multiple 7-bit or 8-bit encodings to be intermixed. It is a modal encoding system like Baudot. Escape sequences or special characters are used to switch between different character sets or multiple versions of the same character set. This scheme operates in two phases. The first phase handles the switching between character sets, while the second handles the actual characters that make up the text.
Non-modal encoding systems make direct use of the byte values in determining the size of a character. In such a scheme, characters may vary in size within a stream of text, typically ranging from one to three bytes. This can be witnessed in the well-known UTF-8 and UTF-16 encodings.
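As a minimal sketch of how a non-modal encoding derives character size from byte values, the following example reads the length of a UTF-8 sequence from its first byte alone. The function names are hypothetical. The sketch assumes well-formed input and does not validate continuation bytes; it also handles the four-byte form that UTF-8 permits in addition to the one to three byte forms mentioned above.

```python
def utf8_sequence_length(lead_byte):
    """Return the byte length of a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: single byte (ASCII range)
    if 0xC0 <= lead_byte < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead_byte < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead_byte < 0xF8:
        return 4  # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def split_utf8(data):
    """Split well-formed UTF-8 bytes into one byte group per character."""
    groups = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n
    return groups


# "a", "e with acute accent", and the euro sign
sizes = [len(g) for g in split_utf8("a\u00e9\u20ac".encode("utf-8"))]
print(sizes)  # [1, 2, 3]
```

No shift state is carried between characters: each character's size is determined locally, in contrast to the modal schemes of Baudot and ISO-2022.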
In ISO-2022, up to four different sets of graphical characters may be simultaneously available, labeled G0 through G3. Escape sequences are used to assign and switch between the individual graphical sets. For example, line 1 in TABLE 2 shows the byte sequence for assigning the ASCII encoding to the G0 alternate graphic character set. Line 2 of TABLE 2 shows the Latin-1 encoding being assigned to the G1 set.
TABLE 2
Example ISO-2022 Escape Sequences
1    ESC 0x28 0x42    assign ASCII to G0
2    ESC 0x2D 0x41    assign Latin 1 to G1
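For illustration, the first phase of such a scheme (recognizing character set assignments) might be sketched as below. The sketch recognizes only the two escape sequences of TABLE 2; the names TARGETS, CHARSETS, and scan_designations are hypothetical, and a complete ISO-2022 implementation would handle many more sequences as well as the shift functions.

```python
ESC = 0x1B

# Intermediate byte -> graphic set being assigned (only those in TABLE 2)
TARGETS = {0x28: "G0", 0x2D: "G1"}
# (intermediate byte, final byte) -> character set name (only those in TABLE 2)
CHARSETS = {(0x28, 0x42): "ASCII", (0x2D, 0x41): "Latin 1"}


def scan_designations(data):
    """Split a byte stream into assignment events and the remaining text bytes."""
    events = []
    text = bytearray()
    i = 0
    while i < len(data):
        key = tuple(data[i + 1:i + 3])  # the two bytes following position i
        if data[i] == ESC and key in CHARSETS:
            events.append((TARGETS[key[0]], CHARSETS[key]))
            i += 3
        else:
            text.append(data[i])
            i += 1
    return events, bytes(text)


events, text = scan_designations(b"\x1b\x28\x42Hi\x1b\x2d\x41")
print(events)  # [('G0', 'ASCII'), ('G1', 'Latin 1')]
print(text)    # b'Hi'
```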
Most data processing tools make little if any distinction amongst data types. The only distinctions are those made by the human user's interpretation. Data is simply viewed by the processing tools in terms of bytes. For example, the common UNIX text searching utility known as GREP assumes that data is represented as a linear sequence of stateless, fixed-length, independent bytes. GREP is highly flexible when it comes to searching, whether it be characters or object code. This model has served well under the assumption that one character equals one codepoint, but encoding systems have advanced and user expectations have risen.
Over the last ten or so years, Unicode has become the de facto standard for encoding multilingual text. This has brought a host of new possibilities that only a few could have previously imagined. Users, however, want more than just enough information for intelligible communication. Plain text in its least common denominator is simply insufficient.
Several approaches to the enrichment of plain text have been discussed, of which ISO-2022 is one. Even XML can be viewed in this framework. Both concern meta information yet have different purposes, goals, and audiences. The transition away from storing and transmitting text as plain streams of codepoints is now well underway.
Extensible markup language (XML) provides a standard way of sharing structured documents, and for defining other markup languages. XML uses Unicode as its character encoding for data and markup. Control codes, data characters, and markup characters may appear intermixed in a text stream.
When this situation is combined with overlapping mechanisms for encoding higher order information, confusion and ambiguity may ensue when processing or interpreting the encoded data. There may exist situations in which markup and control codes should not be interleaved. This issue is quickly becoming apparent within XML and Unicode.
Whitespace characters in XML are used in both markup and data. The characters used in XML to represent whitespace are limited to “space”, “tab”, “carriage return”, and “line feed”. Unicode, on the other hand, offers several characters for representing whitespace, in particular the line separator U2028 and the paragraph separator U2029. Their use within XML, however, may lead to ambiguities due to the additional implied semantics.
In Unicode, these characters may be used to indicate hard line breaks and paragraphs within a stream. These may affect visual rendering, as well as serve as separators. When used within XML, however, it is unclear whether the implied semantics can be ignored. Does the presence of one of these control codes indicate that a rendering protocol is being specified in addition to their use as whitespace, or are they simply whitespace?
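The mismatch between the two notions of whitespace can be illustrated with a short sketch. The set XML_WHITESPACE and the function is_xml_whitespace are hypothetical names that encode the four characters listed above; str.isspace() is used here only as a convenient stand-in for a Unicode-aware whitespace test.

```python
# The four characters XML treats as whitespace: space, tab, CR, LF
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def is_xml_whitespace(ch):
    """True only for the whitespace characters XML itself recognizes."""
    return ch in XML_WHITESPACE


for ch in (" ", LINE_SEPARATOR, PARAGRAPH_SEPARATOR):
    # Compare the XML view with a Unicode-aware view of the same character
    print(f"{ord(ch):04X} xml={is_xml_whitespace(ch)} unicode={ch.isspace()}")
```

A Unicode-aware process treats U2028 and U2029 as whitespace while an XML whitespace test does not, so the same stream may be segmented differently depending on which layer examines it.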
The use of name “tags” within XML also poses problems. The characters in the Compatibility Area and Specials Area UF900-UFFFE from Unicode are not permitted to be used in names within XML.
Their exclusion is due in part to the characters being already encoded in other places within Unicode. By no means, though, is this the only reason. If characters from the Compatibility Area were included, the issue of normalization would then need to be addressed. In this context, normalization refers to names being equivalent, but not necessarily the same. Additionally, characters that possess both a decomposed and a precomposed form also need attention.
Unicode attempts to address these issues in Unicode Technical Report #15 “Unicode Normalization Forms”, which is freely available from the Unicode organization. Unicode provides guidelines and an algorithm for determining when two character sequences are equivalent. In general, there are two classes of normalization: Canonical and Compatibility.
Canonical normalization handles equivalence between decomposed and precomposed characters. This type of normalization is reversible. Compatibility normalization addresses equivalence between characters that visually appear the same, and is irreversible.
Compatibility normalization in particular is problematic within XML. XML is designed to represent raw data free from any particular preferred presentation. Characters that may be compatible for presentation purposes, however, do not necessarily share the same semantics. It may be the case that an additional protocol is being specified within the stream. For example, the UFB00 character on line 1 of TABLE 3 is compatible with the two-character sequence “U0066 U0066” on line 2. Line 1, however, also specifies an additional protocol: ligatures. In such a situation, it is unclear whether or not the names were intended to be distinct. It is difficult to tell when the control function (higher order protocol specification) of a character can be ignored and when it cannot.
TABLE 3
Example Compatibility Normalization Ambiguity
1    UFB00          ff ligature
2    U0066 U0066    ff no ligature
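The two classes of normalization can be observed with Python's standard unicodedata module, as in the following sketch. The variable names are illustrative; NFC and NFKC are used here as representative canonical and compatibility forms.

```python
import unicodedata

ligature = "\uFB00"     # line 1 of TABLE 3: the ff ligature
plain = "\u0066\u0066"  # line 2 of TABLE 3: two separate letters f

# Canonical normalization: precomposed and decomposed forms become equal
precomposed = "\u00E9"  # e with acute accent as one codepoint
decomposed = "e\u0301"  # e followed by a combining acute accent
canonical_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# The ligature is not canonically equivalent to "ff" ...
ligature_canonical_equal = (
    unicodedata.normalize("NFC", ligature) == unicodedata.normalize("NFC", plain)
)
# ... but it is compatibility equivalent, and the ligature is lost in the result
ligature_compat_equal = (
    unicodedata.normalize("NFKC", ligature) == unicodedata.normalize("NFKC", plain)
)

print(canonical_equal, ligature_canonical_equal, ligature_compat_equal)
# True False True
```

After compatibility normalization nothing in the stream records that a ligature was originally requested, which is the irreversibility noted above.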
Further, some have argued that Unicode's Normalization Algorithm is difficult to implement, resource intensive, and prone to errors. To avoid such problems, XML has chosen not to perform normalization when comparing names.
Problems such as these are due to the lack of separation of syntax from semantics within Unicode. The absence of a general mechanism for specifying protocols “metadata” only serves to confound these issues even further.
There are two well-known general approaches to encoding metadata within text streams: in-band signalling and out-of-band signalling. In-band signalling conveys metadata and textual content using a single shared set of characters, while out-of-band signalling conveys metadata independently from the data. In-band signalling is employed within hypertext markup language (HTML) and XML.
Determining whether a character is data or metadata using in-band signalling depends on the context in which a character is found. That is, codepoints are “overloaded.” This achieves maximal use of the character encoding, as characters are not duplicated. It also does not require encoding modifications as protocols change.
All of this, however, comes at the expense of the complexity of parsing the data. It is no longer possible to conduct a simple parse of a stream looking for just data or metadata.
Using out-of-band signalling for describing Unicode metadata requires the definition and transmission of complex structures serving a purpose similar to that of document type definitions (DTDs) in XML. This has the ill effect of making the transmission of Unicode more intricate. It would no longer be acceptable to simply transmit the raw Unicode text. Without the metadata, the meaning of the raw text may be ambiguous. On the other hand, parsing of data and metadata may be trivial, given that the two are not intermixed. The transmission problems that arise from requiring pairs of raw data files and metadata files to be handled together may often outweigh the potential parsing benefits of out-of-band signalling, depending on the application.
It is still possible to construct a metadata signalling mechanism that mixes data and metadata and yet allows for simple parsing. This is the approach that is currently under discussion within the Unicode community and can be found in Unicode Technical Report #7. It is called “light-weight in-band signalling”.
According to this proposed approach, this is achieved in Unicode through the introduction of a special set of characters that may only be used for describing metadata “tagging”. The current model under consideration within Unicode is to add 97 new characters to Unicode. These characters would consist of a copy of the ASCII graphic characters, a language tag character, and a cancel tag character. These characters would be encoded in Plane 14 “surrogates” U000E0000-U000E007F. These characters could then be used to spell out any ASCII based metadata protocol which needs to be embedded within a raw Unicode stream of text. This permits the construction of simple parsers for separating metadata from data since there is no overloading of characters.
The use of the tags is very simple. First, a tag identifier character is chosen, followed by an arbitrary number of Unicode tag characters. A tag is implicitly terminated when either a non-tag character is found or another tag identifier is encountered. Currently there is only one tag identifier defined, the “language” tag, as shown in TABLE 4. Line 1 in TABLE 4 demonstrates the use of the fixed codepoint language tag “U000E0001”, along with the cancel tag “U000E007F”. The Plane 14 ASCII graphic characters are in bold and are used to identify the language. The language name is formed by concatenating the language ID from ISO-639 and the country code from ISO-3166. In the future, a generic tag identifier may be added for private tag definitions.
TABLE 4
Example Unicode Light-Weight In-band Signaling Language Tag
U000E0001 fr-Fr french text U000E0001 U000E007F
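Because the tag characters occupy their own codepoint range, separating metadata from data reduces to a range test. The following is a minimal sketch under that assumption; the helper names to_tag_chars and split_stream are hypothetical, and the range bounds are those given in the text above.

```python
TAG_BASE = 0xE0000      # start of the Plane 14 tag range given in the text
LANGUAGE_TAG = 0xE0001  # language tag identifier
CANCEL_TAG = 0xE007F    # cancel tag character


def is_tag_char(ch):
    """True when the character lies in the Plane 14 tag range."""
    return TAG_BASE <= ord(ch) <= CANCEL_TAG


def to_tag_chars(ascii_text):
    """Map ASCII graphic characters onto their Plane 14 copies."""
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)


def split_stream(stream):
    """Separate tag characters (metadata) from all other characters (data)."""
    data = "".join(ch for ch in stream if not is_tag_char(ch))
    meta = "".join(ch for ch in stream if is_tag_char(ch))
    return data, meta


# The stream of TABLE 4: language tag, "fr-Fr", text, then language tag + cancel
stream = (
    chr(LANGUAGE_TAG) + to_tag_chars("fr-Fr") + "french text"
    + chr(LANGUAGE_TAG) + chr(CANCEL_TAG)
)
data, meta = split_stream(stream)
print(data)       # french text
print(len(meta))  # 8
```

No context needs to be tracked to classify a character, which is the parsing advantage this approach claims over overloaded in-band markup.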
Tag values can be cancelled by using the tag cancel character. The cancel character is simply appended onto a tag identifier. This has the effect of cancelling that tag identifier's value. If the cancel tag is transmitted without a tag identifier the effect is to cancel any and all processed tag values.
The value of a tag continues until either it implicitly goes out of scope or a cancel tag character is found. Tags of the same type may not be nested. The occurrence of two consecutive tag types simply applies the new value to the rest of the unprocessed stream. Tags of differing types may be interlocked. Tags of different types are assumed to ignore each other. That is, there are no dependencies between tags.
Tag characters have no particular visible rendering and have no direct effect on the layout of a stream. Tag aware processes may choose to format streams according to their own interpretation of tags and their associated values. Tag unaware processes should leave tag data alone and continue processing.
Although the general light-weight approach to metadata definition is useful, it poses two problems. First, new tag identifiers always require the introduction of a new Unicode codepoint. This puts Unicode as a standard in a constant state of flux, as well as fixing or limiting the number of possible tag identifiers. Second, there is no method to specify multiple parameters for a tag. This deficiency forces the creation of additional tag identifiers to circumvent this limitation.
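The scoping and cancellation rules described above might be sketched, for the single language tag type, as follows. This is an illustrative interpretation rather than a conforming implementation: language_runs and to_tag_chars are hypothetical names, well-formed input is assumed, and stray tag characters outside a tag are not given any special treatment.

```python
TAG_BASE = 0xE0000
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F


def to_tag_chars(ascii_text):
    """Map ASCII graphic characters onto their Plane 14 copies."""
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)


def language_runs(stream):
    """Return (language, text) runs; language is None when no tag is in effect."""
    runs = []
    language = None
    text = ""
    i, n = 0, len(stream)
    while i < n:
        cp = ord(stream[i])
        if cp in (LANGUAGE_TAG, CANCEL_TAG):
            if text:  # a tag boundary closes the current run of text
                runs.append((language, text))
                text = ""
            i += 1
            if cp == CANCEL_TAG:
                language = None  # bare cancel: clear all tag values
            elif i < n and ord(stream[i]) == CANCEL_TAG:
                language = None  # identifier + cancel: clear this tag's value
                i += 1
            else:
                value = ""
                # Collect tag characters until a non-tag character appears
                while i < n and TAG_BASE + 0x20 <= ord(stream[i]) <= TAG_BASE + 0x7E:
                    value += chr(ord(stream[i]) - TAG_BASE)
                    i += 1
                language = value  # a new value replaces the previous one
        else:
            text += stream[i]
            i += 1
    if text:
        runs.append((language, text))
    return runs


LT, CT = chr(LANGUAGE_TAG), chr(CANCEL_TAG)
stream = (
    "plain " + LT + to_tag_chars("fr-FR") + "bonjour "
    + LT + to_tag_chars("en-US") + "hello " + LT + CT + "end"
)
runs = language_runs(stream)
print(runs)
```

The second language tag replaces the first without nesting, and the language tag followed by the cancel character returns the stream to an untagged state.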
As these specific illustrations and cases indicate, the handling of character data in information processing has always been troublesome. Small encoding mechanisms limit the potential trouble. Many compromises take place completely outside the character set while encoding the data.
On the other hand, Unicode has enough space for lots of problems. This trouble has largely been centered around the inability to clearly separate the notions of syntax, semantics, and protocols.
The many demands placed on codepoints from Unicode have led to confusion in areas of text exchange, legacy interchange, glyph picking, and others. This confusion has intimidated adopters into non-conformance; consider, for example, Unicode normalization within XML and Java.
Therefore, there is a need in the art for a method and system which allows the present collection of convoluted, unused, and unimplementable Unicode algorithms to be recast in a more manageable context, and which allows the algorithms to become detectable, reversible as well as convertible. Further, there is a need in the art for this new method and system to provide extensibility to Unicode, such as is available in markup languages such as XML, without requiring new tag identifiers to be registered by a protocol controlling authority. Additionally, there is a need in the art for this new method and system to allow for an arbitrary number of control parameters to be specified in a data stream.