Since their inception, the basic components of computers have remained the same: a processor and a memory. The processor is the active component of the computer system; it manipulates data retrieved from the computer's memory to carry out the tasks, programs, or processes assigned to the computer system. Computer memory stores information used by the computer and works in much the same way as the memory of a person. For example, just as humans memorize lists, poetry, and events, a computer system stores words, numbers, pictures, etc. in its memories. Similarly, specialized hardware within a computer processor reads and interprets information from computer memory, analogous to a human reading and interpreting printed words. And just as the arrangement of words on a page is important to human readers, the arrangement of information in the computer's memory is important to the computer system.
In the past, the choice of coding or data format was not a significant problem because computers seldom interchanged data, or did so in ways that did not depend on data formats. But, as we all know, that universe was short-lived, and computers became increasingly networked by local area networks, wide area networks, and eventually the Internet. The data format problem, i.e., transforming data between computers having different formats, became more severe. Operating systems, programming languages, and computer architectures each had preferences for a particular data format, often a proprietary one. As long as the data stayed on the same kind of machine and the programs used the same compiler, differences in byte order, rounding, and the like caused no problem. If, however, the purpose of an application program is to analyze data from a variety of sources, such as in international trade and banking, the program must cope with a wide variety of data formats specifying byte order, rounding, integer sizes, etc., depending on the particular machine and compiler chosen. Even today, source code, especially in a language like C or C++, can adapt to different data structure layouts through simple recompilation, and for many programs that is the end of the story. Exacerbating those dilemmas of incompatible machine codes was the problem of international communication and commerce, wherein the barriers of human languages also had to be surmounted. This situation was most often encountered in the world of international commerce and the large mainframe computers and servers that served multinational businesses, until the Internet pervaded homes and electronic commerce took another giant leap.
Even so, unless users all have the same computers, large multinational corporations still find it difficult to distribute and share information, especially with multinational suppliers or customers whose choice of computers cannot be controlled and is not always compatible with the same data format. Consequently, computer software developers devote enormous time and resources to developing multiple versions of the same software to support different computer data formats, different computer systems, and different languages.
Today data is transferred through networks using formally defined protocols. Protocol information may be defined by international standards committees and includes, e.g., the ISO/OSI protocol stack, CCITT recommendations for data communications and telephony, IEEE 802 standards for local area networking, and ANSI standards. Other examples include the TCP/IP protocol stack, defined by the U.S. Department of Defense; military and commercial data processing standards such as the U.S. Navy SAFENET; the XEROX Corporation XNS protocol suite; the SUN MICROSYSTEMS NFS protocol; and compression standards for HDTV and other video formats. The point is that there are numerous data transfer protocols in which byte order and other features of the data structure layout are predetermined. Data transfer between or among systems using different transfer protocols compounds the problem, because the transfer must now bridge human languages, processor data storage formats, operating systems, programming languages, and data transfer protocols. It is a complicated world.
The Unicode Standard, referred to herein as Unicode, was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard that could be easily used for text encoding everywhere on the planet. Unicode follows some fundamental principles, examples of which include a universal repertoire, logical order, use of characters rather than glyphs, dynamic composition, maintenance of semantics, equivalent sequences, and convertibility. Unicode consistently encodes multilingual plain text, thereby enabling the exchange of text files across human language barriers and greatly simplifying the work of computer users who deal with multilingual text. Mathematicians and scientists who regularly use mathematical symbols and other technical characters also find Unicode invaluable.
The design of Unicode is based on the simplicity and consistency of the American National Standard Code for Information Interchange (ASCII), but goes far beyond ASCII's limited ability to encode only the Latin alphabet; its first 256 characters are taken from the Latin-1 character set, whose first 128 characters are in turn those of ASCII. Unicode provides the capacity to encode all of the characters used for the major written languages of the world, incorporating the character sets of many existing international, national, and corporate standards. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia. Unicode further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc. Duplicate encoding of characters is avoided by unifying characters within scripts across languages; for example, the Chinese, Japanese, and Korean (CJK) languages share many thousands of identical characters because their ideograph sets evolved from the same source, so a single code is assigned for each kanji or ideograph common to these languages. For all scripts, Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard. Unicode has characters to specify changes in direction when scripts of different directionality are mixed, for example, Arabic and English. Unicode addresses only the encoding and semantics of text and does not check for spelling, grammar, etc.
The basic building block of all computer data is the bit; a group of bits, typically eight, comprises a byte, and a group of bytes, usually a power of two, comprises a word. In many formats, four bytes, or thirty-two bits, form a word; a half-word is two bytes or sixteen bits; and a double word is eight bytes or sixty-four bits. The original goal of Unicode was to use a single 16-bit encoding, providing code points for more than 65,000 characters. Although 65,000 characters are sufficient for encoding most of the many thousands of characters used in the major languages of the world, Unicode supports three encoding forms that use a common repertoire of characters but allow for encoding more than a million characters in all. Unicode Version 4.0 provides codes for more than 96,000 characters from the world's alphabets, ideograph sets, and symbol collections.
Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits. Unicode defines three Unicode transformation formats (UTFs) that allow the same data to be transmitted in a byte-, word-, or double-word-oriented format, i.e., in 8, 16, or 32 bits per code unit. All three transformation formats encode the same common character repertoire and can be efficiently transformed into one another without loss of data. UTF-8, popular for the hypertext markup language (HTML) and similar protocols on the world wide web and the Internet, transforms all Unicode characters into a variable-length encoding of bytes. UTF-8 is particularly useful because its first 128 characters correspond to the familiar ASCII set and have the same byte values as ASCII, so Unicode characters transformed into UTF-8 can be used with much existing software without software rewrites. UTF-16 is useful when there is a need to balance access to characters with economy of storage: the characters that are most often used fit into a single 16-bit code unit, and other characters are accessible via pairs of 16-bit code units. UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired; each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
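The differing code unit sizes of the three transformation formats can be seen in a short Python sketch; the sample characters below are illustrative, chosen from the ASCII, Latin-1, general, and supplementary ranges:

```python
# Encode the same characters in each Unicode transformation format and
# compare the resulting byte lengths.
samples = ["A", "é", "€", "𝄞"]  # ASCII, Latin-1, BMP symbol, supplementary plane

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte-order mark
    utf32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} bytes, "
          f"UTF-16 {len(utf16)} bytes, UTF-32 {len(utf32)} bytes")
```

Note that "A" occupies one byte in UTF-8 but four in UTF-32, while the supplementary-plane character occupies four bytes in every format (a surrogate pair in UTF-16).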
To avoid deciding what is and is not a text element in different processes, the Unicode characters correspond to the most commonly used text elements. Each character is assigned a unique number and name that specify it and no other. Each of these numbers is called a code point and is listed in hexadecimal form following the prefix “U+”. For example, the code point U+0041 is the hexadecimal number 0041 and represents the character “A” in Unicode. Unicode retains the order of characters where possible, and the characters, also called code elements, are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters and continues with Greek, Cyrillic, Hebrew, Arabic, Indic, and other scripts, followed by symbols and punctuation, and continuing with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul. Code blocks vary greatly in size; for example, the Cyrillic code block does not exceed 256 code points, while the CJK code blocks contain many thousands of code points. Towards the end of the codespace is a range of code points reserved for private use areas, which have no universal meaning and may be used for characters specific to a program or by a group of users for their own purposes. Following the private use area is a range of compatibility characters, encoded only to enable transcoding to earlier standards and old implementations that made use of them.
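The character-to-code-point mapping can be inspected directly in Python, whose ord() and chr() built-ins convert between characters and their numeric code points:

```python
# Each character maps to a unique code point, written in hexadecimal
# after the "U+" prefix; the Greek and Cyrillic samples show the
# logical grouping of scripts within the codespace.
for ch in "Aβя":
    print(f"{ch!r} -> U+{ord(ch):04X}")

# chr() reverses the mapping, from code point back to character.
assert chr(0x0041) == "A"
```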
In 1964 IBM announced a new computer series, the System/360, which evolved into the System/390 and into the present zSeries. This computer introduced a new character coding set, the extended binary-coded decimal interchange code (EBCDIC), of 256 eight-bit characters based on Hollerith punched card conventions. When it turned out that this development followed a totally different encoding scheme from ASCII, where the heritage of paper tape is clearly discernible, it was already too late: IBM had invested far too much to change the design. In the course of time even EBCDIC acquired national versions, so EBCDIC no longer means a single code table. EBCDIC remained the most frequently applied character code up to the late 1970s; only with the advent of the personal computer did ASCII use begin to increase. Yet even today, the world of IBM mainframe computers and large servers is still dominated by EBCDIC.
These integrated mainframe systems, sometimes referred to as legacy systems, continue to store programming language statements such as for Report Program Generator (RPG) and Distributed Data Services (DDS) in EBCDIC encoded files. Each statement within a file in EBCDIC encoded files has the same byte length. The statements of fixed length, moreover, are divided into fields of fixed length wherein each field has a predefined starting byte position. The length of each field and therefore the length of the statement is defined as the number of bytes that the field or statement occupies in physical memory.
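A fixed-layout statement of this kind can be sliced by byte position. The following Python sketch uses cp037 (a common single-byte EBCDIC code page shipped with Python's codec library) and an illustrative field layout, not the actual RPG or DDS column definitions:

```python
# Slice a fixed-length EBCDIC statement into fields by byte position.
# The field positions and widths here are illustrative only.
record_text = "0001CUSTNAME  A0025"           # 19 characters
record = record_text.encode("cp037")          # cp037: an EBCDIC code page
assert len(record) == 19                      # one byte per character here

seq    = record[0:4].decode("cp037")            # bytes 1-4:   sequence number
name   = record[4:14].decode("cp037").rstrip()  # bytes 5-14:  field name
dtype  = record[14:15].decode("cp037")          # byte 15:     data type
length = int(record[15:19].decode("cp037"))     # bytes 16-19: declared length
print(seq, name, dtype, length)
```

Because every statement has the same byte length and every field a fixed starting byte, the slicing offsets alone are enough to locate each field in physical memory.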
Files encoded in EBCDIC having both fields and statements of fixed byte length may be downloaded to a workstation implementing Unicode for revision. When the file is downloaded, the file content is converted from EBCDIC to Unicode. Conversely, when the file is uploaded from the workstation to the legacy system, the file content is converted from Unicode to EBCDIC. Typically, prior art conversion methods that convert from EBCDIC to Unicode and from Unicode to EBCDIC are unaware of statements and are, therefore, unaware of the length of statements within the file. Although the same statement represented in Unicode on the workstation has the same number of characters, it may have a different byte length because the characters are represented differently. Recall that each Unicode character may have a different byte length than its EBCDIC equivalent; for instance, a statement in Japanese may consist of ten Unicode characters, or twenty bytes at two bytes per character, while the same statement in EBCDIC may consist of four single-byte characters followed by six double-byte characters, for a total of sixteen bytes. If the file has not been edited on the workstation, the field lengths and statement lengths remain correct. If, however, a statement in the file is altered through insertion, deletion, or replacement of characters on the workstation before the file is converted back to EBCDIC, field and/or statement lengths may become different from the original fixed lengths, resulting in invalid statements.
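The character-count-versus-byte-count discrepancy can be illustrated with a short Python sketch. Here cp500 (EBCDIC International, a single-byte code page in Python's codec library) stands in for the legacy encoding; a mixed single/double-byte EBCDIC page as described above would show the same effect within EBCDIC itself:

```python
text = "Café"                        # 4 Unicode characters
ebcdic = text.encode("cp500")        # cp500 stores each of these in 1 byte
utf8 = text.encode("utf-8")          # "é" needs 2 bytes in UTF-8

print(len(text), len(ebcdic), len(utf8))   # same characters, different bytes

# Conversion is lossless for characters the code page covers, so a file
# that is merely downloaded and uploaded round-trips correctly.
assert ebcdic.decode("cp500") == text

# An edit that preserves the character count can still change the byte
# count of an encoded form, which is what invalidates fixed-length
# statements after conversion back to the legacy encoding.
edited = "Cafe"                      # still 4 characters
assert len(edited) == len(text)
assert len(edited.encode("utf-8")) != len(utf8)
```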
On the workstation, each character is displayed as one Unicode character, but because a Unicode character may be the equivalent of multiple bytes, it may not be interpreted correctly in a mixed EBCDIC encoding of a legacy system. An editing program, moreover, may extract fields from a statement, modify the fields, and reassemble the individual fields to form a new statement. In today's world of graphical user interfaces, an editing program may display each field of a programming statement in a different color. In each of these cases, the editor needs to know, based on the number of bytes in the field, which group of Unicode characters forms a field.
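One way an editor can recover that grouping is to compute, for each Unicode character, its byte width in the legacy code page. This Python sketch is illustrative, again using the single-byte cp500 page and UTF-8 for contrast:

```python
# Byte width of each character under a given code page, so that
# byte-delimited fields can be mapped back to groups of Unicode characters.
def char_byte_widths(text: str, codepage: str) -> list[int]:
    return [len(ch.encode(codepage)) for ch in text]

print(char_byte_widths("Café", "cp500"))   # [1, 1, 1, 1] in EBCDIC cp500
print(char_byte_widths("Café", "utf-8"))   # [1, 1, 1, 2] in UTF-8
```

Summing the widths from the start of the statement tells the editor at which character a byte-delimited field begins and ends.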
Current Unicode string manipulation classes assume that lengths are defined as a number of Unicode characters. This assumption is wholly inadequate for the case cited above, i.e., when a statement in the file is altered through insertion, deletion, or replacement of characters on a Unicode workstation before the file is converted back to EBCDIC, because the statement length may then change from the original fixed statement length, resulting in invalid statements. Thus, the industry requires a new Unicode string manipulation class in which lengths are defined as a number of bytes in the legacy code page encoding, so that the lengths of fields and statements remain constant.
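A minimal sketch of such a class might look as follows; this is a hypothetical illustration in Python, not any existing library class, and it again uses the single-byte cp500 page as the legacy encoding:

```python
# A string wrapper whose length is measured in bytes of a legacy code
# page rather than in Unicode characters, and which rejects edits that
# would change a field's fixed byte length.
class LegacyString:
    def __init__(self, text: str, codepage: str = "cp037"):
        self.text = text
        self.codepage = codepage

    def char_length(self) -> int:
        """Length as a count of Unicode characters."""
        return len(self.text)

    def byte_length(self) -> int:
        """Length in bytes of the legacy code page encoding."""
        return len(self.text.encode(self.codepage))

    def replace_field(self, start_byte: int, end_byte: int,
                      new_text: str) -> "LegacyString":
        """Replace a byte-delimited field, refusing edits that would
        alter the statement's fixed byte length."""
        raw = self.text.encode(self.codepage)
        new_raw = new_text.encode(self.codepage)
        if len(new_raw) != end_byte - start_byte:
            raise ValueError("edit would change the fixed byte length")
        merged = raw[:start_byte] + new_raw + raw[end_byte:]
        return LegacyString(merged.decode(self.codepage), self.codepage)

s = LegacyString("Café", "cp500")
print(s.char_length(), s.byte_length())   # 4 characters, 4 EBCDIC bytes
```

Because replace_field operates on the encoded bytes and checks the replacement's encoded width, field and statement lengths in the legacy encoding remain constant across edits.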