Technical Field
Embodiments described herein generally relate to processors. In particular, embodiments described herein generally relate to processors having instructions that are useful for transcoding variable length code points of Unicode characters.
Background Information
Computers fundamentally process binary numbers. They generally do not process the various different types of letters, decimal numbers, symbols, or other characters used in the various different languages and traditions. Rather, these different letters, decimal numbers, symbols, and other characters are assigned and represented by binary numbers.
The Universal Character Set (UCS) is a standardized set of characters upon which several character encodings are based. UCS is defined by the International Standard ISO/IEC 10646, Information technology—Universal multiple-octet coded character set (UCS), along with amendments to this standard. The UCS includes a large number of different characters including the letters, numbers, symbols, ideograms, logograms, and other characters from the most prevalent languages, scripts, and traditions of the world. Each of these characters is identified by an integer number that is referred to as that character's code point.
The Unicode Standard (Unicode) has been developed in tandem with UCS. Unicode represents a computing industry standard for the consistent digital encoding, representation, and handling of the characters of the UCS. Unicode reportedly provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is currently used by almost all modern computers and serves as a foundation for processing text on the Internet.
Unicode may be implemented through various different character encodings. One commonly used encoding is UTF-8 (UCS Transformation Format-8-bit). UTF-8 is a variable-length (e.g., variable number of bytes) encoding that can represent every character in Unicode. Each Unicode character is represented with between one and four bytes. The bytes are also referred to as octets in the Unicode standard. UTF-8 uses one byte to represent any of the ASCII characters. UTF-8 is backward-compatible with ASCII, and these characters have the same encoding in both ASCII and UTF-8. Non-ASCII characters are represented by two, three, or four bytes. It is estimated that UTF-8 is the predominant encoding of Unicode in web pages on the world-wide web, with more than half of all web pages estimated to be encoded using UTF-8. UTF-8 is also widely used by e-mail programs to display and create mail. Increasingly, UTF-8 is also being used to encode Unicode characters in certain programming languages, operating systems, application programming interfaces (APIs), and software applications.
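By way of illustration only, the variable-length property of UTF-8 may be sketched in software as follows. The helper name utf8_length and the four sample characters are assumptions chosen for this example and are not part of any embodiment; the byte-count thresholds follow the published UTF-8 definition.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point.

    Hypothetical helper for illustration; not taken from the embodiments.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:      # ASCII range: one byte
        return 1
    if code_point <= 0x7FF:     # two bytes
        return 2
    if code_point <= 0xFFFF:    # three bytes
        return 3
    return 4                    # four bytes


# One sample character from each length class (arbitrary choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    # The built-in encoder agrees with the threshold table above.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), hex {encoded.hex()}")

# ASCII text has the same byte sequence in ASCII and in UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

The final assertion reflects the backward compatibility with ASCII noted above.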
Another commonly used encoding is UTF-16 (UCS Transformation Format-16-bit). UTF-16 is a variable-length (e.g., variable number of bytes) encoding that can represent every character in Unicode. Each Unicode character is represented with either two or four bytes. UTF-16 is not backward-compatible with ASCII. UTF-16 is commonly used as the internal form of Unicode in certain programming languages, such as, for example, Java, C#, and JavaScript, and in certain operating systems. Various other known encodings are also used (e.g., UTF-2, UTF-32, UTF-1, etc.).
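As a minimal sketch of the two-or-four-byte behavior of UTF-16, the following example computes the 16-bit code units for a code point. The function name utf16_code_units is a hypothetical name introduced here; the surrogate-pair arithmetic follows the published UTF-16 definition rather than any particular embodiment.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the 16-bit UTF-16 code units for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: a single code unit (two bytes).
        return [code_point]
    # Supplementary planes: a surrogate pair (four bytes).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return [high, low]


units = utf16_code_units(0x1F600)
# Cross-check against the built-in big-endian UTF-16 encoder.
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
assert as_bytes == "\U0001F600".encode("utf-16-be")
print([hex(u) for u in units])
```

Even an ASCII character such as "A" occupies two bytes here, which is one way to see why UTF-16 is not backward-compatible with ASCII.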
Commonly, in order to facilitate processing within computer systems, UTF-8, UTF-16, or other encoded data may be transcoded into another format, such as, for example, Unicode. Transcoding represents the direct digital-to-digital data conversion of one encoding to another. Such transcoding may be done for various reasons, such as, for example, to help improve the efficiency or speed of processing the data, or to convert the encoded data to a format used by software or to a more widely recognized format. Often a large amount of processing is needed to transcode the content of web pages, documents formatted in mark-up languages, XML documents, and the like, from one encoding (e.g., UTF-8) into standard Unicode characters or other formats. Due to the prevalence of such transcoding and/or its potential impact on performance, new and useful approaches for transcoding would offer advantages.
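To give a sense of the per-character work involved, a conventional software transcoding of UTF-8 bytes into Unicode code points might be sketched as follows. This is a scalar, byte-at-a-time illustration under stated assumptions, not the approach of the embodiments; the function name utf8_to_code_points is hypothetical.

```python
from typing import List


def utf8_to_code_points(data: bytes) -> List[int]:
    """Transcode UTF-8 bytes into a list of Unicode code points.

    Hypothetical scalar sketch; raises ValueError on malformed input.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte gives the payload bits, the number of continuation
        # bytes, and the smallest code point allowed at this length.
        if lead < 0x80:
            cp, extra, minimum = lead, 0, 0x0
        elif 0xC0 <= lead < 0xE0:
            cp, extra, minimum = lead & 0x1F, 1, 0x80
        elif 0xE0 <= lead < 0xF0:
            cp, extra, minimum = lead & 0x0F, 2, 0x800
        elif 0xF0 <= lead < 0xF8:
            cp, extra, minimum = lead & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + 1 + extra]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)  # append six payload bits
        # Reject overlong forms, surrogates, and out-of-range values.
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")
        code_points.append(cp)
        i += 1 + extra
    return code_points


text = "A\u00e9\u20ac\U0001F600"
decoded = utf8_to_code_points(text.encode("utf-8"))
assert decoded == [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in decoded])
```

Each character requires a data-dependent branch on the lead byte followed by several shift, mask, and validation steps, which suggests why transcoding large documents can consume a significant amount of processing.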