1. Field of the Invention
The invention relates to digital television systems and more particularly to hashing methods for efficient storage and quick retrieval of characters for languages with large character sets in a set-top box of the system.
2. Description of the Related Art
The emerging technology of digital television systems holds a promise of allowing a television set to provide a vast array of new services. Digital television systems are capable of displaying text and graphic images in addition to typical video program streams. An example of digital television services which make use of text and graphic image display is interactive television. Proposed features of interactive television accommodate a variety of marketing, entertainment and educational capabilities such as allowing a user to order an advertised product or service, compete against contestants in a game show, or request specialized information regarding a televised program.
Typically, the interactive functionality is controlled by a set top box connected to a television set. The set top box executes an interactive program written for a television broadcast. The interactive functionality is displayed upon the television set screen and may include icons or menus to allow a user to make selections via the television's remote control.
Interactive television, and other broadcast communication systems in general which deliver textual data, often require support for multiple languages. For example, program guides are advertising tools used by program providers, who may desire to include descriptions of television programs in multiple languages, such as English mixed with Japanese. In addition, end users may receive data from non-native regions, such as a Chinese broadcast being received by a television viewer in India.
It is highly desirable that a set top box owner be able to use the same set top box to receive textual information in more than one language. That is, it is desirable that the user not have to buy a different set top box to receive textual information in each different language. A language, in this context, may be defined as a written system of representing thoughts, ideas, actions, etc. A language includes, inter alia, a grammar, characters, and words.
The characters and symbols used in writing a language are commonly referred to as a "writing system", or "script." Many languages, such as Western European languages, are written with alphabetic and numeric characters. However, Japanese, for example, is written with phonetic Hiragana and Katakana characters as well as alphabetic and numeric characters from Western languages and the ideographic Kanji characters which are largely taken from the Chinese language. The scripts of many languages may share common characters, as in the Western European languages.
The textual information received by a set top box includes strings of characters. A "character" is an atomic symbol in a writing system. In alphabetic languages, this symbol consists of a single letter of the alphabet. In ideographic languages such as Chinese and Japanese, a character could be alphabetic, phonetic, or ideographic.
A "character set" is a group of characters used to represent a particular language or group of languages. A "character encoding" is a system for numerically representing the characters of a character set. A well-known example of a character encoding is the ASCII character encoding. The numeric value associated with a given character in a character set is referred to as a "code point", or "encoding value." The set of numeric values associated with a character set is referred to as a "code set."
The ASCII character encoding provides an encoding for a character set of the alphabet, numbers, and other characters used in the English language. The ASCII code set includes the values 0-127. Thus, each ASCII character has a unique assigned value which may be contained in 7 bits of a byte of data. For example, the character `A` has a value 0x41 associated with it in ASCII. Many software library routines have been developed to manipulate, read and write strings of ASCII characters.
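By way of illustration only, the relationship between an ASCII character and its code point may be sketched as follows. The helper name `is_ascii_code_point` is hypothetical and is introduced solely for this example.

```python
def is_ascii_code_point(value: int) -> bool:
    """Return True if value lies in the ASCII code set (0-127).

    Hypothetical helper, introduced only for this illustration.
    """
    return 0 <= value <= 127


code_point = ord("A")                  # numeric value assigned to the character 'A'
print(hex(code_point))                 # 0x41
print(code_point.bit_length() <= 7)    # True: the value fits in 7 bits
print(is_ascii_code_point(code_point)) # True
print(is_ascii_code_point(128))        # False: outside the ASCII code set
```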
Other character encodings exist which provide support for multiple languages, such as the ISO Latin character encoding, which is used to represent many of the alphabetic languages in the world. ISO Latin includes a Basic Latin portion (values 0-127) and an Extended Latin portion (values 128-255).
Another example of a character encoding is the Japanese Industrial Standard (JIS) character encoding. JIS uses a 7-bit multi-byte encoding mechanism to represent Japanese text.
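As a rough illustration of a 7-bit multi-byte encoding, the following sketch uses the `iso2022_jp` codec of the Python standard library. It is assumed here, for purposes of the example only, that this codec is a reasonable stand-in for the JIS encoding described above. The sample character is arbitrary.

```python
text = "漢"  # an arbitrary Kanji character chosen for the example

# Assumption: Python's iso2022_jp codec stands in for the JIS encoding.
encoded = text.encode("iso2022_jp")

# Every byte of the encoded form has its high bit clear, i.e. is a 7-bit value.
seven_bit = all(b < 0x80 for b in encoded)

print(len(encoded) > 1)  # True: several bytes are needed for one character
print(seven_bit)         # True
print(encoded.decode("iso2022_jp") == text)  # True: the encoding round-trips
```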
A character encoding which enables the representation of characters from many different languages and character sets using a single encoding scheme is referred to as a "multi-lingual" character encoding. An example of a multi-lingual character encoding is the EUC (Extended UNIX Code) character encoding standard. EUC is typically used to represent ideographic Asian languages in the UNIX environment. EUC combines single byte ASCII characters with multi-byte ideographic character encodings. However, EUC allows only a few languages to be encoded at a time.
Developing new software library routines to deal with strings in multiple character encodings and/or multiple languages may be prohibitive in terms of cost and time. Furthermore, it may be prohibitive in terms of storage space and/or code maintenance to support libraries to handle characters in multiple character encodings and languages.
Some scripts combine characters to form composed characters whose shape is determined by the relative positions of the characters, i.e., the context of the characters. Examples of these "contextual scripts" are scripts for the Arabic, Hebrew, Thai, and all Indic languages. In contrast, "non-contextual scripts", such as the Roman alphabet used in Western languages, represent each character as a separate object of fixed shape, independent of the position in a word and of the neighboring characters.
Each character of a character set has a unique shape which distinguishes it from other characters in the character set, that is, which allows a reader to distinguish the character from other characters and thus unambiguously convey information. The shape assigned to a particular character is referred to as the "glyph" of the character. The English letter `A`, for example, has a unique glyph which makes it recognizable from other characters.
Glyphs may have a particular style associated with them. That is, an English `A` may be written in many different styles, such as in a block style or a calligraphic style. However, the style maintains the basic shape of the character such that the glyph is still recognizable as an `A.` A collection of glyphs sharing a common style is referred to as a "font." Examples of common fonts are Courier, Times Roman, and Helvetica.
A variety of glyph representation schemes exist. A common scheme is a bitmap glyph, or font. In a bitmap font, the glyph of a given character includes a sequence of bits corresponding to an array of pixels on a display screen. Each bit indicates if the corresponding pixel is to be illuminated or not based on the value of the bit. The pixel array has a characteristic width and height. For example, a glyph may be 24 pixels wide and 24 pixels high. In this example, 576 bits, or 72 bytes, of storage are required to store the glyph.
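The storage arithmetic above, and the correspondence between bits and pixels, may be sketched as follows. The sketch assumes one bit per pixel, packed row by row with the most significant bit first and no padding between rows; actual bitmap font formats may differ. The function names are hypothetical.

```python
def glyph_storage_bytes(width: int, height: int) -> int:
    """Bytes needed for a one-bit-per-pixel glyph, rounded up to whole bytes."""
    bits = width * height
    return (bits + 7) // 8


def pixel_is_lit(bitmap: bytes, width: int, x: int, y: int) -> bool:
    """Return True if the pixel at column x, row y is to be illuminated.

    Assumes row-major packing, most significant bit first, no row padding.
    """
    index = y * width + x
    return bool(bitmap[index // 8] & (0x80 >> (index % 8)))


size = glyph_storage_bytes(24, 24)
print(size)  # 72

bitmap = bytearray(size)  # all pixels dark
bitmap[0] = 0x80          # illuminate only the top-left pixel
print(pixel_is_lit(bitmap, 24, 0, 0))  # True
print(pixel_is_lit(bitmap, 24, 1, 0))  # False
```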
If the glyphs in a font are the same number of pixels in width, the font is said to be a non-proportional font. If the width is variable, the font is said to be a proportional font. Another common glyph representation scheme is an outline font. A property of outline fonts is that they typically facilitate scaling and rotating.
A set top box receives text encoded according to a character encoding and displays the text on a television. The act of processing the image of a character, i.e., the glyph associated with the character, and displaying the character is referred to as "rendering." A rendering program must use font type information, size information, and potentially contextual information in order to properly render a given character in a given script.
Transmission bandwidth in digital broadcast systems is a precious commodity. Hence, there is a motivation to minimize the number of bytes transmitted to the set top box with regard to the displaying of text.
Languages which have a relatively large number of characters, such as Chinese, Japanese, and Korean, pose particular problems in the context of text processing and rendering in digital television systems. One problem is the long time required to search through such a large set of characters to find the glyph associated with a given code point. The combined Chinese, Japanese, and Korean character sets constitute over 120,000 characters. A second problem is that the amount of memory required to store fonts and/or the transmission bandwidth required to transmit fonts may be costly.
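The scale of the memory problem may be estimated with a short calculation. The sketch below assumes, for illustration only, an uncompressed bitmap font in which every glyph is 24 pixels wide and 24 pixels high at one bit per pixel, and it takes the figure of 120,000 characters from the text above as a round number.

```python
CHARACTER_COUNT = 120_000   # round figure taken from the text above
GLYPH_WIDTH = 24            # assumed glyph width in pixels
GLYPH_HEIGHT = 24           # assumed glyph height in pixels

# One bit per pixel, rounded up to whole bytes.
bytes_per_glyph = (GLYPH_WIDTH * GLYPH_HEIGHT + 7) // 8

# Total uncompressed storage for one font covering every character.
font_bytes = CHARACTER_COUNT * bytes_per_glyph

print(bytes_per_glyph)         # 72
print(font_bytes)              # 8640000
print(font_bytes / 1_000_000)  # 8.64 (millions of bytes)
```

Under these assumptions a single font at a single size occupies roughly 8.6 million bytes, before any additional fonts or sizes are considered.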
In many circumstances, set top boxes are a commodity item. Hence, a multi-lingual capable set top box which costs significantly more than a uni-lingual set top box may not be accepted readily in the marketplace. On the other hand, the set top box must deliver performance which is acceptable at a given cost. Thus, the tradeoff of cost versus performance figures into the design of a set top box.
Two components of a typical set top box which have a large bearing on its cost are its memory and processor. If multiple languages are supported, particularly if the languages have a large number of characters, such as Chinese, Japanese, or Korean, a large amount of memory may be required to store the fonts for the languages. More powerful processors provide higher performance of functions such as character lookup and rendering, but at a greater cost.