1. Technical Field
The invention relates to character mapping. More particularly, the invention relates to a method and apparatus using a list of <minimum, size, gap, offset> quadruples to encode Unicode characters in an upper/lower case character mapping.
2. Description of the Prior Art
The Unicode standard, version 2.0 (UCS2) specifies character coding. For the majority of Unicode characters, each case mapping yields exactly one Unicode character. Therefore, a solid implementation of a basic 1-to-1 character case mapping is an important foundation for a general locale-sensitive, context-based Unicode string case mapping. There are 65,536 (63,488 if one ignores surrogates) possible bit combinations of UCS2. The Unicode 3.0 standard allocates 57,709 characters. There are 10,617 entries in the Unicode character database (based on the Unicode 3.0 standard). There are 1,398 UCS2 characters which have a different case (see, for example, FIGS. 1-3).
It is highly desirable to provide basic 1-to-1 character case mapping information while using only a small amount of memory and at a reasonable speed. Character mapping in Unicode is, at best, an onerous task. One problem with mapping Unicode characters is that of mapping between upper and lower case characters.
The problem may be stated as follows:
Given one Unicode string as input, output the upper (or lower) case string:
"unicode" -> "UNICODE"
Various attempts have been made to solve this problem. These approaches include: a conditional offset; a flat array; an indirect array; a list of a mapping pair; and a compact array.
The conditional offset approach was widely used in early 7-bit US-ASCII implementations.
Data Structure and Algorithm
To upper case:
output = (('a' <= input) && (input <= 'z')) ? (input - 'a' + 'A') : input;
To lower case:
output = (('A' <= input) && (input <= 'Z')) ? (input - 'A' + 'a') : input;
Result
This algorithm is very simple and compact. It basically compresses the case mapping information into three bytes for 7-bit US-ASCII. This algorithm relies on the following important characteristics of the 7-bit US-ASCII definition:
1. All the lower case characters are encoded in a continuous range;
2. All the upper case characters are encoded in a continuous range; and
3. All the lower case characters have an equal offset to corresponding upper case characters.
The assumptions above do not hold for most other character sets (charsets). The same approach applied to ISO-8859-1 requires three if statements. This approach is not general enough for other charsets. For Unicode, the required number of if statements is too large for any practical implementation of this approach.
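The three range checks mentioned for ISO-8859-1 can be sketched as follows. This is an illustrative sketch rather than code from any particular implementation; the function name latin1_to_upper is an assumption, and the ranges are the lower case letter ranges of ISO-8859-1 (0x61-0x7A, 0xE0-0xF6, 0xF8-0xFE), each of which lies 0x20 above its upper case counterpart.

```c
/* Sketch: to-upper for ISO-8859-1 using three conditional-offset range
   checks. latin1_to_upper is a hypothetical name used for illustration.
   0xDF (sharp s) and 0xFF (y with diaeresis) have no upper case form
   inside ISO-8859-1, so they map to themselves. */
unsigned char latin1_to_upper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);   /* basic Latin letters */
    if (c >= 0xE0 && c <= 0xF6)
        return (unsigned char)(c - 0x20);   /* a-grave .. o-diaeresis */
    if (c >= 0xF8 && c <= 0xFE)
        return (unsigned char)(c - 0x20);   /* o-stroke .. thorn */
    return c;                               /* no mapping in this charset */
}
```

Each additional discontinuous range costs another comparison, which is why the approach does not scale to Unicode.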
The conditional offset approach lacks flexibility and is not general enough for other charsets. In the early 1980s, 8-bit charsets, such as ISO-8859-1, came into use. With the increased use of these charsets came a more flexible approach to solving the problem of character mapping for upper/lower case characters, i.e. the flat array. The flat array requires considerable memory, but provides improved flexibility and performance. In fact, the flat array is still the preferred implementation for most single byte charsets.
Data Structure and Algorithm
Use a flat array to contain the information. Use the input byte as the array index. Return the value of the array element as the output.
output=ToUpper[input];
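A minimal sketch of the flat array lookup, assuming a single byte charset. The helper names init_to_upper and flat_to_upper are illustrative, and for brevity only the US-ASCII letters are filled in; a real table would carry the mappings of the whole charset.

```c
/* Sketch: flat-array case mapping for a single byte charset.
   The table is indexed directly by the input byte. */
unsigned char ToUpper[256];

/* Fill the table once: identity for every byte, then patch the
   US-ASCII lower case range (the only range populated in this sketch). */
void init_to_upper(void)
{
    for (int i = 0; i < 256; i++)
        ToUpper[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; c++)
        ToUpper[c] = (unsigned char)(c - 'a' + 'A');
}

/* The lookup itself is a single array access with no branches. */
unsigned char flat_to_upper(unsigned char input)
{
    return ToUpper[input];
}
```

The table occupies 256 bytes per direction, which matches the memory estimate for the single byte case.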
Result
This approach uses more memory than the conditional offset approach. However, the performance is faster. It is a good choice when the total possible number of inputs is limited to less than or equal to 256 inputs. This approach is not practical for any multi-byte charset, such as Shift_JIS or Big 5. It therefore is not practical for use with Unicode. If this approach is used for multibyte charsets, then a significant amount of memory is required for the array.
For a single byte, the required memory for both an upper case and a lower case mapping is:
2 [upper and lower] × 1 [sizeof(char) = 1 byte] × 256 [2^8 = 256] = 512 bytes.
For Unicode or other two byte charset, the required memory becomes:
2 [upper and lower] × 2 [sizeof(UCS2) = 2 bytes] × 65,536 [2^16 = 65,536] = 262,144 bytes.
While this approach is preferred for most single byte charset implementations, it is not a practical approach to implement Unicode or other multi-byte charsets.
After studying the distribution of the case characters in Unicode, it can be noted that all of the characters are encoded in several localized regions of the 16-bit space. If the 16-bit space is divided equally into 256 blocks, each block holds 256 Unicode code points, and all of the case characters are encoded in eleven of these blocks (see, for example, FIGS. 1-3).
The remaining 245 blocks do not encode any 1-to-1 case characters. Thus, it is possible to reduce the size of the memory required by the flat array approach by applying one level of indirection. The indirect array approach uses one flat array for each block and saves 245 flat arrays for those blocks which do not have 1-to-1 case characters.
The distribution of case characters in these blocks is set forth in Table 1 below.
Data Structure and Algorithm
if (array[input >> 8])
    output = array[input >> 8][input & 0x0FF];
else
    output = input;
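The indirection can be sketched as a first-level array of 256 block pointers, where a null pointer marks a block without 1-to-1 case characters. The names below are illustrative assumptions, and only block 0x00 (with only the US-ASCII letters) is populated to keep the sketch short.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: one 256-entry table per block that contains case characters. */
static uint16_t block00[256];

/* First-level index: NULL for blocks without case characters. */
static const uint16_t *upper_blocks[256];

/* Populate block 0x00 with identity plus the US-ASCII lower case letters. */
void init_upper_blocks(void)
{
    for (int i = 0; i < 256; i++)
        block00[i] = (uint16_t)i;
    for (int c = 'a'; c <= 'z'; c++)
        block00[c] = (uint16_t)(c - 'a' + 'A');
    upper_blocks[0x00] = block00;
}

/* High byte selects the block, low byte selects the entry in it. */
uint16_t indirect_to_upper(uint16_t input)
{
    const uint16_t *block = upper_blocks[input >> 8];
    return block ? block[input & 0x00FF] : input;
}
```

Characters in unpopulated blocks, such as Han characters, fall through the null check and are returned unchanged.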
Result
This approach requires:
2 [lower and upper] × 11 ["cut -d; -f1,13,14 UnicodeData-Latest.txt | egrep -v ";;$" | cut -c1-2 | uniq | wc -l" => 11] × 2 [sizeof(UCS2) = 2] × 256 [block size is 256] = 11,264 bytes
for all the eleven blocks. Additional memory may be needed for the first index array. However, this step could be replaced by several if statements. The performance of this algorithm is slower than both algorithms above, but the amount of memory that is required is greatly reduced.
Because it is known that there are only 1,398 1-to-1 case characters, one can also compress this information into an array of <input, output> pairs and use a binary search to find the desired information.
Data Structure and Algorithm
if (entry = BinarySearch(input, list))
    output = entry->out;
else
    output = input;
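A minimal sketch of the pair list lookup, using the standard C library bsearch in place of the BinarySearch routine named above. The type CasePair, the function pair_to_upper, and the four sample entries are illustrative assumptions; a full list would hold one pair per case character.

```c
#include <stdint.h>
#include <stdlib.h>

/* One <input, output> pair: 2 x sizeof(UCS2) = 4 bytes. */
typedef struct {
    uint16_t in;
    uint16_t out;
} CasePair;

/* Sample to-upper entries only; the list must stay sorted by `in`. */
static const CasePair upper_pairs[] = {
    {0x0061, 0x0041},   /* LATIN SMALL LETTER A */
    {0x0062, 0x0042},   /* LATIN SMALL LETTER B */
    {0x00E9, 0x00C9},   /* LATIN SMALL LETTER E WITH ACUTE */
    {0x03B1, 0x0391},   /* GREEK SMALL LETTER ALPHA */
};

/* bsearch passes the key first and the array element second. */
static int compare_pair(const void *key, const void *elem)
{
    uint16_t k = *(const uint16_t *)key;
    uint16_t e = ((const CasePair *)elem)->in;
    return (k > e) - (k < e);
}

/* Return the mapped character, or the input when no pair matches. */
uint16_t pair_to_upper(uint16_t input)
{
    const CasePair *entry = bsearch(&input, upper_pairs,
                                    sizeof upper_pairs / sizeof upper_pairs[0],
                                    sizeof upper_pairs[0], compare_pair);
    return entry ? entry->out : input;
}
```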
Result
Because a search process is required, the algorithm is slower, but the amount of memory required is only:
2 [in and out] × 2 [sizeof(UCS2) = 2] × 1,398 [total number of entries in UnicodeData-Latest.txt that contain case information (see above)] = 5,592 bytes.
Because there are 1,398 entries, the depth of the binary search is 10 or 11 [2^10 < 1,398 < 2^11].
Java and ICU use a compact array to encode Unicode character properties in general. They also apply the same approach for case mapping.
Data Structure and Algorithm
output = array2[array1[input >> WINDOW_SIZE_BITS]][input & WINDOW_MASK];
See FIG. 4. In ICU 1.3.1 (icu/source/common/uchar.c), this approach first checks the case of the input character itself. This check is done by one compact array (array indices and values in uchar.c). For the ToUpper operation, if the case of the input character is lower case, then this approach uses a second compact array (caseIndex and caseValue in uchar.c) to determine the value of the other case.
Result
In ICU-icu/source/common/uchar.c [based on ICU-1.3.1], this approach uses 64 as the window size. Table 2 shows the size of the table.
The number of bytes listed here represents the total memory needed to implement both case mapping and character category checking. Therefore, it is not fair to say the required memory is 16,890 bytes. However, the required memory is at least 5,632 bytes for mapping to "the other case."
It would be desirable to provide basic 1-to-1 character case mapping information while using only a small amount of memory and at a reasonable speed.
The invention provides basic 1-to-1 character case mapping information while using only a small amount of memory and at a reasonable speed. The solution to this problem can be expressed as follows:
Given a Unicode character as input, convert it to the corresponding upper case character in Unicode.
Given a Unicode character as input, convert it to the corresponding lower case character in Unicode.
Implement the functionality above by using a small amount of memory with fast performance.
The presently preferred embodiment of the invention provides a technique that encodes the case mapping into a sequential list of <Minimum, Size, Gap, Offset> quadruples. Every quadruple represents a range of characters. The Minimum and Size values represent the boundary of the range. The Gap indicates which characters in the range have a valid mapping. Thus, if the character minus the Minimum is a multiple of the Gap, then the character has a mapping in the quadruple. Otherwise, the character does not have a mapping. If the character has a mapping, then the mapped value is the character plus the Offset.
The preferred algorithm for the to-lower (or to-upper) function is as follows:
1. Given input character C and the case mapping encoded in the sequential list L of <MIN, SIZE, GAP, OFFSET> quadruples, use sixteen bits each to encode MIN and OFF, and eight bits each to encode SIZE and GAP.
2. Binary search for C in the sequential list L of <MIN, SIZE, GAP, OFFSET> quadruples. If (MIN <= C) && (C <= (MIN + SIZE)), the matching quadruple Q has been found. Otherwise, continue the binary search. If no matching quadruple Q can be found, the mapped value of character C is C itself (a non-caseable character, such as a digit or a Chinese Han character).
3. If ((C - MIN) % GAP) is equal to zero, which is always the case when GAP in Q is 1, the mapped value of character C is (C + OFF), where OFF may be a negative number. Otherwise, the mapped value of character C is C itself.
4. By using this method, the whole Unicode 2.0 case mapping can be encoded in 618 bytes (103 quadruples, 6 bytes each) for the to-upper mapping, and 576 bytes (96 quadruples, 6 bytes each) for the to-lower mapping.
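Steps 1 through 3 can be sketched in C as follows. This is a sketch under stated assumptions, not the patented implementation itself: the names CaseQuad, quad_map, upper_quads, and sample_to_upper are hypothetical, and the table holds only four sample to-upper quadruples (US-ASCII, two Latin-1 ranges, and the alternating upper/lower pairs U+0100 through U+012F of Latin Extended-A) rather than the full list of 103.

```c
#include <stdint.h>
#include <stddef.h>

/* One <MIN, SIZE, GAP, OFFSET> quadruple: 16 + 8 + 8 + 16 bits. */
typedef struct {
    uint16_t min;     /* first character of the range */
    uint8_t  size;    /* range is min .. min + size, inclusive */
    uint8_t  gap;     /* every gap-th character from min has a mapping */
    int16_t  offset;  /* added to the character; may be negative */
} CaseQuad;

/* Sample to-upper quadruples, sorted by min and non-overlapping. */
static const CaseQuad upper_quads[] = {
    {0x0061, 25, 1, -32},   /* a..z                  -> A..Z          */
    {0x00E0, 22, 1, -32},   /* U+00E0..U+00F6        -> U+00C0..00D6  */
    {0x00F8,  6, 1, -32},   /* U+00F8..U+00FE        -> U+00D8..00DE  */
    {0x0101, 46, 2,  -1},   /* odd U+0101..U+012F    -> preceding even */
};

/* Binary search the list for the range containing c, then apply GAP. */
uint16_t quad_map(const CaseQuad *list, size_t count, uint16_t c)
{
    size_t lo = 0, hi = count;              /* half-open span [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const CaseQuad *q = &list[mid];
        if (c < q->min) {
            hi = mid;                       /* c lies before this range */
        } else if (c > q->min + q->size) {
            lo = mid + 1;                   /* c lies after this range */
        } else if ((c - q->min) % q->gap == 0) {
            return (uint16_t)(c + q->offset);
        } else {
            return c;                       /* in range, but skipped by GAP */
        }
    }
    return c;                               /* no quadruple: not caseable */
}

/* Convenience wrapper over the sample table. */
uint16_t sample_to_upper(uint16_t c)
{
    return quad_map(upper_quads, sizeof upper_quads / sizeof upper_quads[0], c);
}
```

With these sample entries, sample_to_upper(0x0101) yields 0x0100 because (0x0101 - 0x0101) % 2 is zero, while 0x0102 lies inside the same range but is skipped by the GAP test and is returned unchanged. On common compilers the struct occupies 6 bytes, in line with the per-quadruple size given in step 4, although the exact layout is implementation-defined.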
The invention takes account of the following characteristics of the code point assignment of casing characters in the Unicode standard to make the compression more efficient:
Some scripts/blocks do not distinguish upper case and lower case at all. For those scripts/blocks, there is no need to encode upper/lower case mappings.
When the Unicode standard assigns a code point for a caseable character, it usually either:
assigns a whole group of lower case characters together and puts the corresponding upper case characters together in the same sequence; or
assigns the upper case (or lower case) character next to it and repeats the same kind of assignment for a block of characters.
There are a limited number of such groups, such that the depth of the binary search is limited.
Every group is smaller than 256 characters, such that eight bits can be used to represent the SIZE in the quadruple.
The GAP is smaller than 256 characters, such that eight bits can be used to represent the GAP in the quadruple.
FIG. 1 is a table showing Unicode character codes for basic Latin characters;
FIG. 2 is a table showing Unicode character codes for Latin-1 supplement characters;
FIG. 3 is a table showing Unicode character codes for Latin extended-A characters;
FIG. 4 is a block schematic diagram showing operation of a compact array;
FIG. 5 is a block schematic diagram of a method and apparatus using a list of <minimum, size, gap, offset> quadruples to encode Unicode characters in an upper/lower case character mapping according to the invention;
FIG. 6 is a flow diagram of a binary tree of SSGO record algorithm according to the invention;
FIG. 7 is a flow diagram of a method and apparatus using a list of <minimum, size, gap, offset> quadruples to encode Unicode characters in an upper/lower case character mapping according to the invention;
FIG. 8 is a flow diagram of an optimization on ASCII range characters according to the invention;
FIG. 9 is a flow diagram of an optimization on character block without case information according to the invention; and
FIG. 10 is a block schematic diagram of a cache for optimizing a binary tree of SSGO record result according to the invention.