In recent years, XML (Extensible Markup Language) has attracted attention as a technique for structuring various kinds of data and handling the structured data in a uniform manner. XML recommends the use of UTF8 (8-bit UCS Transformation Format), a character encoding scheme for handling, in a uniform manner, the characters used in countries around the world. In UTF8, each alphabetic character expected to be used frequently is expressed by one byte, while each Japanese character is expressed by about three bytes. Thus, the data size of a character in UTF8 varies depending on the kind of character.
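The variation in data size can be observed with a minimal sketch such as the following. The class name Utf8Lengths and the method utf8Length are hypothetical names chosen for this illustration; only standard Java library calls are used.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Returns the number of bytes the given text occupies when encoded as UTF-8. */
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // alphabetic character: prints 1
        System.out.println(utf8Length("\u3042")); // hiragana character U+3042: prints 3
    }
}
```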
In recent years, many APIs for efficiently analyzing and editing XML documents have been provided in the Java (Registered Trademark) language. Java is a high-level programming language developed by Sun Microsystems, originally designed for handheld devices and set-top boxes and later modified to take advantage of the World Wide Web. Java is a general-purpose, object-oriented language of the Web. In the Java (Registered Trademark) language, however, characters are ordinarily handled as data in UTF16 (16-bit UCS Transformation Format). Therefore, a procedure for converting from UTF8 to UTF16 is required for manipulating an XML document by a program written in the Java (Registered Trademark) language. Further, a procedure for converting from UTF16 to UTF8 is required for enabling characters processed by a program written in the Java (Registered Trademark) language to be output as an XML document.
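The two conversion procedures can be sketched with the standard Java library as follows. This is an illustration only; the class name Utf8RoundTrip and the method names utf8ToUtf16 and utf16ToUtf8 are assumptions made for this example and do not come from any cited document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    /** Input side: decodes UTF-8 bytes into a Java String, which holds UTF-16 code units. */
    public static String utf8ToUtf16(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Output side: encodes a Java String back into UTF-8 bytes. */
    public static byte[] utf16ToUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of an XML fragment containing one Japanese character (U+3042).
        byte[] input = {'<', 'a', '>', (byte) 0xE3, (byte) 0x81, (byte) 0x82, '<', '/', 'a', '>'};
        String text = utf8ToUtf16(input);   // 10 bytes become 8 UTF-16 code units
        byte[] output = utf16ToUtf8(text);  // converted again for output
        System.out.println(text.length());
        System.out.println(Arrays.equals(input, output));
    }
}
```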
The following documents are considered:
[Non-patent document 1] Internet URL “http://cvs.apache.org/viewcvs.cgi/xml-xerces/java/src/org/apache/xerces/impl/io/UTF8Reader.java?rev=1.7&content-type=text/vnd.viewcvs-markup”
[Non-patent document 2] S. Makino, K. Tamura, T. Imamura, and Y. Nakamura. Implementation and Performance of WS-Security, IBM Research Report RT0546, 2003.
[Non-patent document 3] J. Knoop, O. Ruthing, and B. Steffen. Optimal code motion: theory & practice. ACM TOPLAS, 18(3):300-324, 1996.
[Non-patent document 4] R. Bodik, R. Gupta, and M. L. Soffa. Complete removal of redundant expressions. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pages 1-14, 1998.
Conventionally, a technique for converting a string of successively arranged UTF8 characters into a string of UTF16 characters is used (see non-patent document 1). In this technique, UTF8 characters are read out from the string one after another, the data length and the kind of each read character are determined, and one of several different conversion procedures is performed according to the determination results. Also, a library program for manipulating UTF8 characters in the Java (Registered Trademark) language without converting them into UTF16 has been proposed (see non-patent document 2).
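A conversion of this kind can be sketched as follows. This is a simplified illustration written for this description, not the code of non-patent document 1: the class name BranchingUtf8Decoder is hypothetical, the input is assumed to be well-formed UTF8, and no error handling for malformed sequences is attempted.

```java
import java.nio.charset.StandardCharsets;

public class BranchingUtf8Decoder {
    /**
     * Converts well-formed UTF-8 bytes to UTF-16 one character at a time.
     * The lead byte determines the data length, and a different conversion
     * procedure is selected for each length.
     */
    public static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            if (b0 < 0x80) {                    // 1 byte: 0xxxxxxx
                out.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {   // 2 bytes: 110xxxxx 10xxxxxx
                int cp = ((b0 & 0x1F) << 6) | (in[i + 1] & 0x3F);
                out.append((char) cp);
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                int cp = ((b0 & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F);
                out.append((char) cp);
                i += 3;
            } else {                            // 4 bytes: becomes a UTF-16 surrogate pair
                int cp = ((b0 & 0x07) << 18) | ((in[i + 1] & 0x3F) << 12)
                        | ((in[i + 2] & 0x3F) << 6) | (in[i + 3] & 0x3F);
                out.appendCodePoint(cp);
                i += 4;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "A\u00e9\u3042".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8).equals("A\u00e9\u3042"));
    }
}
```

Each pass through the loop evaluates up to three data-dependent conditions before any conversion work is done.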
Non-patent documents 3 and 4 are described below.
The disclosure of the invention, and the problems to be solved by the invention, are as follows.
According to the technique described in non-patent document 2, procedures for converting UTF8 characters into UTF16 characters can be eliminated. However, many APIs for manipulating UTF16 characters have already been developed, and with this technique programs cannot be developed efficiently by using those existing APIs. Also, in some cases, processing on UTF8 is less efficient than processing on UTF16. Therefore, the first challenge is to improve the efficiency of a procedure for converting from UTF8 to UTF16 while effectively utilizing the APIs already developed.
If the technique described in non-patent document 1 is used, a string of UTF8 characters can be suitably converted into a string of UTF16 characters. However, a procedure in which all characters to be manipulated are converted into UTF16 is not advantageous in terms of efficiency. For example, in a case where characters input in UTF8 are output in UTF8, redundant processing may occur: the characters are converted from UTF8 to UTF16 and thereafter converted back to UTF8. Therefore, the second challenge is to suitably select the characters to be converted.
According to the example program in non-patent document 1, converting UTF8 characters into UTF16 characters requires different procedures that are selected according to conditions including the UTF8 data size. Conditional branch instructions for evaluating these conditions are therefore generated. In a central processing unit, a conditional branch instruction may cause the instruction pipeline to be flushed, which reduces processing efficiency. Therefore, the third challenge is to minimize the occurrence of conditional branches in a conversion procedure.
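To illustrate what reducing such conditional branches can mean, the following sketch replaces the chain of data-size tests with a single table lookup. It is a generic illustration under stated assumptions, not the method of the invention or of any cited document: the class name LeadByteTable, the table LENGTH_BY_HIGH_NIBBLE, and both methods are hypothetical, and the input is assumed to be well-formed UTF8.

```java
import java.nio.charset.StandardCharsets;

public class LeadByteTable {
    // Sequence length indexed by the upper four bits of the lead byte.
    // Entries 8 to 11 correspond to continuation bytes, which never start a
    // sequence in well-formed input; they are set to 1 so a scan always advances.
    private static final int[] LENGTH_BY_HIGH_NIBBLE = {
        1, 1, 1, 1, 1, 1, 1, 1,   // 0xxxxxxx: 1 byte
        1, 1, 1, 1,               // 10xxxxxx: continuation byte (not a lead byte)
        2, 2,                     // 110xxxxx: 2 bytes
        3,                        // 1110xxxx: 3 bytes
        4                         // 11110xxx: 4 bytes
    };

    /** Returns the UTF-8 sequence length for a lead byte without any comparison on its value. */
    public static int sequenceLength(byte lead) {
        return LENGTH_BY_HIGH_NIBBLE[(lead & 0xFF) >>> 4];
    }

    /** Counts the characters (code points) in well-formed UTF-8 data. */
    public static int countCodePoints(byte[] utf8) {
        int count = 0;
        // The only remaining branch is the loop bound check.
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] utf8 = "A\u00e9\u3042".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCodePoints(utf8)); // three characters in six bytes: prints 3
    }
}
```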