With the rapid development of the Internet and its business applications, E-mail and its security have become increasingly important. SMTP (Simple Mail Transfer Protocol) is the basic electronic mail transfer protocol. The SMTP-based E-mail encryption systems PGP (Pretty Good Privacy) and PEM (Privacy Enhanced Mail), as well as MIME (Multipurpose Internet Mail Extensions) and S/MIME (Secure MIME), all provide compatibility with E-mail. Compatibility with E-mail means transforming arbitrary 8-bit byte strings or arbitrary bit-stream data carried by the E-mail into character strings drawn from a limited subset of ASCII (American Standard Code for Information Interchange). The main restrictions on this subset are: (1) the characters must be printable; (2) the characters must not be control characters or “-” (hyphen). There are 94 such ASCII characters in total; their codes are the integers from 32 through 126, with the exception of 45. E-mails written in these ASCII characters are compatible with the Internet standard SMTP and can be transferred by nearly all E-mail systems. At present, Base64 coding or QP (Quoted-Printable) coding is usually employed to provide compatibility with E-mail.
Base64 coding divides the input message M into 6-bit blocks, each of which is used as the variable of a mapping. The mapping is denoted by
    Base64[ ]: X→Y
wherein the variable (original image) set X includes all 64 6-bit symbols (denoted as the integers 0, 1, ..., 63) and Φ, representing “no data”; the image set Y includes the 26 alphabetic characters in upper and lower case, the Arabic digits 0 through 9, “+”, “/” and the filling character “=”. It is specified that, outside program statements, curly quotation marks are used as the delimiters of characters and character strings (the same applies hereinafter). The mapping rules commonly used in Base64 coding software are
    Base64[0]=“A”, ..., Base64[25]=“Z”,
    Base64[26]=“a”, ..., Base64[51]=“z”,
    Base64[52]=“0”, ..., Base64[61]=“9”,
    Base64[62]=“+”, Base64[63]=“/”
In particular, Base64[Φ]=“=” is used only when needed, so as to make the total number of output characters of the transformation a multiple of 4. The coding efficiency of Base64 coding is 6/8 = 75%, and the data expansion rate is 8/6 = 4/3 = 133.33%.
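The Base64 mapping described above can be sketched in a few lines of Python. This is an illustrative sketch only; the names B64_CH and base64_encode are chosen here and do not appear in the original text. Padding the last 6-bit block with zero bits on the right follows common Base64 practice.

```python
import string

# Image set Y without the filling character: "A".."Z", "a".."z", "0".."9", "+", "/"
B64_CH = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def base64_encode(data: bytes) -> str:
    """Map 6-bit blocks to characters and fill with "=" to a multiple of 4."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * ((-len(bits)) % 6)          # complete the last 6-bit block
    out = "".join(B64_CH[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    return out + "=" * ((-len(out)) % 4)      # Base64[Φ] = "="
```

For example, the three bytes of b"Man" form four 6-bit blocks and yield four output characters, which is the 6/8 = 75% efficiency stated above.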
QP coding divides the input message M into 8-bit blocks, each of which is used as the variable of a mapping. When the original 8-bit data is a printable character other than “=”, its image is equal to the original (i.e. there is no change); when the hexadecimal notation of the original 8-bit data is “LR” and its most significant bit is 1, its image is the three printable characters “=LR”; and the image of “=” is “=3D”. Hence, in the worst case, the coding efficiency of the QP transformation is 1/3 and the data expansion rate is 300% (this is the case when Chinese data coded in GB2312 are QP-transformed).
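The QP rules above can be illustrated with the following minimal Python sketch. The function name qp_encode is hypothetical. As simplifying assumptions, the sketch writes every byte outside the printable range 32 through 126 as “=LR”, and it ignores the line-length limits and soft line breaks of real QP software.

```python
def qp_encode(data: bytes) -> str:
    """Simplified QP mapping: printable non-"=" bytes pass through unchanged."""
    out = []
    for b in data:
        if 32 <= b <= 126 and b != ord("="):
            out.append(chr(b))            # image equals the original
        else:
            out.append(f"={b:02X}")       # "=LR"; "=" itself becomes "=3D"
    return "".join(out)

# Four bytes with the most significant bit set, as occur in GB2312 text
worst_case = bytes([0xD6, 0xD0, 0xCE, 0xC4])
encoded = qp_encode(worst_case)           # 12 characters for 4 input bytes
```

Every byte of such input expands to three characters, which is the 300% worst-case expansion mentioned above.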
Contents of Invention
The object of the present invention is to provide a digital data transformation method to replace Base64 coding or QP coding, so as to provide higher coding efficiency while remaining compatible with E-mail, to reduce the time required for transferring coded messages over the network, and to save storage space when data are stored in printable character mode.
The present invention is implemented by the following technical design: the coding transformation of arbitrary bit-stream data into a sequence of printable characters. The main idea is to increase the bit length of the block mapping of the input message M from the current 6 or 8 bits to 13 bits, and to use the set of double characters formed from 91 printable ASCII characters as the image set of the transformation. The following describes the Base91 coding designed for the present invention (also denoted as Radix-91 coding; Base91 and Radix-91 are two conventional English names for “base number 91”).
Base91 coding divides the input message M into 13-bit blocks, each of which is used as the variable of a mapping. The mapping is denoted by
    Base91[ ]: X→Y
wherein the variable (original image) set X includes all 8192 13-bit symbols (denoted as the integers 0, 1, ..., 8191) and the symbols φn (n=1, ..., 12), with φ1=8192, ..., φ12=8203, where φn denotes that n bits at the specified side of the last block are filling data; the total number of elements in the original image set is therefore 8204. The image set Y is a subset of the direct product R91×R91, wherein R91 denotes the set of 91 characters obtained from the set of 95 printable ASCII characters by excluding “-”, “=”, “.” and the space character. The direct product R91×R91 has 8281 elements.
Base91 is defined as an arbitrarily selected injective mapping from X into the direct product R91×R91. The selection of any particular injective mapping as Base91 has no effect on the present invention. For convenience of implementation, let R91_CH[91] be the array that contains all R91 characters arranged in ASCII order; the present invention preferably selects the following mapping:
    Base91[x] = (ch1, ch2) = (R91_CH[x/91], R91_CH[x%91])    (1)
wherein x ∈ X, ch1, ch2 ∈ R91, and the symbols “/” and “%” are the operators used in the C language, representing integer division and modulo (remainder) respectively.
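Mapping (1) can be sketched as follows in Python, where “//” plays the role of the C integer division “/”. The name R91_CH follows the text; the function name base91_pair is chosen here for illustration only.

```python
# The 95 printable ASCII characters (codes 32..126) minus space, "-", "." and "=",
# kept in ASCII order, give the 91 characters of R91.
R91_CH = [chr(c) for c in range(32, 127) if chr(c) not in " -.="]

def base91_pair(x: int) -> tuple:
    """Return (ch1, ch2) for x in X, i.e. 0..8191 or one of phi_1..phi_12 (8192..8203)."""
    if not 0 <= x <= 8203:
        raise ValueError("x must be an element of X (0..8203)")
    return R91_CH[x // 91], R91_CH[x % 91]
```

Because 8203 = 90·91 + 13, the quotient x/91 never exceeds 90, so both indices stay within the 91-element array, and distinct values of x always give distinct character pairs.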
Dividing the input message M into 13-bit blocks may leave a last block that is shorter than 13 bits. For such a block, n filling bits are added at the specified side to make it a complete block for mapping; a data block φn (n=1, ..., 12) is then appended as further input to the mapping, so that the decoder can determine how many filling bits must be deleted. When needed, the double character “==” may be used as a “terminating symbol” of the output character string. Hence at most 92 different printable ASCII characters can appear in the output of Base91 coding.
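The blocking and filling rules can be put together in the following Python sketch of an encoder and decoder. It is an illustration under stated assumptions, not the definitive implementation: the filling bits are assumed to be zeros appended at the right (low-order) side of the last block, the preferred mapping (1) is used, and the optional “==” terminating symbol is omitted. The function names are hypothetical.

```python
R91_CH = [chr(c) for c in range(32, 127) if chr(c) not in " -.="]
R91_INDEX = {ch: i for i, ch in enumerate(R91_CH)}

def base91_encode(data: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in data)
    n = (-len(bits)) % 13                 # number of filling bits needed
    bits += "0" * n                       # assumption: zero bits added on the right
    values = [int(bits[i:i + 13], 2) for i in range(0, len(bits), 13)]
    if n:
        values.append(8191 + n)           # phi_n = 8191 + n records the filling length
    return "".join(R91_CH[x // 91] + R91_CH[x % 91] for x in values)

def base91_decode(text: str) -> bytes:
    values = [R91_INDEX[text[i]] * 91 + R91_INDEX[text[i + 1]]
              for i in range(0, len(text), 2)]
    n = 0
    if values and values[-1] >= 8192:     # trailing phi_n, if present
        n = values.pop() - 8191
    bits = "".join(f"{x:013b}" for x in values)
    if n:
        bits = bits[:-n]                  # delete the filling bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

A 13-byte input (104 bits) splits into exactly eight blocks and yields 16 characters with no filling, while a 1-byte input needs 5 filling bits and yields 4 characters: the image of the filled block followed by the image of φ5.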
According to the coding rules of Base91 coding described above, the extra output data, consisting of the image of the filled block, the image of the symbol φn denoting the filling, and the “terminating symbol”, do not exceed 6 characters. Therefore, as the number of bits or bytes of the input message M increases, the average coding efficiency of the Base91 coding designed in the present invention approaches 13/16 = 81.25%, and its data expansion rate approaches 16/13 ≈ 123% (the coding efficiency of current Base64 coding is 75% and its data expansion rate is 133%).
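The way the efficiency approaches 81.25% can be checked with a small calculation. The following Python sketch counts output characters under the rules stated above; the function names are chosen here for illustration, and each output character is counted as one 8-bit byte.

```python
import math

def base91_output_chars(n_bytes: int, terminator: bool = False) -> int:
    """Number of output characters for an input of n_bytes bytes."""
    bits = 8 * n_bytes
    chars = 2 * math.ceil(bits / 13)      # two characters per 13-bit block
    if bits % 13:
        chars += 2                        # image of phi_n after a filled last block
    if terminator:
        chars += 2                        # optional "==" terminating symbol
    return chars

def coding_efficiency(n_bytes: int, terminator: bool = False) -> float:
    """Input bytes divided by output characters."""
    return n_bytes / base91_output_chars(n_bytes, terminator)
```

For a 13-byte input the efficiency is exactly 13/16; for short inputs the fixed overhead dominates (a single byte costs 4 characters), and for long inputs the overhead becomes negligible.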
Compared with Base64 coding and QP coding, the distinguishing feature of the present invention is that the number of variable bits of the block mapping of the transformation exceeds 6 or 8 and is not a multiple of 6 or 8; it is the specially selected number 13. The design features of the three kinds of coding transformation are shown in Table 1.
Compared with current Base64 coding and QP coding, the present invention increases the coding efficiency from 33.333% (QP coding, worst case) or 75% (Base64 coding) to 81.25%. When used in transferring information, the present invention can reduce channel occupation time and save transmission cost; when arbitrary bit-string data are stored in printable character mode, the present invention can save storage space and cost. The comparison of the transformation performance is shown in Table 2.
TABLE 1

Design features                      QP coding (MSB of       Base64 coding     Base91 coding
                                     input byte is 1)
-------------------------------------------------------------------------------------------
Number of basic variable bits        8                       6                 13
Number of bits occupied by           24                      8                 16
  image element
Characteristic of message block      1 byte / 3 bytes        image is          image is
  or image                                                   single byte       double bytes
Number of different output           17                      65                91 or 92
  characters
Number of elements of variable set   256                     64                2^13 + 12
TABLE 2

Performance features                 QP coding (MSB of       Base64 coding     Base91 coding
                                     input byte is 1)
-------------------------------------------------------------------------------------------
Coding efficiency                    33.333%                 75%               81.25%
Data expansion rate                  300%                    133%              123%
Time required for E-mail             225                     100               92.3
  transmission of coded data with    100                     44.44             41.03
  equal amount of messages
  (first line: Base64 = 100;
  second line: QP = 100)
Storage space for coded data of      225                     100               92.3
  equal amount of messages in
  printable character mode
  (Base64 = 100)
The (equal amount of) “messages” in Table 2 denotes the input of the coding transformation. The data in the third row (“time required for E-mail transmission of coded data with equal amount of messages”) are the results of calculation from the coding method alone, without considering the other time overhead required for processing E-mails during an actual network transmission.
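The relative figures in Table 2 follow directly from the data expansion rates. The short Python calculation below reproduces them, assuming that transmission time and storage space are proportional to the number of output characters; the variable and function names are chosen here for illustration.

```python
# Output characters per input byte for each coding (QP: worst case)
expansion = {"QP": 3.0, "Base64": 8 / 6, "Base91": 16 / 13}

def relative_to(reference: str, digits: int) -> dict:
    """Scale all expansion rates so that the reference coding equals 100."""
    return {name: round(100 * rate / expansion[reference], digits)
            for name, rate in expansion.items()}

versus_base64 = relative_to("Base64", 1)   # first line of the third row, and the fourth row
versus_qp = relative_to("QP", 2)           # second line of the third row
```

With Base64 as the reference, QP costs 225 and Base91 costs 92.3; with worst-case QP as the reference, Base64 costs 44.44 and Base91 costs 41.03.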