The invention relates generally to the field of art of database systems and more particularly to systems for generating, managing and operating databases in a multi-script environment.
Text is stored in computers in a wide variety of encodings. For instance, one of the earliest encodings is ASCII (American Standard Code for Information Interchange), in which alphanumeric characters are represented by a 7-bit numeric value. Thus, as illustrated by the ASCII encoding in Table 1 below, the character ‘A’ is represented by the 7-bit representation ‘100 0001’ (or hexadecimal value 0x41, as illustrated in the table).
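The relationship between a character, its 7-bit pattern, and its hexadecimal value can be checked with a minimal Python sketch. The helper name ascii_bits is hypothetical and is introduced here only for illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern of a single character as a string of 0s and 1s."""
    code = ch.encode("ascii")[0]   # raises UnicodeEncodeError for non-ASCII input
    return format(code, "07b")     # zero-padded to 7 binary digits


code_a = "A".encode("ascii")[0]
print(ascii_bits("A"), format(code_a, "#04x"))
```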
Another character encoding is the 8-bit EBCDIC (Extended Binary Coded Decimal Interchange Code) utilized in traditional IBM Corporation mainframe computers, in which alphanumeric characters are represented by an 8-bit numeric value. Thus, with reference to the EBCDIC encoding illustrated in Table 2 below, the character ‘A’ is represented by the 8-bit representation ‘1001 0001’ (or hexadecimal value 0x91, as illustrated in the table). Notice that the EBCDIC alphanumeric encodings, illustrated in Table 2, are different from the ASCII encodings illustrated in Table 1.
Historically, as illustrated by the ASCII and EBCDIC encodings, computer representations for text were entirely focused on English. Over time, new encodings were designed that allowed the representation of text in many languages with many different character sets. In this document, we use the term ‘script’ to refer to the representation of one or more languages in terms of a set of written character forms. An ‘encoding’ is a binary representation that allows text in one or more scripts to be encoded in the memory of a computer.
The many script encodings have grown through a disorganized process and do not form a coherent system. In particular, they are not collectively self-descriptive: it is not possible to look at an arbitrary stream of data and determine what, if any, text encoding is in use. For example, suppose a computer receives a data value of 0x51 representing a text character, and ASCII and EBCDIC are both possible encodings. If the computer does not know which script encoding is being used, it cannot determine whether the data value refers to the ASCII character ‘Q’ or the EBCDIC character ‘a’. The data value itself does not convey the script encoding utilized in creating it. In addition, most of the encodings do not allow text in multiple scripts to be included in the same logical document.
The existing art includes several standards that attempt to bring some order into this chaos. The ISO-8859 family of standards provides a series of one-byte-per-character encodings for European languages. Yet these encodings are not self-descriptive. The ISO-2022 standard attempts to allow for a complete, self-descriptive encoding that can be extended to cover all languages. However, this standard is so complex and unwieldy that it is never used in a full, self-descriptive, multi-script form. There are many other standards that provide encodings using one, two, or more bytes of data per character to represent text.
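As a small illustration of why one-byte-per-character encodings are not self-descriptive, the following Python sketch decodes one and the same byte under three parts of the ISO-8859 family. The byte value 0xE4 and the choice of parts 1, 5, and 7 are assumptions made purely for illustration and do not come from the text above; decode_all is a hypothetical helper name.

```python
# One byte, three readings: the byte alone does not reveal its encoding.
RAW = bytes([0xE4])  # illustrative value only

# Python codec names for ISO-8859-1 (Latin), ISO-8859-5 (Cyrillic), ISO-8859-7 (Greek).
CANDIDATES = ["latin_1", "iso8859_5", "iso8859_7"]


def decode_all(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode the same bytes under every candidate encoding."""
    return {name: raw.decode(name) for name in encodings}


readings = decode_all(RAW)
for name, text in readings.items():
    print(f"{name}: {text!r} (U+{ord(text):04X})")
```

Each candidate yields a different, equally valid character, so a receiver that is not told the encoding has no basis for choosing among them.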
Unicode, standardized by ISO as ISO/IEC 10646-1:1993, provides a representation that can store most of the commonly used languages in a single encoding. Unicode is self-descriptive, in that text encoded in Unicode also carries information indicative of the script. Thus, Unicode overcomes many of the shortcomings of preexisting script encodings. But while Unicode is becoming widespread, there are serious difficulties in simultaneously accommodating currently existing non-Unicode information in many applications. In addition, many of the commonly installed computer systems do not handle Unicode at all and can only handle a single encoding at a time. These legacy systems will be with us for a long time.
The problem of multiple encodings is traditionally addressed in software applications by building multiple versions of software systems, one per encoding. One version may be adapted to handle English-based scripts while another version may handle Chinese-based scripts. Each user interacts with the software using the encoding native to that user's particular computer system.
This model fails to cope with the needs of international business, particularly on the World Wide Web. In the emerging international marketplace, businesses need to present an interface to users in many languages and accept responses from them. Furthermore, as the marketplace becomes more global, the mechanism for information exchange between these divergent markets (and thus divergent computer systems) must be able to handle a wide variety of scripts. Until and unless the majority of users use Unicode-enabled systems and software, these business interfaces must cope with the existing inventory of text encodings.
This problem is particularly acute on the World Wide Web, where the standards for information exchange and presentation have well-known inadequacies in the area of character encodings. For instance, pages of information sent to users can be marked with an encoding so that the text may be correctly displayed. However, responses from the user to the business server are not marked with any encoding at all. This impairs the development of truly worldwide software applications.
As a result, the existing art offers no good means of taking user responses in an arbitrary text encoding and processing them. This problem is particularly acute for database lookups. While existing database management systems can store text in all the many national encodings (and, in some cases, Unicode), they provide no assistance for looking up a string in an unknown encoding.
Any solution to the problem of processing user responses in arbitrary encodings has to be compatible with existing databases.
It is a goal of the present invention to provide a system and method for handling multiple script encodings.
It is a further goal of the present invention to facilitate the use of multiple script encodings in software and database applications.
It is an additional goal of the present invention to provide a system and method for augmenting existing databases to handle multiple script encodings.
It is an additional object of the invention to provide a system and method to manage an Internet Protocol Domain Name Service utilizing a variety of script encodings.
It is a further object of the invention to modify existing Internet Protocol Domain Name Service databases to manage both legacy script encodings and additional non-legacy script encodings.
The present invention is directed to a database system and method for storing information referenced by a name encoded according to at least two scripts. The database system includes a first database containing first information pertaining to the name retrieved from the first database via a first key. The first key contains the name encoded in a first script. The database system further includes a second database containing second information pertaining to the name retrieved from the second database via a second key. The second key contains at least the name encoded in a second script.
The first script encoding may be a national encoding, whereas the second script encoding may be a universal script encoding. The second information may include the name encoded in a third script. In addition, the second information may include the name encoded in a canonicalized form. The system may further be adapted so that the first database comprises a plurality of national encoding databases, each with a corresponding national encoding script and each indexed by the name encoded in the national script corresponding to that national encoding database. The present invention further includes methods for constructing the databases.
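The arrangement of databases summarized above can be pictured with a minimal Python sketch. It is not the claimed implementation: in-memory dictionaries stand in for the databases, UTF-8 stands in for the universal script encoding, and NFKC normalization plus case folding stands in for the canonicalized form. The class, method, and field names are hypothetical.

```python
import unicodedata


def canonicalize(name: str) -> str:
    # Assumption: NFKC + casefold is used here as a stand-in canonical form.
    return unicodedata.normalize("NFKC", name).casefold()


class MultiScriptStore:
    """One database per national encoding plus one universal database."""

    def __init__(self, national_encodings):
        # First database(s): keyed by the name encoded in a national encoding.
        self.national = {enc: {} for enc in national_encodings}
        # Second database: keyed by the name encoded in the universal encoding.
        self.universal = {}

    def add(self, name: str, info: str) -> None:
        self.universal[name.encode("utf-8")] = {
            "info": info,
            "canonical": canonicalize(name),
        }
        for enc, db in self.national.items():
            try:
                key = name.encode(enc)
            except UnicodeEncodeError:
                continue  # the name cannot be written in this national script
            db[key] = {"info": info}

    def lookup_national(self, enc: str, key: bytes):
        return self.national[enc].get(key)

    def lookup_universal(self, key: bytes):
        return self.universal.get(key)


store = MultiScriptStore(["latin_1", "iso8859_5"])
store.add("Müller", "customer record 1")
print(store.lookup_national("latin_1", "Müller".encode("latin_1")))
print(store.lookup_universal("Müller".encode("utf-8")))
```

In this sketch a name that cannot be expressed in a given national encoding is simply absent from that national database, while the universal database always holds an entry.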
The present invention is also directed to a domain name system for deriving host information pertaining to a host name. The domain name system includes a first database containing first host information pertaining to said host name, retrieved from the first database via a first key. The first key contains at least the host name encoded in a first script. The system further includes a second database containing second host information pertaining to the host name, retrieved from said second database via a second key. The second key contains at least said host name encoded in a second script.
The domain name system is further adapted so that the first script encoding is a national script encoding and the second script encoding is a universal script encoding. The first host information may further include second key information, and the second host information may include information selected from the set comprising internet protocol address, domain server and computer host specifications.
The domain name system may further contain a plurality of national encoding databases, each corresponding to one of a plurality of national script encodings, the key further containing a plurality of national encoding keys, each containing at least said host name encoded in the national script encoding corresponding to one database of said plurality of national encoding databases. The present invention further includes methods for constructing the databases.
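The domain name arrangement summarized above can likewise be sketched in Python. This is an illustration under stated assumptions, not the claimed method: dictionaries stand in for the databases, UTF-8 stands in for the universal script encoding, each national entry stores only the second key, and the lookup order in resolve is one possible strategy that does not come from the text. The names HostInfo and MultiScriptResolver, the host name, and the addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostInfo:
    ip_address: str
    domain_server: str


class MultiScriptResolver:
    def __init__(self):
        self.national_dbs = {}  # encoding -> {national key -> universal key}
        self.universal_db = {}  # universal key -> HostInfo

    def register(self, host_name: str, info: HostInfo, national_encodings) -> None:
        universal_key = host_name.encode("utf-8")
        self.universal_db[universal_key] = info
        for enc in national_encodings:
            try:
                national_key = host_name.encode(enc)
            except UnicodeEncodeError:
                continue  # host name not expressible in this national encoding
            # First host information holds the second key.
            self.national_dbs.setdefault(enc, {})[national_key] = universal_key

    def resolve(self, raw_name: bytes):
        """Resolve a host name received as bytes in an unknown encoding."""
        info = self.universal_db.get(raw_name)
        if info is not None:
            return info
        # Assumption: the first national database that matches wins.
        for db in self.national_dbs.values():
            universal_key = db.get(raw_name)
            if universal_key is not None:
                return self.universal_db[universal_key]
        return None


resolver = MultiScriptResolver()
host = "пример.example"
resolver.register(
    host,
    HostInfo(ip_address="192.0.2.1", domain_server="ns.example"),
    ["iso8859_5", "koi8_r", "latin_1"],
)
print(resolver.resolve(host.encode("koi8_r")))
```

Because the same bytes could in principle match keys in more than one national database, a production system would need a policy for such collisions; the sketch simply returns the first match.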