The World Wide Web (WWW) is a set of protocols that allow a user to download and upload pages of information between his computer and other computers, typically using a program called a browser. The usual mode of operation includes opening a browser, entering a URL (Uniform Resource Locator), and viewing the page fetched by the browser. The actual pages of information are located on physical host machines, each of which may be mapped to one or more domain names. Typically each domain is served by one host machine.
URL syntax is described in RFC 1630 (“Universal Resource Identifiers in WWW”). The URL syntax relies heavily on the domain name space, as defined in RFCs 1034 (“Domain Names - Concepts and Facilities”), 1035 (“Domain Names - Implementation and Specification”) and 883 (“Domain Names - Implementation and Specification”).
A network resource (host) is identified in the domain name space by a string containing one or more labels (each up to a maximum of 63 characters), separated by periods. The periods define the hierarchical structure of the domain name space. Although RFC 1034 permits the use of 8-bit binary encoding, it is suggested that applications use 7-bit ASCII for naming. Further, the suggested and currently implemented (de facto) naming scheme uses labels consisting only of alphanumeric characters from the Latin (ISO Latin 1) character set plus the hyphen character. A valid name must start with a letter and the rest of the name should contain only letters, digits or hyphens.
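The naming rules above can be sketched as a short check. This is a minimal illustration that applies only the rules stated in this paragraph (labels of at most 63 characters, a leading letter, then letters, digits or hyphens); the function names are hypothetical, and further restrictions found in the RFCs are deliberately left out.

```python
import re

MAX_LABEL_LENGTH = 63  # per-label limit stated in the text

# A label starts with a letter; the rest may be letters, digits or hyphens.
_LABEL_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_valid_label(label: str) -> bool:
    """Check one label against the de facto naming scheme described above."""
    return len(label) <= MAX_LABEL_LENGTH and _LABEL_RE.fullmatch(label) is not None


def is_valid_domain_name(name: str) -> bool:
    """Check a period-separated name: every label must be valid and non-empty."""
    return all(is_valid_label(label) for label in name.split("."))
```

Under these rules a name such as "www.ibm.com" passes, while a name with an empty label, an over-long label, or a label that starts with a digit or contains a non-ASCII letter is rejected.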
Thus, the naming conventions for domains (and consequently sites and URLs) are rather restricted. Typically, there is an attempt to identify a particular site with a particular site owner, so that the address is meaningful. For example, IBM has a web site with the address “http://www.ibm.com” (“.com” indicates commercial), Microsoft has the address “http://www.microsoft.com” but Microsoft Network has the address of “http://www.msn.com”. The restrictions make it easy to create a one-to-one mapping between web addresses and a particular site. However, these addresses must be entered accurately. Any mistake will result in the site not being located.
In many countries, English is not a native tongue. Meaningful WWW addresses in such countries are typically created by transliterating the name of the site owner into Latin letters. Unfortunately, many languages do not have an accepted and widely known standard of transliteration. Thus, there may be several plausible transliterations for a single name, resulting in several possible meaningful addresses, only one of which is correct.
Another problem is that the current address naming scheme is not user friendly. First, in countries in which most people are not English speaking, the use of Latin letters and/or English spelling conventions may be a burden to many users, especially inexperienced users. In addition, in many cases there is no direct relationship between the name of the site owner and the address of his site. Guessing the address is typically not an option. Further, in countries where the name is transliterated, even if a meaningful address is created (such as for IBM, above), there is still no guarantee that a casual user will correctly transliterate that name from his native language. In many cases, the site addresses can be used as mnemonics, i.e., once the address is known, its content makes it easy to remember. However, it is often impossible to reconstruct the correct address from the name of the site owner.
For these and other reasons, search engines and WWW directories have been developed, in which a user enters a name and/or other information regarding the site owner and a WWW page containing a list of possible site addresses is generated and presented to the user. Some search engines allow the entry of non-Latin characters. In addition, various automated agents and SearchBots have been developed which serve as online search agents and which interface directly with the browser, for example, the WebTurbo software. In some browsers, an incorrectly entered name will automatically pull up a search page.
Some Web browsers allow a user to maintain a local list of preferred locations, which are stored and accessed by selection of a nickname and/or a description from a list, rather than by entering a complete URL. In some browsers, an incompletely typed URL may be automatically expanded by the addition of a standard prefix or suffix. Another helpful feature is automatic completion of URLs. If a URL has been previously used, entering the first few characters thereof will cause the entire URL to be suggested to the user.
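The two browser conveniences just described can be illustrated with a minimal sketch. The "http://www." prefix, the ".com" suffix, and both function names are assumptions chosen for illustration; actual browsers may apply different rules.

```python
DEFAULT_PREFIX = "http://www."  # assumed standard prefix
DEFAULT_SUFFIX = ".com"         # assumed standard suffix


def expand_url(typed: str) -> str:
    """Expand a bare word into a full URL; leave anything URL-like untouched."""
    if "://" in typed or "." in typed:
        return typed
    return f"{DEFAULT_PREFIX}{typed}{DEFAULT_SUFFIX}"


def suggest_completions(typed: str, history: list[str]) -> list[str]:
    """Return previously used URLs that begin with the characters typed so far."""
    return [url for url in history if url.startswith(typed)]
```

For example, with a history holding the IBM, Microsoft and MSN addresses mentioned earlier, typing "http://www.m" would suggest the latter two. Note that neither feature helps when the user does not already know, or cannot guess, the Latin-letter form of the address.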
The underlying addressing system in the Internet is based on numeric strings. However, in order to provide some measure of comfort, textual addresses, as described above, are used. The DNS (Domain Name System) is a distributed application that translates textual addresses into numeric addresses. If an address is incorrectly formatted or incorrectly entered, the DNS does not generate a proper numeric address. Rather, it returns an output that causes an error message to be generated at the requester. The different DNS servers update each other with new mappings of textual addresses to numeric addresses.
Many network systems supply aliasing support and/or “hosts” files that contain associations between numeric strings and textual strings. In some systems, for example Microsoft Windows 95 with Hebrew Support, it is possible to enter and use (on the network, not on an external DNS) a host name including non-Latin characters. It should be noted that host names are also limited; for example, they cannot contain spaces.
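A “hosts” file lookup of the kind described above can be sketched as follows. The sketch assumes the common layout in which each line holds a numeric address followed by one or more whitespace-separated names, with “#” starting a comment; the sample entries and the function name are hypothetical.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Build a name -> numeric address table from hosts-file text."""
    table: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()  # names are split on whitespace
        for name in names:
            table.setdefault(name.lower(), address)  # first entry wins
    return table


# Hypothetical sample entries, including one non-Latin host name.
SAMPLE_HOSTS = """
127.0.0.1   localhost
192.0.2.10  fileserver fs   # two names for one address
192.0.2.20  שרת
"""
```

Because names are separated by whitespace, a name containing a space would be read as two separate names, which is one way to see why host names cannot contain spaces.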
M. Duerst, in WWW document “http://www.w3.org/international/draft-duerst-dns-i18n-00.txt” (a working draft), suggests introducing a new zero-level domain to allow the use of arbitrary characters from the Universal Character Set (ISO 10646), also known as Unicode, in domain names. Duerst suggests an implementation in which software with an internationalized user interface, such as a web browser, will be responsible for conversions. The software would analyze the domain name, call the (DNS) resolver directly if the domain name conforms to the domain name syntax restrictions, and otherwise encode the name according to the specifications described in the document. Duerst also suggests providing a separate lookup service that programs will call if a domain name contains characters outside the allowed range. Francois Yergeau, in WWW document “http://www.alis.com:8085/~yergeau/url-00.html”, suggests an 8-bit encoding for Unicode, called UTF-8 (UCS Transformation Format 8), which preserves the full US-ASCII range, so that it is compatible with file systems, parsers and other software that rely on US-ASCII values but are transparent to other (8-bit) values.
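The property that makes UTF-8 compatible with US-ASCII software can be shown with a short sketch. It uses only Python's built-in UTF-8 codec; the function name and the sample names are illustrative assumptions, and the encoding proposed in the Duerst draft is not reproduced here.

```python
def utf8_profile(text: str) -> tuple[bytes, bytes]:
    """Split the UTF-8 encoding of text into US-ASCII bytes and other 8-bit bytes."""
    encoded = text.encode("utf-8")
    low = bytes(b for b in encoded if b < 0x80)    # unchanged US-ASCII values
    high = bytes(b for b in encoded if b >= 0x80)  # bytes used for non-ASCII characters
    return low, high


# A pure US-ASCII name encodes to exactly the same bytes in UTF-8.
assert "www.ibm.com".encode("utf-8") == "www.ibm.com".encode("ascii")

# A non-ASCII character is encoded only with bytes outside the US-ASCII range.
low, high = utf8_profile("www.münchen.de")
assert low == b"www.mnchen.de" and high == b"\xc3\xbc"
```

Software that inspects only US-ASCII values, for example to find the periods between labels, therefore sees the same separators as before, while the remaining bytes pass through untouched.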