One of the fastest growing applications on the Internet is the world-wide web (WWW). The WWW is a collection of networked computers which exchange pages of hyper-text using the TCP/IP protocol. These pages may contain combinations of text, images and sounds, each of which may be either dynamic or static. Hyper-text is also called hyper-media or hyper-links. In addition, these pages may provide various methods of data input, for example, fill-in forms. In the context of the WWW, the pages are also called documents. The computers may be roughly divided into two main classes, clients and servers. The pages are usually downloaded from the servers by a client, using a specialized program called a “browser”. In some cases, the client enters data onto a page and transmits this data to a server. This data is usually used to find new pages for the client to download. As an alternative to storing pages on a server, it is becoming common practice to generate WWW pages on-the-fly at the server, using special programs.
There exist several additional classes of computers, including search engines, which provide a list of pages on servers which relate to a particular search; proxy servers, which broker communications between clients and servers, for example, by locally storing frequently read pages; and gateways, which connect whole networks to the Internet. A rapidly growing subset of the “client” class of computer is the network computer, a specialized computer designed specifically for connection to the Internet. Also included in this sub-class are Internet telephones and Internet TVs. None of these are general purpose computers, and all of them have their Internet support hard-wired rather than programmed in software.
One of the greatest obstacles to the continued expansion of the WWW is the multi-lingual aspect of the data transmitted, which is compounded by language limitations of users. Currently, most of the pages in the WWW are written in English and most of the browsers and the servers are designed mainly for use with the English language. This situation is equivalent to having a telephone system which can only transmit words in English and a TV system which can only transmit programs in English.
Multi-lingual computer applications are known, for example multi-lingual word processors and even multi-lingual operating systems. However, unlike the Internet, in a computer application the system developer enforces a single standard of language representation and handling. In the Internet, there is no single system developer and it is not possible to enforce a single standard worldwide. Furthermore, there may be multiple standards in a single country. For example, in Japan there are three common character code set encodings for the Japanese language; in Israel, there are several common character sets and three different standards for display and input of textual information. There also exist many variants of the display standards in Israel. It should be appreciated that for many aspects of multi-lingual language support there is no common denominator between the different standards.
The Internet publication “The Multilingual World Wide Web”, written by Gavin Nicol in November 1994, and currently found at the URL: “http://www.sil.org:80/sgml/nicol-multwww.html”, describes four main failure modes of multi-lingual computer applications and discusses their relevance to the WWW. The first failure mode is related to data representation, i.e., how textual data is represented and how individual characters are encoded. As noted above, there are three such encoding standards in Japan and several in Israel. Further, the same character code may be used for different glyphs depending on the language and on the character set.
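The data representation failure can be illustrated with a short Python sketch. The choice of byte value and of the specific character sets (ISO-8859-8, Windows-1251, Latin-1, Shift_JIS, EUC-JP and ISO-2022-JP) is an assumption made for illustration; the text above does not name the encodings it refers to.

```python
# One byte value, three different glyphs, depending on the assumed character set.
raw = bytes([0xE0])

as_hebrew = raw.decode("iso8859_8")   # Hebrew letter alef (U+05D0)
as_cyrillic = raw.decode("cp1251")    # Cyrillic small letter a (U+0430)
as_latin = raw.decode("latin_1")      # Latin small letter a with grave (U+00E0)

# One Japanese string, three different byte sequences.
# These three encodings are an illustrative assumption, not taken from the text.
text = "\u65e5\u672c\u8a9e"  # the word for "Japanese language"
encoded = {name: text.encode(name) for name in ("shift_jis", "euc_jp", "iso2022_jp")}

for name, data in encoded.items():
    print(name, data.hex())
```

A receiver that guesses the wrong character set in either case decodes the bytes without any error being raised, but shows the wrong glyphs.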
The second failure mode is related to data manipulation, where a given program cannot manipulate multi-lingual data. Some browsers do not support fonts which require more than 8 bits for encoding. Unicode, for example, requires 16 bits. None of the leading browsers are designed to support variable width (in bits) character codes.
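As a hedged illustration of the data manipulation failure, the following Python sketch shows how a program that assumes one byte per character can damage text in a variable-width encoding. Shift_JIS and the helper names `truncate_bytes` and `truncate_chars` are assumptions chosen for the example.

```python
def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Naive truncation that assumes every character occupies one byte."""
    return data[:limit]


def truncate_chars(data: bytes, limit: int, encoding: str) -> bytes:
    """Truncation that decodes first, so characters are never split."""
    return data.decode(encoding)[:limit].encode(encoding)


# One ASCII letter (1 byte) followed by one hiragana letter (2 bytes).
data = "A\u3042".encode("shift_jis")

broken = truncate_bytes(data, 2)             # cuts the second character in half
intact = truncate_chars(data, 2, "shift_jis")

try:
    broken.decode("shift_jis")
    split_detected = False
except UnicodeDecodeError:
    split_detected = True                    # the half character cannot be decoded
```

The byte-oriented version produces data that can no longer be decoded, which is the kind of breakage a browser limited to 8-bit codes would exhibit.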
The third failure mode is data display. It should be noted that in many languages, such as Arabic, the glyph form of a letter is dependent on the surrounding letters. This requires various display algorithms. In addition, the number of languages and fonts in the world is much greater than the number usually stored in a client computer, especially if it is a specialized network computer. Also, when using some browsers it is not possible to display more than one language at a time (in addition to English).
The fourth failure mode is related to data input. One issue is keyboard mapping: assuming that a browser supports the font of the language used by the server, how should the browser map keystrokes to the individual glyphs? Many languages, such as Russian, require more than the standard 26 letters of English. Another issue is support for bi-directional data input. Some languages, for example, Hebrew and Arabic, are written from right to left (RTL) rather than from left to right (LTR), as English is. Other, oriental, languages are written in a vertical orientation.
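The keyboard mapping issue can be sketched as a simple lookup table from keystrokes to characters. The partial table below follows the common Hebrew keyboard layout and is given only as an illustration; the names `LATIN_KEY_TO_HEBREW` and `map_keystrokes` are hypothetical.

```python
# Partial, illustrative key-to-letter table (common Hebrew layout assumed).
LATIN_KEY_TO_HEBREW = {
    "t": "\u05d0",  # alef
    "a": "\u05e9",  # shin
    "k": "\u05dc",  # lamed
    "u": "\u05d5",  # vav
    "o": "\u05dd",  # final mem
}


def map_keystrokes(keys: str, table: dict) -> str:
    """Map each keystroke through the table; unmapped keys pass through unchanged."""
    return "".join(table.get(key, key) for key in keys)


# Typing the Latin keys a, k, u, o yields the four letters of the word "shalom",
# stored in logical (typing) order.
typed = map_keystrokes("akuo", LATIN_KEY_TO_HEBREW)
```

A real mapping would need a complete table for each supported layout, which is part of why input support is hard to provide on a minimal client.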
There are several problems unique to bi-directional languages. Even when the language is written RTL, numbers are (usually, but not in all “standards”) written LTR. In addition, the text may be stored in a “logical” manner, where the first stored letter is usually the rightmost letter. Alternatively, the text may be stored in a “visual” manner, where the first stored letter is the leftmost letter, which in a multi-line text is located in the middle of the text. Thus, visually stored data is displayed LTR (with an appropriate font), while logically stored data must be displayed on a letter-by-letter basis, with LTR letters displayed one way and RTL letters displayed another way. It is a common practice to mix visual and logical representations in a single WWW page. This is particularly true for input. The input is most conveniently made using a logical representation, even though the data may be stored using a visual representation.
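The difference between logical and visual storage can be sketched for a single RTL line as follows. This is a deliberately simplified conversion that only handles RTL letters and runs of ASCII digits; it is not a full bi-directional display algorithm, and the function name `logical_to_visual` is hypothetical.

```python
import re


def logical_to_visual(line: str) -> str:
    """Convert one logically stored RTL line to visual (leftmost-first) order.

    Simplifying assumptions: the line contains only RTL letters, spaces and
    runs of ASCII digits, and digit runs are always written LTR.
    """
    reversed_line = line[::-1]  # the first stored letter becomes the rightmost
    # Reversing the line also reversed the digits, so restore each digit run.
    return re.sub(r"[0-9]+", lambda m: m.group(0)[::-1], reversed_line)


# Logical order: shin, nun, tav, space, then the number 1999.
logical = "\u05e9\u05e0\u05ea 1999"
visual = logical_to_visual(logical)
```

In the visual form the number comes first in storage and keeps its LTR digit order, while the three letters appear in reverse storage order.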
The above problems are compounded when viewed in the context of the WWW. One example of such a problem relates to search engines. Search engines automatically assimilate the contents of many WWW pages and allow a client to search these pages using various methods. If a page is stored using a visual representation, a search using keywords entered using a logical representation will not find the page. Of course, if the character set encoding is different, the page will not be found either. Another example, also relating to search engines, arises in languages where there is more than one legal way to spell a word. This is common in various dialects of English, but in Thai, there is a lexical equivalence between various orderings of certain three-letter groups. Since search engines are inherently global, enforcing a single standard is practically impossible.
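The two search failures described above can be reproduced with a byte-level substring search. The sketch assumes a single Hebrew word with no digits, so that visual order is a plain reversal, and uses ISO-8859-8 and code page 862 as example encodings; the helper `normalize` is hypothetical.

```python
word = "\u05e9\u05dc\u05d5\u05dd"  # a four-letter Hebrew word in logical order

query = word.encode("iso8859_8")                 # keyword typed in logical order
page_visual = word[::-1].encode("iso8859_8")     # same word stored visually
page_other_charset = word.encode("cp862")        # same word, different charset

found_in_visual = query in page_visual               # False: letter order differs
found_in_other_charset = query in page_other_charset # False: byte codes differ


def normalize(data: bytes, encoding: str, visual: bool) -> str:
    """Decode to a common form: a decoded string in logical order."""
    text = data.decode(encoding)
    return text[::-1] if visual else text


# After normalization, both pages match the logically entered keyword.
match_visual = word in normalize(page_visual, "iso8859_8", visual=True)
match_other = word in normalize(page_other_charset, "cp862", visual=False)
```

Normalization only works here because the encoding and storage order of each page are known in advance, which a global search engine generally cannot assume.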
Another example of a compound problem is the use of multiple standards and/or languages in a single WWW page. Another compound problem is translating between units of measurements and ways of writing dates and times. For example, “1/6/1999” represents Jan. 6, 1999 in the U.S. and Jun. 1, 1999 in Europe.
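The date ambiguity can be shown directly: the same string parses to two different dates depending on the assumed convention. This is a minimal Python sketch using the standard `datetime` module.

```python
from datetime import datetime

raw = "1/6/1999"

# U.S. convention: month/day/year.
us_date = datetime.strptime(raw, "%m/%d/%Y").date()

# European convention: day/month/year.
european_date = datetime.strptime(raw, "%d/%m/%Y").date()

print(us_date.isoformat())        # 1999-01-06
print(european_date.isoformat())  # 1999-06-01
```

Nothing in the string itself indicates which reading is intended, so the convention must be known from the context of the page.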
To make matters worse, even the standard language of WWW pages, HTML (HyperText Markup Language), is not uniform around the world.
As a direct result of these problems, the “global village” has not yet arrived. One pointed example can be seen in Israel. At the time of this writing, Israel is one of the world industrial leaders in most Internet applications. However, the penetration of the Internet into the public sector is substantially retarded as compared to the U.S., even though a higher percentage of households in Israel own a computer with a modem than in the U.S.
An obvious solution would be to adapt the clients and servers in the Internet so that they support multiple languages. In particular, automatic WWW page generators would also have to be modified. In addition, such adaptation would probably require modifications to development environments. The amount of work required for this type of adaptation is enormous, since every existing piece of browsing software and/or hardware would have to be adapted, a single standard would have to be enforced and all new applications would be limited by having to support a great number of languages and standards. This would be contrary to the concept of network computers: providing only the minimal hardware and software for surfing the WWW. For this reason, among others, most “multi-lingual” solutions support only one language in addition to English. In many cases, the languages supported are not the two which are desired.
In an attempt to solve the problem of multi-lingual searching, a web site has been constructed in which a client enters search terms in one language (Hebrew) and the search engine translates the words to English and applies the translated words to one of a limited number of existing search engines. The input is entered using Latin characters, which the web site maps to Hebrew characters after the input process is finished.
In yet another attempt, a web site has been created in which a JavaScript code segment is included in a WWW page, which displays a virtual keyboard in the desired language and which allows a user to click on keys. Each click adds a letter to a text object. The input from the user is directed only to the web site and for use of the programs therein and does not allow communication with other web sites.
Several solutions for the problem of display of multi-lingual pages have been suggested and/or tried. The Microsoft Internet Explorer version 3.01, Hebrew version, uses meta-tags in the WWW page to indicate whether a text object uses visual encoding or logical encoding. This information is used to drive display algorithms for the text object.
In the above referenced WWW publication and in “Summary of K12 activities in Japan”, by Kunio Goto and Masaya Nakayama, URL “http://k12jain.ad.jp/inet95.html”, a conversion server is suggested for use in Japan. The server is suggested for use as a proxy server and it replaces character codes from one standard set with codes from another set. This replacement is on a letter by letter basis.
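The kind of character code replacement performed by such a conversion server can be sketched in a few lines of Python. The function name `convert_charset` and the choice of Shift_JIS and EUC-JP are assumptions for illustration; the cited publications are not quoted here, and a real proxy would also have to detect the source encoding.

```python
def convert_charset(body: bytes, source: str, target: str) -> bytes:
    """Re-encode a page body from one character set to another.

    Each character is decoded from the source set and re-encoded in the
    target set; the text itself is not otherwise altered.
    """
    return body.decode(source).encode(target)


text = "\u65e5\u672c\u8a9e"                  # sample Japanese text
page_from_server = text.encode("shift_jis")  # as stored on the server
page_for_client = convert_charset(page_from_server, "shift_jis", "euc_jp")
```

Because the replacement is purely per character, it addresses the encoding mismatch but not display order, input, or any of the other failure modes discussed above.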
In one system, “Internet with an Accent”, published by Accent Software Ltd., Israel, multilingual pages are developed using a special development environment provided with the package. The pages are then stored in a special format. The client must either be provided with a special browser or with a plug-in to his existing browser. This package has the capability of automatically displaying pages in one of several languages based on the setting at the client. However, this package only works if both the client and the developer use the “Accent” package.
The common denominator to all of the above solutions is that they require changes to at least one, and usually at least two, of the client, the server and/or the development environment. As a direct result, the accessibility of advanced and newly developed features (for non-multi-lingual applications) is retarded. In addition, the above solutions are not easily portable to newly developed systems.