The invention relates to the field of information processing, and more particularly to the matching of candidate character sets to the intended language of an electronic message containing a plurality of character sets.
With the use of the Internet, email and related electronic services, communications software has been increasingly called upon to handle data in a variety of formats. While the barriers to simple communications have been removed from many hardware implementations, the problem of operating system or application software being unable to display text in different languages remains.
For instance, a person browsing the World Wide Web may wish to input a search string in their native language. Some Web pages or search engines will simply accept that string in the form in which it was input, but not process the spelling, syntax or character set in native form. The search engine then performs a search as though the search were in English, usually resulting in no hits. Other Web pages may allow a user to manually specify the desired language for browsing and searching. There is a need for more robust and more highly automated language handling for general searching, messaging and other communications purposes.
The invention overcoming these and other problems in the art relates to a system and method whereby electronic messages coded in a universal character set, such as Unicode, can be reliably and accurately transmitted over the Internet or other networks using conventional encoding methods. The encoded documents may be in MIME (Multipurpose Internet Mail Extensions) format.
An object of the invention is to provide an automatic and rigorous language evaluation facility by which the content of a message represented in a universal character set is tested against a bank of available language character sets, to determine which if any of those candidate languages can express the message.
Another object of the invention is to provide a system and method for evaluating character sets which identify, from the language bank, the languages capable of expressing the message, for presentation to a user or otherwise.
Another object of the invention is to provide a system and method for evaluating character sets which assign a rating to languages which can express a given message, to determine which of those candidate languages offers the best fit to express the message.
Another object of the invention is to provide a system and method for evaluating a character set which permit searching and reading of text expressions in their native character sets, improving the quality of search results.
The system and method of the invention accomplishing these and other objects employs a character table bank against which the ability of a number of character sets, representing different languages, to encode a given character is tested. When a message of unknown origin is presented to the system, its characters are parsed and tested against the character table bank to identify which of the pool of character sets (and hence languages) can express each character.
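The per-character test described above can be illustrated with a minimal sketch. It assumes that Python's built-in codecs stand in for the character table bank; the candidate list and the function names `charsets_for_char` and `per_character_table` are hypothetical and are not taken from the described implementation.

```python
# Hypothetical stand-in for the character table bank: a few Python codec
# names, each representing one candidate character set (hence language).
CANDIDATE_CHARSETS = ["ascii", "latin-1", "iso8859-7", "shift_jis", "koi8-r"]


def charsets_for_char(ch, candidates=CANDIDATE_CHARSETS):
    """Return the candidate character sets able to encode one character."""
    able = []
    for name in candidates:
        try:
            ch.encode(name)  # raises UnicodeEncodeError if not expressible
        except UnicodeEncodeError:
            continue
        able.append(name)
    return able


def per_character_table(message, candidates=CANDIDATE_CHARSETS):
    """Map each distinct character of the message to the sets that express it."""
    return {ch: charsets_for_char(ch, candidates) for ch in dict.fromkeys(message)}
```

Calling `per_character_table` on a message held as a Unicode string yields, for every distinct character, the subset of candidate character sets that can express it. A real character table bank would presumably be a precomputed lookup rather than a trial encoding, which is used here only for brevity.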
A character set which contains a match for every character of the message is likely to correspond to the native language of the original message. Tallies of matches to individual characters across all available character sets in the character table bank can also be made for the message as a whole. The invention has been implemented in, and will be described in one regard with respect to, the Lotus Notes™ environment, but it will be understood that the invention has universal application and can be used in any system that needs to receive and display information in multiple languages.
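The full-match test and the per-message tallies described above can be sketched as follows. This is an illustration only: Python codecs are assumed as a stand-in for the character table bank, and the names `can_encode`, `tally_matches`, and `full_matches` are hypothetical.

```python
# Hypothetical stand-in for the character table bank (Python codec names).
CANDIDATE_CHARSETS = ["ascii", "latin-1", "iso8859-7", "shift_jis", "koi8-r"]


def can_encode(ch, charset):
    """True if the given character set can express the character."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def tally_matches(message, candidates=CANDIDATE_CHARSETS):
    """Count, per character set, how many message characters it can express."""
    tally = {name: 0 for name in candidates}
    for ch in message:
        for name in candidates:
            if can_encode(ch, name):
                tally[name] += 1
    return tally


def full_matches(message, candidates=CANDIDATE_CHARSETS):
    """Character sets containing a match for every character of the message."""
    tally = tally_matches(message, candidates)
    return [name for name in candidates if tally[name] == len(message)]
```

A character set returned by `full_matches` can express the whole message, while the counts from `tally_matches` give one possible basis for rating partial fits when no set matches every character. How the ratings are actually computed is not specified in this passage, so the raw count is an assumption.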