This invention relates to the handling of language uncertainty in processing search queries and searches over a corpus including documents and other searchable resources, where the queries and resources can be expressed in any one of a number of different languages.
A search engine indexes documents and provides a means to search for documents whose contents are indexed by the search engine. Documents are written in many different languages; some documents have content in multiple languages. A variety of characters are used to express the words of these languages: the Latin alphabet (i.e., the 26 unaccented characters from A to Z, upper and lower case), diacritics (i.e., accented characters), ligatures (e.g., Æ, ß, Œ), Cyrillic characters and others.
Unfortunately, the ability and ease of producing these characters varies greatly from device to device. Both the authors of content and the users of search engines may be unable to conveniently produce the characters that they would prefer. Instead, users of such devices will often provide a character or character sequence that is a close substitute. For example, AE may be provided in lieu of Æ. Moreover, the conventions of such substitutions vary among languages and users. For example, some users who search for AE may prefer to see results including Æ as well.
One approach for addressing this issue in a search engine is to process the indexed content to remove accents and convert special characters into a standard set of characters. This approach removes information from the index, making it impossible to retrieve only specific accented instances of a word. This approach is also language-agnostic: it applies the same conversion to all content and is therefore insensitive to users whose expectations are shaped by the conventions of their particular language.
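By way of illustration only, the accent-removal approach described above can be sketched as follows. The function names `fold` and `build_index`, the `SPECIAL` substitution table, and the sample documents are hypothetical and are not taken from any particular search engine. The sketch assumes Unicode NFKD decomposition for accents plus a small hand-written table for ligatures that do not decompose.

```python
import unicodedata

# Hypothetical substitution table for characters that NFKD does not decompose.
SPECIAL = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "ss"}


def fold(text: str) -> str:
    """Expand ligatures and strip accents, leaving unaccented characters."""
    expanded = "".join(SPECIAL.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", expanded)
    # Drop the combining marks that carried the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs: dict) -> dict:
    """Map each folded, lower-cased word to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(fold(word).lower(), set()).add(doc_id)
    return index


# Illustrative documents: one accented word, one unaccented, one with a ligature.
docs = {1: "résumé", 2: "resume", 3: "Æther"}
index = build_index(docs)
```

In this sketch the accented and unaccented words share a single index key, so a query cannot retrieve document 1 without also retrieving document 2. The same folding is applied regardless of the language of the content or of the user, which is the language-agnostic behavior noted above.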