The present invention relates generally to the field of computer search engines, and more particularly to automating multilingual indexing.
An index is an indirect shortcut derived from, and pointing into, a greater volume of values, data, information, or knowledge. The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, a search engine would have to scan every document in a corpus, which may require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, and the considerable increase in the time required for an update to take place, are traded off against the time saved during information retrieval.
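The trade-off between an index and a sequential scan can be illustrated with a minimal sketch. It assumes a simple inverted index that maps each word to the set of documents containing it; the function names (`build_index`, `scan`, `lookup`) and the sample corpus are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: lowercased word -> set of document ids.

    This is the up-front cost (time and storage) paid at indexing time.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def scan(docs, word):
    """Sequential scan: examine every word of every document at query time."""
    word = word.lower()
    return {doc_id for doc_id, text in docs.items()
            if word in text.lower().split()}


def lookup(index, word):
    """Indexed lookup: a single dictionary access at query time."""
    return set(index.get(word.lower(), set()))


# Hypothetical three-document corpus, for illustration only.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "a quick dog"}
index = build_index(docs)
```

Both `scan(docs, "quick")` and `lookup(index, "quick")` return the same set of document ids, but the scan touches every word in the corpus on each query, whereas the lookup touches only one index entry. The index must be rebuilt or updated whenever a document changes, which is the update cost noted above.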
If a search engine supports multiple languages, a common initial step during tokenization is to identify each document's language. A search engine uses the identified language for morphological analysis of search query text. Morphological analysis, as used herein, refers to the analysis of the structure of a given language's morphemes and other linguistic units, such as base forms, root words, affixes, parts of speech, intonations and stresses, or implied context, among others. A base form, as used herein, can refer to the primary lexical unit of a word family. For example, the same word written with Latin characters can have different base forms, as well as different meanings, in two different European languages. The rules for morphological analysis can differ from language to language. Based on morphological analysis performed in the detected language, the search engine can correctly find documents that contain words which do not appear in the user's query but have the same base form as the words in the user's query.
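The role of the detected language in base-form matching can be sketched as follows. The sketch assumes small hand-written lookup tables in place of real morphological rules; the table `BASE_FORMS`, the language codes, and the functions `base_form` and `matches` are hypothetical and serve only to illustrate the idea.

```python
# Toy base-form tables keyed by language code (assumed for illustration).
# A real analyzer would apply language-specific morphological rules.
BASE_FORMS = {
    "en": {"was": "be", "is": "be", "hats": "hat", "hat": "hat"},
    "de": {"hat": "haben", "hatte": "haben", "was": "was"},
}


def base_form(word, lang):
    """Return the base form of a word in the given language.

    Words missing from the toy table are returned unchanged.
    """
    word = word.lower()
    return BASE_FORMS.get(lang, {}).get(word, word)


def matches(query_word, doc_word, lang):
    """Two words match if they share a base form in the given language."""
    return base_form(query_word, lang) == base_form(doc_word, lang)
```

In this sketch, the word "hat" maps to the base form "hat" when treated as English but to "haben" when treated as German, so a German query for "hatte" matches a document containing "hat" only if the language has been identified as German.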
Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing.
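One simple heuristic for language recognition is to count how many of a document's words are common function words of each candidate language. The following minimal sketch assumes this heuristic and a very small, hand-picked word list per language; the names `STOPWORDS` and `recognize_language` are hypothetical, and practical systems typically rely on larger statistical models.

```python
# Small, hand-picked function-word lists (assumed for illustration only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}


def recognize_language(text):
    """Return the language code whose function words occur most often.

    Returns None when no listed function word appears in the text.
    """
    words = text.lower().split()
    scores = {lang: sum(word in stopwords for word in words)
              for lang, stopwords in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For instance, `recognize_language("der Hund ist nicht das Problem")` yields `"de"` under these word lists, because four of the six words are listed German function words and none are listed for the other languages. Short or mixed-language texts can easily defeat such a heuristic.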