With the increasing amount of information that is available to users via today's computer systems, efficient techniques for locating information of interest are becoming essential. To expedite the process of searching and retrieving relevant information, it is a common practice to create an index of the searchable information that is available from various sources. For instance, if a collection of documents is to be searched for information, the documents are first examined to identify terms of interest, and an index is created which associates each term with the document(s) in which it appears. Thereafter, when a user constructs a search query, the terms in that query are compared against the entries in the index, to locate the documents containing the requested terms.
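The term-to-document association described above is commonly called an inverted index. The following is a minimal sketch of the idea, not the mechanism of any particular system: the function names `build_index` and `search`, the whitespace-based splitting, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of ids of the documents in which it appears.

    `documents` is a dict of {doc_id: text}. Splitting on whitespace is a
    simplification chosen for this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain at least one query term."""
    hits = set()
    for term in query.lower().split():
        # .get() avoids creating empty entries for terms that were never indexed.
        hits |= index.get(term, set())
    return hits


# Hypothetical sample data.
docs = {
    1: "efficient search of documents",
    2: "index of terms",
    3: "search the index",
}
index = build_index(docs)
```

Here a query is answered by looking up each of its terms in the index, so the documents themselves are not rescanned at query time.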
Many search engines process the search results to calculate the relevance of each identified document to the query. For instance, a score can be calculated for each document, using a statistical technique that accounts for the number of query terms that are matched in the document, the frequency of each of those terms in the index, the frequency of each term in the document compared to the total number of terms in the document, and the like. Based upon these scores, the documents are displayed to the user in order of their relevance to the query. By means of such an approach, the query does not have to be a precisely constructed formula for finding only those documents which exactly match the terms of the query. Rather, it can be a list of words, or a natural language sentence.
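As an illustration of this kind of relevance calculation, the sketch below uses a TF-IDF-style score. This particular formula is an assumption chosen for the example, not the statistical technique of any specific search engine; the names `score` and `rank` and the smoothing constants are likewise hypothetical. It combines the factors named above: which query terms match, how common each term is across the indexed documents, and how frequent each term is relative to the document's length.

```python
import math
from collections import Counter


def score(query_terms, doc_terms, doc_freq, num_docs):
    """TF-IDF-style relevance of one document to a query (illustrative only)."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    result = 0.0
    for term in set(query_terms):
        if counts[term] == 0:
            continue  # query term not matched in this document
        # Frequency of the term relative to the document's total term count.
        tf = counts[term] / total
        # Terms that occur in fewer documents contribute more (smoothed).
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        result += tf * idf
    return result


def rank(query, documents):
    """Return ids of matching documents, most relevant first."""
    tokenized = {d: text.lower().split() for d, text in documents.items()}
    doc_freq = Counter()
    for terms in tokenized.values():
        doc_freq.update(set(terms))  # count each term once per document
    q = query.lower().split()
    scored = [(score(q, terms, doc_freq, len(tokenized)), d)
              for d, terms in tokenized.items()]
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for s, d in scored if s > 0]


# Hypothetical sample data.
sample = {1: "apple banana", 2: "apple apple cherry", 3: "durian"}
```

Because documents are ordered by score rather than filtered by an exact match, the query can be a plain list of words: a document matching only some of the terms still appears, ranked lower.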
Before a string of text from a document or other source of information can be indexed, it must be parsed into individual words. Preferably, the separated words are further processed to expedite the search and retrieval function. The process of separating a text string into individual words is known as tokenization. As a first step, the text is parsed into word tokens. A word token may or may not be a recognized word, i.e., a word which appears in a dictionary. After the word tokens have been identified, they are processed to eliminate those which do not serve as useful search terms.
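The two steps just described, parsing into word tokens and then discarding tokens that are not useful as search terms, can be sketched as follows. The regular expression, the lowercasing, and the short stop-word list are assumptions made for this example; a real system would choose its own token boundaries and elimination criteria.

```python
import re

# Illustrative stop-word list; hypothetical and far from complete.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def tokenize(text):
    """Split a text string into lowercase word tokens.

    A token is any run of letters, digits, or underscores, so it need not
    be a word found in a dictionary.
    """
    return re.findall(r"\w+", text.lower())


def useful_terms(text, stop_words=STOP_WORDS):
    """Tokenize `text`, then drop tokens that make poor search terms."""
    return [token for token in tokenize(text) if token not in stop_words]
```

For example, `tokenize("The index of the XyZ-42 documents.")` yields `the`, `index`, `of`, `the`, `xyz`, `42`, `documents`, and `useful_terms` on the same string keeps only `index`, `xyz`, `42`, and `documents`. Note that `xyz` is retained even though it is not a recognized word.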
A further process that can be carried out prior to indexing is known as “stemming”. In essence, stemming is the reduction of words to their grammatical stems. This process serves two primary purposes. First, it helps to reduce the size of the index, since all forms of a word are reduced to a single stem, and therefore require only one entry in the index. Second, retrieval is improved, since a query which uses one form of a word will find documents containing all of the different forms.
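To make both purposes concrete, the following is a deliberately crude suffix-stripping sketch. It is not a linguistically complete stemmer and is not taken from any particular system: the suffix list, the minimum stem length of three characters, and the name `stem` are assumptions for illustration, and the rules apply to English only.

```python
# Hypothetical, English-only suffix rules, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["search", "searches", "searched", "searching"]
stems = {stem(form) for form in forms}  # all four forms collapse to one stem
```

All four forms reduce to the single stem `search`, so the index needs one entry instead of four, and a query containing "searching" will match a document that contains only "searched", provided the same function is applied to both the indexed text and the query.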
Ideally, the stemming process is applied to all words that take different forms, and accounts for every possible form of each word. In this type of approach, stemming is highly language dependent. In the past, therefore, information search and retrieval systems which employed stemming were designed for a specific language. In particular, the rules that were used to reduce each word to its grammatical stem would typically apply to only one language, and could not be employed in connection with other languages. Consequently, a different search and retrieval mechanism had to be provided for each different language that might be encountered in the documents to be searched.
With the widespread accessibility of various information sources that is provided by today's computing environments, particularly when coupled with worldwide telecommunications facilities, such as the Internet, any given source of information might contain documents in multiple different languages. Furthermore, it is not uncommon for a single document to contain text in more than one language. In these types of environments, it would be impractical to have to identify the language of a document, and then employ a different search and retrieval system for each different language that might be encountered. It is an objective of the present invention, therefore, to provide a mechanism for indexing and searching textual content which is generic to a plurality of different languages.