The present invention is directed to the indexing and searching of text in documents for information retrieval purposes, and more particularly to an indexing and searching system that is capable of handling text in any of a plurality of languages.
With the increasing amount of information that is available to users via today's computer systems, efficient techniques for locating information of interest are becoming essential. To expedite the process of searching and retrieving relevant information, it is a common practice to create an index of the searchable information that is available from various sources. For instance, if a collection of documents is to be searched for information, the documents are first examined to identify terms of interest, and an index is created which associates each term with the document(s) in which it appears. Thereafter, when a user constructs a search query, the terms in that query are compared against the entries in the index, to locate the documents containing the requested terms.
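The index described above can be illustrated with a minimal sketch. The function names `build_index` and `search`, the whitespace-based term splitting, and the sample documents are assumptions made for illustration only; they are not taken from the specification.

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive term identification: lower-case and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing at least one query term."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits

# Hypothetical sample collection.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "birds sing at dawn",
}
index = build_index(documents)
cat_hits = search(index, "cat")
```

Here `cat_hits` holds the ids of the two documents that mention "cat". Each query term costs one index lookup, so the documents themselves are never rescanned at query time.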
Many search engines process the search results to calculate the relevance of each identified document to the query. For instance, a score can be calculated for each document, using a statistical technique that accounts for the number of query terms that are matched in the document, the frequency of each of those terms in the index, the frequency of each term in the document compared to the total number of terms in the document, and the like. Based upon these scores, the documents are displayed to the user in order of their relevance to the query. By means of such an approach, the query does not have to be a precisely constructed formula for finding only those documents which exactly match the terms of the query. Rather, it can be a list of words, or a natural language sentence.
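One possible form of such a score is sketched below. The text does not fix a formula, so the TF-IDF-style weighting, the function name `score_documents`, and the sample data are assumptions chosen only to show how the named factors (matched query terms, term frequency across the collection, and term frequency relative to document length) might combine.

```python
import math
from collections import Counter

def score_documents(documents, query):
    """Rank documents by a simple TF-IDF-style score (illustrative only)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter()
    for terms in tokenized.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in set(query.lower().split()):
            if counts[term] == 0:
                continue  # query term not matched in this document
            tf = counts[term] / len(terms)               # frequency relative to document length
            idf = math.log(1 + n_docs / doc_freq[term])  # rarer terms weigh more
            score += tf * idf
        if score > 0:
            scores[doc_id] = score

    # Highest score first, i.e. in order of estimated relevance.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample collection and query.
sample_docs = {
    "d1": "cat cat dog",
    "d2": "dog bird fish",
    "d3": "bird sings",
}
ranked = score_documents(sample_docs, "cat dog")
```

Because each matched term adds to the score independently, the query can be a loose list of words rather than an exact formula. A document matching more of the terms, or matching rarer ones, ranks higher.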
Before a string of text from a document or other source of information can be indexed, it must be parsed into individual words. Preferably, the separated words are further processed to expedite the search and retrieval function. The process of separating a text string into individual words is known as tokenization. As a first step, the text is parsed into word tokens. A word token may or may not be a recognized word, i.e., a word which appears in a dictionary. After the word tokens have been identified, they are processed to eliminate those which do not serve as useful search terms.
A further process that can be carried out prior to indexing is known as "stemming". In essence, stemming is the reduction of words to their grammatical stems. This process serves two primary purposes. First, it helps to reduce the size of the index, since all forms of a word are reduced to a single stem, and therefore require only one entry in the index. Second, retrieval is improved, since a query which uses one form of a word will find documents containing all of the different forms.
Ideally, the stemming process is applied to all words that take different forms, and accounts for every possible form of each word. In this type of approach, stemming is highly language dependent. In the past, therefore, information search and retrieval systems which employed stemming were designed for a specific language. In particular, the rules that were used to reduce each word to its grammatical stem would typically apply to only one language, and could not be employed in connection with other languages. Consequently, a different search and retrieval mechanism had to be provided for each different language that might be encountered in the documents to be searched.
With the widespread accessibility of various information sources that is provided by today's computing environments, particularly when coupled with worldwide telecommunications facilities, such as the Internet, any given source of information might contain documents in multiple different languages. Furthermore, it is not uncommon for a single document to contain text in more than one language. In these types of environments, it would be impractical to have to identify the language of a document, and then employ a different search and retrieval system for each different language that might be encountered. It is an objective of the present invention, therefore, to provide a mechanism for indexing and searching textual content which is generic to a plurality of different languages.
In accordance with the present invention, a multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary for a given language. During the tokenization phase of the process, a string of text is separated into individual word tokens. Predetermined types of tokens, known as junk tokens and stop words, are eliminated from further processing. As a further step, characters with diacritical marks are converted into corresponding unmarked lower case letters, to eliminate match errors that might result from incorrectly accented words.
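The tokenization phase summarized above might be sketched as follows. The stop-word list, the definition of a junk token as one containing no letters, and the names `tokenize`, `is_junk`, and `fold_diacritics` are all assumptions made for illustration; the summary does not specify them.

```python
import re
import unicodedata

# Hypothetical stop-word list mixing several languages.
STOP_WORDS = {"the", "a", "of", "and", "le", "la", "de", "der", "die", "das", "und"}

def fold_diacritics(token):
    """Lower-case a token and strip combining marks (e.g. 'É' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_junk(token):
    """Assumed definition of a junk token: one that contains no letters."""
    return not any(ch.isalpha() for ch in token)

def tokenize(text):
    """Split text into word tokens, dropping junk tokens and stop words."""
    # Compose characters first so accented letters stay inside one token.
    text = unicodedata.normalize("NFC", text)
    tokens = []
    for raw in re.findall(r"\w+", text):
        if is_junk(raw):
            continue
        token = fold_diacritics(raw)
        if token in STOP_WORDS:
            continue
        tokens.append(token)
    return tokens

sample_tokens = tokenize("The café of Zürich, 2024, and the Señor")
```

Nothing in this sketch consults a dictionary or detects the language of the input. Folding "café" and "cafe" to the same token means a query typed without the accent still matches.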
The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. To expedite the stemming process, as well as expand subsequent retrieval, the stemming process is not directed to finding the true grammatical root form of a word. Rather, a known word ending is removed without any effort to guarantee that the remaining stem actually appears in a dictionary. For instance, a vowel change that normally occurs within a word, as a result of the addition of an ending, is ignored during the stemming process.
As a further feature, the stemming process is limited to word endings that are associated with nouns. This aspect of the invention is based on the assumption that nouns are much more significant than verbs, in terms of informational content in a query. Consequently, the major processing effort is directed to nouns.
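A minimal sketch of ending removal along these lines is shown below. The list of noun endings, the minimum stem length, and the name `stem` are hypothetical and chosen only for illustration; the actual endings supported for each language are not given in this summary.

```python
# Hypothetical noun endings pooled from several languages, longest first
# so that the longest matching ending is the one removed.
NOUN_ENDINGS = sorted(["s", "es", "en", "er", "ies", "e", "n"], key=len, reverse=True)

# Assumed guard so that very short words are not stripped to almost nothing.
MIN_STEM_LENGTH = 3

def stem(token):
    """Remove the first matching known ending, without checking that the
    remaining stem is a dictionary word."""
    for ending in NOUN_ENDINGS:
        if token.endswith(ending) and len(token) - len(ending) >= MIN_STEM_LENGTH:
            return token[: -len(ending)]
    return token

stems = [stem(word) for word in ["houses", "house", "frauen", "frau", "kinder", "kind"]]
```

In this sketch "houses" and "house" both reduce to "hous", which is not a dictionary word but serves equally well as an index key. One pooled list of endings is applied to every token, so no per-language rule set has to be selected.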
By means of these techniques, a uniform approach is provided for the tokenization and stemming of words across a variety of languages. Consequently, the search and retrieval engine can identify documents that may be relevant to the user's query, regardless of the particular language(s) appearing in a given document.
Further features of the invention, and the advantages achieved thereby, are described in detail hereinafter with reference to specific embodiments illustrated in the accompanying figures.