The present invention relates to a method of and an apparatus for forming an index. The invention also relates to a storage medium storing a program for performing the method, an index, a storage medium containing the index and the use of the index to access documents.
The techniques disclosed herein may be used for information management. Examples of such applications include information retrieval systems, such as search engines, for accessing information on the internet or in office information systems, information filtering applications (also known as information routing systems) and information extraction applications.
There are many databases which contain documents in machine-readable form and which can be accessed to locate and retrieve information. Similarly, there are various known techniques for locating documents on the basis of subject matter. One example of this is the collection of published patent specifications. Each patent specification is indexed according to subject matter, in accordance with the International Classification, when the specification is published. The content of each patent specification is analyzed in accordance with the International Classification and the relevant classification numbers for the subject matter form part of the heading of both the printed patent specification and the machine-readable form.
In order to locate patent specifications, or indeed other documents, whose collections are similarly classified according to subject matter, it is necessary to select the correct international class and to apply this to a searching system. The searching system then locates all patent specifications which have been classified in the same class. However, a disadvantage of this system is that efficient use requires familiarity with and experience of using the International Classification system. Also, this technique relies on correct classification of patent specifications. Inexperienced use can result in relevant patent specifications being missed whereas incorrect classification can prevent a relevant patent specification from ever being located by this technique.
Another known technique for information retrieval relies on the selection of keywords which are then used to search for relevant documents such as patent specifications. In this case, it is necessary to identify words which are likely to appear in the relevant documents but which are unlikely to appear in irrelevant documents. Searching using keywords then reveals all documents which contain the keywords or combination of keywords.
There are several difficulties with this technique. For instance, in the case of subject matter without well-defined or standard terminology it may be difficult or impossible to select all keywords which might identify relevant documents. On the other hand, the use of more general keywords can lead to the retrieval of very large numbers of documents, many of which are irrelevant. Further, such keywords can only be used for documents which are in the same language or which have been completely or partially translated or abstracted into the language of the keywords. The effectiveness of this technique in locating documents in other languages may therefore be poor or nonexistent.
D. A. Hull and G. Grefenstette, "Querying across languages: a Dictionary-Based Approach to Multilingual Information Retrieval", 19th Annual International Conference on Research and Development in Information Retrieval (SIGIR '96), pages 49-57, 1996 and D. W. Oard and B. J. Dorr, "A Survey of Multilingual Text Retrieval", Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies, April 1996, disclose two techniques for performing multilingual information retrieval, one based on document translation and the other based on query translation. In each case, each translation is to be performed by a machine translation system. Thus, in the case of document translation, a machine translation system is used to translate all of a collection of documents into a target language so that queries for locating and retrieving information, for instance based on the keyword technique described hereinbefore, may be performed in the source (document) language or in the target language. In the other technique, the documents are not translated but each query is translated into the source or document language and the translations are used to search the document collection.
A disadvantage with query translation is that queries often comprise a few words and may not even be in a sentence context. Thus, automatic linguistic processing of such queries can be difficult and may lead to unsatisfactory results, such as failure to locate relevant documents and location of irrelevant documents.
The use of automatic machine translation to translate whole collections of documents to form an index is also problematic. The resources required in terms of computing time and additional storage medium capacity make this technique unattractive. Although such processing need not be performed in real time and, in particular, is not required as part of each information retrieval request, substantial resources are necessary and there may be a continuing requirement as further documents are added to the collection. Translation into several target languages multiplies the resource requirements.
Machine translation systems also perform tasks which are not useful to information retrieval and, in particular, to the forming of a multilingual index. For instance, in addition to translating words and groups of words contained in documents, machine translation systems also attempt to produce a good quality translation which is readable for human beings. If the translation is merely required for indexing, functions such as correct word ordering in the target language are unnecessary and are therefore wasteful of computing resources.
A further disadvantage with machine translation systems when used to translate documents into a target language for indexing purposes is that the effectiveness of the index may be seriously compromised. Some machine translation systems generate a single preferred translation of an input text. In other words, such systems attempt to identify and produce a single translation which is judged according to automatic criteria within the system as the best translation. If that translation is incorrect, then retrieval of information based on the incorrect translation will be ineffective because relevant documents may not be located and irrelevant documents may be located.
Other machine translation systems attempt to generate all possible translations of input text. Thus, even if the correct translation is present, there may be many other translations which are inappropriate or wrong. The use of such translations for information retrieval results in the generation of spurious matches on queries posed to the system so that very large numbers of irrelevant documents may be located together with the relevant documents.
According to a first aspect of the invention, there is provided a method of forming, for a plurality of documents, an index comprising indexing features, the method comprising the steps of:
identifying each of at least some of the terms present in the documents;
generating from each identified term at least one equivalent term which is different from but linguistically related to the identified term; and
forming for each of the identified terms and the equivalent terms an indexing feature comprising the identified term or the equivalent term and an identifier of the or each document in which the identified term or the identified term to which the equivalent term is equivalent occurs.
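The three steps above can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions rather than the claimed implementation: the function names (build_index, identify_terms, generate_equivalents), the toy table of equivalents and the document identifiers D1 and D2 are all hypothetical, and a simple whitespace tokenizer stands in for whatever means is actually used to identify terms.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def build_index(
    documents: Dict[str, str],
    identify_terms: Callable[[str], Iterable[str]],
    generate_equivalents: Callable[[str], Iterable[str]],
) -> Dict[str, Set[str]]:
    """Map every identified term and every equivalent term to document identifiers."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in identify_terms(text):
            # Indexing feature for the term that actually occurs in the document.
            index[term].add(doc_id)
            # Indexing features for the linguistically related terms, each
            # pointing at the document containing the original term.
            for equivalent in generate_equivalents(term):
                index[equivalent].add(doc_id)
    return dict(index)


# Hypothetical toy resource standing in for a thesaurus or multilingual resource.
TOY_EQUIVALENTS: Dict[str, List[str]] = {
    "car": ["automobile", "voiture"],
    "engine": ["motor", "moteur"],
}


def identify_terms(text: str) -> List[str]:
    # Assumption: single words split on whitespace; collocations are ignored here.
    words = (word.strip(".,;").lower() for word in text.split())
    return [word for word in words if word]


def generate_equivalents(term: str) -> List[str]:
    return TOY_EQUIVALENTS.get(term, [])


documents = {"D1": "The car engine.", "D2": "A car."}
index = build_index(documents, identify_terms, generate_equivalents)
```

In this sketch the entry for "voiture" refers to both D1 and D2 even though that word occurs in neither document, which is the property the method relies on.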
The expression "term" used herein means an individual word, a group of linked words which occur adjacent each other in a document (a continuous collocation), or a group of words which are linked to each other but which are divided into at least two subgroups of words separated in a document by one or more words which are not members of the group (a non-continuous collocation).
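The distinction between continuous and non-continuous collocations can be made concrete with a short Python sketch. The names find_collocation and is_continuous, and the max_gap limit on the number of intervening words, are assumptions introduced purely for illustration; the text above does not prescribe any particular matching procedure or gap limit.

```python
from typing import List, Optional, Sequence


def find_collocation(
    tokens: Sequence[str], collocation: Sequence[str], max_gap: int = 3
) -> Optional[List[int]]:
    """Return positions of the collocation words, in order, or None if absent.

    At most max_gap non-member words may separate consecutive members.
    """
    if not collocation:
        return None

    def extend(pos: int, k: int) -> Optional[List[int]]:
        # pos is where collocation[k - 1] was matched; try to match the rest.
        if k == len(collocation):
            return []
        stop = min(len(tokens), pos + 2 + max_gap)
        for nxt in range(pos + 1, stop):
            if tokens[nxt] == collocation[k]:
                rest = extend(nxt, k + 1)
                if rest is not None:  # backtrack to a later candidate otherwise
                    return [nxt] + rest
        return None

    for start, token in enumerate(tokens):
        if token == collocation[0]:
            rest = extend(start, 1)
            if rest is not None:
                return [start] + rest
    return None


def is_continuous(positions: Sequence[int]) -> bool:
    """True when all matched words are adjacent in the document."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


adjacent = find_collocation("switch off the light".split(), ["switch", "off"])
separated = find_collocation("switch the light off".split(), ["switch", "off"])
```

Here "switch off" is found as a continuous collocation in the first sentence and as a non-continuous collocation in the second, where two non-member words separate its parts.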
The expression "identifier" as used herein is any means for identifying one or more locations of a term, for instance a heading or arbitrary serial number of a document containing the term. The expression "indexing feature" as used herein means a term and an identifier.
The expression "linguistically related" as used herein describes a term which has the same, a similar or a related meaning. For instance, linguistically related terms include synonyms, more general terms and more specific terms in the same (natural) language and translations into a different (natural) language.
Although the documents may be in any type of language, such as a high level computer programming language, the documents are preferably natural language documents.
The generating step may comprise accessing a thesaurus with each identified term and the equivalent terms may be synonyms of the identified terms, more general terms than the identified terms and more specific terms than the identified terms.
The generating step may comprise accessing a multilingual resource with each identified term and the equivalent terms may be translations of the identified terms, more general terms than the identified terms and more specific terms than the identified terms.
The multilingual resource may comprise a glosser. The glosser may be a limited non-deterministic glosser. The glosser may form a plurality of translations of at least one of the identified terms and may assign to each translation a priority according to the likelihood of the translation being correct.
The multilingual resource may comprise a bilingual dictionary.
The multilingual resource may comprise a machine translation system.
The identifying step may be performed by a part of speech tagger.
According to a second aspect of the invention, there is provided an apparatus for forming, for a plurality of documents, an index comprising indexing features, the apparatus comprising:
means for identifying each of at least some of the terms present in a document;
means for generating from each identified term at least one equivalent term which is different from but linguistically related to the identified term; and
means for forming for each of the identified terms and the equivalent terms an indexing feature comprising the identified term or the equivalent term and an identifier of the or each document in which the identified term or the identified term to which the equivalent term is equivalent occurs.
The generating means may comprise a thesaurus and the equivalent terms may be synonyms of the identified terms, more general terms than the identified terms and more specific terms than the identified terms.
The identifying means and the generating means may comprise a multilingual resource.
The multilingual resource may comprise a glosser. The glosser may be a limited non-deterministic glosser. The glosser may be arranged to form a plurality of translations of at least one of the identified terms and to assign to each translation a priority according to the likelihood of the translation being correct.
The multilingual resource may comprise a machine translation system.
The generating means may comprise a bilingual dictionary.
The identifying means may comprise a part of speech tagger.
The apparatus may comprise a programmed data processor.
According to a third aspect of the invention, there is provided a storage medium containing a program for controlling a data processor to perform a method according to the first aspect of the invention.
According to a fourth aspect of the invention, there is provided an index formed by a method according to the first aspect of the invention or by an apparatus according to the second aspect of the invention.
According to a fifth aspect of the invention, there is provided a storage medium containing an index according to the fourth aspect of the invention.
According to a sixth aspect of the invention, there is provided use of an index according to the fourth aspect of the invention to access the documents.
It is thus possible to form an index to a collection of documents having indexing features which are not restricted to the terms which occur in the documents. By making use of the thesaurus entries, synonymic, more general and more specific terms may be added to the index to increase the likelihood that an arbitrary query will locate a relevant document during information retrieval. By using multilingual resources, indexing may be performed in an efficient and effective manner in languages other than the source or document language.
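The effect on retrieval can be sketched as follows. This is a hedged illustration only: the lookup function, the toy index contents and the choice of requiring every query term to match (a conjunctive query) are assumptions, not features stated in the text.

```python
from typing import Dict, Iterable, Optional, Set


def lookup(index: Dict[str, Set[str]], query_terms: Iterable[str]) -> Set[str]:
    """Return identifiers of documents indexed under every query term."""
    result: Optional[Set[str]] = None
    for term in query_terms:
        postings = index.get(term, set())
        # Intersect the document sets so that all query terms must match.
        result = set(postings) if result is None else result & postings
    return result or set()


# Hypothetical index in which French equivalents were added at indexing time
# for two English documents D1 and D2.
toy_index: Dict[str, Set[str]] = {
    "car": {"D1", "D2"},
    "voiture": {"D1", "D2"},
    "engine": {"D1"},
    "moteur": {"D1"},
}

french_hits = lookup(toy_index, ["voiture", "moteur"])
mixed_hits = lookup(toy_index, ["car", "moteur"])
```

A French query locates the English document D1 without the query or the document being translated at retrieval time, because the equivalent terms were already placed in the index.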
Although any type of multilingual resource may be used, light-weight cross-linguistic glossing systems have advantages. Such a glossing system uses limited non-determinism to generate plausible target language translations to be used in indexing features. Such glossing systems or glossers may be of the type disclosed in EP 0 813 160 and GB 2 314 183, the contents of which are incorporated herein by reference. This type of glosser is capable of identifying and translating sequential (continuous) and non-sequential (non-continuous) collocations which are indexed by a headword. Further, this system can be used to ascribe priorities to alternative translations in such a way that consistent translations of complete sections of text are always obtained irrespective of which of several translations of a word or collocation is in fact selected. Further, the prioritising of alternative translations allows a limited number of such translations to be used, for instance based on the priority information.
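The use of priority information to limit the number of translations can be sketched in Python as follows. This does not reproduce the glosser of EP 0 813 160 or GB 2 314 183; the glossary table, the numeric priority scores and the function name top_translations are hypothetical and serve only to show how a fixed limit might be applied to prioritised alternatives.

```python
from typing import Dict, List, Tuple

# Hypothetical glossary: source term -> (translation, priority) pairs, where a
# higher priority is assumed to mean a more likely translation.
TOY_GLOSSARY: Dict[str, List[Tuple[str, float]]] = {
    "bank": [("rive", 0.25), ("banque", 0.60), ("berge", 0.15)],
    "car": [("voiture", 0.90)],
}


def top_translations(
    term: str, glossary: Dict[str, List[Tuple[str, float]]], limit: int = 2
) -> List[str]:
    """Return at most `limit` translations of `term`, most likely first."""
    candidates = glossary.get(term, [])
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [translation for translation, _ in ranked[:limit]]


bank_translations = top_translations("bank", TOY_GLOSSARY, limit=2)
```

With a limit of two, the lowest-priority alternative for "bank" is left out of the index, which sits between keeping a single preferred translation and keeping every possible translation.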
Such glossers are more efficient than machine translation systems. An index merely requires the identification and translation of terms and does not require other processing steps such as parsing and generation of a readable translation as provided by machine translation systems. Thus, the use of glossing is computationally more efficient in that substantially less computational time is required.
The use of a glosser can overcome the problems associated with selection by a machine translation system of a single most likely, but perhaps incorrect, translation and the selection of all possible translations including those which are incorrect and may be entirely inappropriate for indexing purposes. By using non-deterministic techniques, a limited number of most likely translations of the terms can be provided. There is a very high probability that this limited number of translations selected from all possible translations will include the best or correct translation. Accordingly, accessing documents using indexes formed in this way provides a high probability of locating all relevant documents while reducing the number of irrelevant documents which might otherwise be located.