1. Field of the Invention
The invention generally relates to a method and apparatus for generating normalized representations of strings, e.g. sentences, and in particular to a method for providing translation information for translating a string from a first language to a second language.
2. Description of the Related Art
Many applications make use of normalized representations of strings, e.g. sentences, in particular applications in the areas of translation memory, authoring memory, bilingual authoring memory, and indexing. An important application of normalized representations is the translation memory of a translation system. Such translation memories store linguistically based normalized representations of text. Translation memory repositories collect segments of text, such as sentences or technical terms, each associated with a translation into one or more target languages. Such repositories give human translators immediate access to translations that have been previously recorded. This reduces the effort, time and cost of translation while improving its consistency.
The capability of translation memories can be expanded through fuzzy matching, a technique that matches input segments yet to be translated with segments stored in the translation memory, even if the stored segments are not identical to the input segments. Typical measures for allowing fuzzy matches include ignoring a predefined set of words, such as articles and conjunctions, or ignoring a predefined set of symbols, in particular punctuation marks. Furthermore, upper case and lower case characters, or specific expressions such as numerical expressions, may be normalized. During the matching step of a retrieval process, string segments may be treated as ordered sequences of characters, regardless of their linguistic structure, and a mismatch of a certain number of characters may be allowed.
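The fuzzy matching measures described above can be illustrated with a minimal Python sketch. It is not taken from any particular translation memory product. The ignored-word list, the `<num>` placeholder, the 0.8 threshold, the function names and the sample memory entry are all assumptions made for illustration, and the character-level similarity comes from the standard library's `difflib.SequenceMatcher` rather than from a method named in the text.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of words to ignore (articles, conjunctions).
IGNORED_WORDS = {"a", "an", "the", "and", "or", "but"}


def normalize(segment: str) -> str:
    """Lower-case the segment, drop punctuation and ignored words,
    and replace every numerical expression by one placeholder."""
    # Numbers are matched first; everything else is split into word tokens.
    # Punctuation matches neither alternative and is therefore dropped.
    tokens = re.findall(r"\d+(?:[.,]\d+)*|\w+", segment.lower())
    kept = []
    for token in tokens:
        if token[0].isdigit():
            kept.append("<num>")      # normalize numerical expressions
        elif token not in IGNORED_WORDS:
            kept.append(token)
    return " ".join(kept)


def best_match(segment: str, memory: dict, threshold: float = 0.8):
    """Return the (source, translation) pair whose normalized source is most
    similar to the normalized input, or None if nothing reaches the threshold."""
    query = normalize(segment)
    best, best_score = None, threshold
    for source, translation in memory.items():
        # Character-level similarity in [0, 1]; some mismatch is tolerated.
        score = SequenceMatcher(None, query, normalize(source)).ratio()
        if score >= best_score:
            best, best_score = (source, translation), score
    return best


# Made-up memory entry, for illustration only.
memory = {"Press the red button 3 times.": "Drücken Sie den roten Knopf 3 Mal."}
match = best_match("Press red button 5 times!", memory)
```

In this sketch the two example segments normalize to the same string, so the stored translation is proposed even though the input differs in an article, a number and a punctuation mark. The translator would still have to adjust the number by hand.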
Retrieval systems serve to retrieve those texts or text portions that are relevant to the information needs of a user. In general, the relevant information contained in texts is extracted and stored as a normalized representation. Such a representation is abstracted away from the original linguistic form. A user's database queries are generally processed in order to expand the scope of the query and/or to interpret the query syntax. The extracted query information is then matched against the stored representations in order to retrieve the specific information contained in a text. The text unit or units most similar to the query are output as the retrieved text units.
For evaluating the retrieval performance of information retrieval systems, two criteria are used, namely the “calling rate” and the “precision”. These criteria are based on a subjective judgment of the relevance of the retrieved information. The “calling rate”, or “recall”, and the precision are defined as follows.
The recall is the ratio of the number of relevant retrieved text units to the total number of relevant text units stored in the database. The precision is the ratio of the number of relevant retrieved text units to the total number of retrieved text units. There is usually a trade-off between these two criteria. In information retrieval, it is desirable that both criteria be as close as possible to their maximum value of one.
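The two ratios can be made concrete with a short Python sketch. The function name, the use of integer identifiers for text units and the example sets are assumptions made for illustration only. Returning 0.0 for an empty denominator is likewise a convention chosen here, not something the definitions prescribe.

```python
def recall_and_precision(retrieved: set, relevant: set) -> tuple:
    """Compute (recall, precision) for one query.

    retrieved: identifiers of the text units returned by the system
    relevant:  identifiers of the text units in the database judged relevant
    """
    # Relevant retrieved text units are those present in both sets.
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Illustrative data: four relevant units, three retrieved, two in common.
recall, precision = recall_and_precision(retrieved={2, 3, 5}, relevant={1, 2, 3, 4})
```

With these sets the recall is 2/4 and the precision is 2/3. Retrieving all units in the database would raise the recall to one while lowering the precision, which is the trade-off mentioned above.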