The present invention relates to using data in determining a similarity between two words. More specifically, the present invention relates to processing noisy data in determining the similarity.
In natural language processing, many applications must determine the similarity between different words, for a variety of reasons.
A fairly straightforward example of a natural language processing system that determines word similarity is a thesaurus builder. In order to build a thesaurus, the natural language processing system receives an input word and finds a plurality of similar words, which have generally the same meaning as the input word. This is repeated for a variety of different input words and the thesaurus is built using the identified similar words.
Another example of an application that determines word similarity is machine translation. Machine translation is the process of receiving a textual input in a first language and translating it to a textual output in a second language. Machine translators sometimes use a thesaurus or other data store to find similarity between two different words.
Another example where word similarity is used is information retrieval. In information retrieval systems, a first textual input (sometimes referred to as a query) is received by an information retrieval system. The information retrieval system then executes the query against a database to return documents which are relevant to the query. In executing the query against the database, it is not uncommon for the query to be expanded. In order to expand the query, the information retrieval system identifies the content words in the query and attempts to find words having a similar meaning to the content words. The similar words are then added to the query to create an expanded query, and that expanded query is then executed against the database.
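The query expansion steps described above can be sketched as follows. This is only an illustrative sketch: the function name `expand_query`, the `similar_words` lookup table, and the `stop_words` set are hypothetical, and treating every non-stop word as a content word is a simplifying assumption rather than anything the described systems are stated to do.

```python
def expand_query(query, similar_words, stop_words):
    """Return the query terms plus similar words for each content word.

    Assumption: a "content word" is any term not listed in stop_words.
    Assumption: similar_words maps a word to a list of similar words,
    for example as looked up in a thesaurus.
    """
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        if term in stop_words:
            continue  # skip function words such as "for" or "the"
        for candidate in similar_words.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical similarity table and stop-word list, for illustration only.
similar_words = {"car": ["automobile", "vehicle"], "cheap": ["inexpensive"]}
stop_words = {"a", "the", "for"}

expanded = expand_query("cheap car for sale", similar_words, stop_words)
```

The expanded term list would then be executed against the database in place of the original query.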
In calculating similarity between words, many natural language processing systems use structured or annotated data. For example, in automated word classification systems, certain linguistic dependency structures are used to represent the contexts of the words to be classified. The structured linguistic data is used because it reveals the deeper syntactic and semantic relationships between words in a sentence.
One specific example of structured data is a dependency triple. Examples of dependency triples include <verb,OBJ,noun> and <noun,ATTRIB,adjective>. Such dependency triples indicate the syntactic and semantic relationships between words in a given sentence.
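By way of illustration only, a dependency triple might be represented in software as a three-field record. The field names `head`, `relation`, and `dependent`, the example sentence, and the `-of` suffix used to mark the inverse direction are all assumptions made for this sketch and are not taken from any particular parser.

```python
from typing import NamedTuple


class DependencyTriple(NamedTuple):
    head: str       # governing word, e.g. the verb in <verb,OBJ,noun>
    relation: str   # dependency label, e.g. "OBJ" or "ATTRIB"
    dependent: str  # governed word, e.g. the object noun


# Hypothetical triples a parser might emit for "She drinks hot tea."
triples = [
    DependencyTriple("drink", "OBJ", "tea"),
    DependencyTriple("tea", "ATTRIB", "hot"),
]


def contexts_of(word, triples):
    """Collect the (relation, other word) contexts in which `word` appears."""
    contexts = set()
    for t in triples:
        if t.head == word:
            contexts.add((t.relation, t.dependent))
        if t.dependent == word:
            # Assumed convention: mark the inverse direction with "-of".
            contexts.add((t.relation + "-of", t.head))
    return contexts


tea_contexts = contexts_of("tea", triples)
```

Here `tea_contexts` holds both the attribute context and the object context of "tea", which is the kind of contextual evidence a similarity computation could draw on.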
The triples (or other dependency structures) are generated using existing text parsers. One known way for generating such dependency structures is set out in U.S. Pat. No. 5,966,686, issued Oct. 12, 1999, entitled METHOD AND SYSTEM FOR COMPUTING SEMANTIC LOGICAL FORMS FROM SYNTAX TREES. Of course, a wide variety of other techniques are also known for generating different types of dependency structures.
One drawback with such systems is that conventional parsers tend to generate some dependency structures (such as the dependency triples mentioned above) that are incorrect. Parsed data that includes erroneous dependency structures is referred to as “noisy” data.
A variety of techniques have been attempted in the past to deal with noisy data. One traditional method for handling noisy data is to count the number of occurrences of each dependency structure in the training data. Dependency structures whose number of occurrences falls below a certain threshold level are simply assumed to be erroneous and are eliminated. The basic assumption behind this method is that low frequency dependency structures are more likely to have occurred by chance, and are thus more likely to be wrong.
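The traditional threshold method described above can be sketched in a few lines. The function name `filter_by_frequency`, the parameter `min_count`, and the sample triples are hypothetical and are shown only to make the counting step concrete.

```python
from collections import Counter


def filter_by_frequency(triples, min_count):
    """Keep only triples that occur at least `min_count` times."""
    counts = Counter(triples)
    return [t for t in triples if counts[t] >= min_count]


# Hypothetical parsed data: one frequent triple and two rare ones.
parsed = [
    ("drink", "OBJ", "tea"),
    ("drink", "OBJ", "tea"),
    ("drink", "OBJ", "tea"),
    ("read", "OBJ", "book"),    # rare
    ("drink", "OBJ", "table"),  # rare
]

kept = filter_by_frequency(parsed, min_count=2)
```

With a threshold of 2, only the three occurrences of the frequent triple survive. Both rare triples are discarded, because the method looks only at frequency.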
However, this method of handling noisy data does have disadvantages. For example, the parsed data will very likely have a large number of correct dependency structures which occur very infrequently. If all low frequency dependency structures are eliminated regardless of whether they are correct, a large amount of data will be lost. Thus, the technique may increase the precision of the parsed data set, that is, the proportion of retained dependency structures that are correct, but the recall of correct dependency structures will necessarily decrease.
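The precision and recall trade-off can be made concrete with a small worked example. The labeled data below is invented purely for illustration, and the function name `precision_recall` is hypothetical. Precision and recall are computed here over distinct dependency structures, with recall measured against the correct structures present in the parsed data, which is one reasonable reading of the terms as used above.

```python
def precision_recall(labeled, min_count):
    """Precision and recall of correct structures after thresholding.

    `labeled` maps each distinct triple to (occurrence count, is_correct).
    """
    kept = [ok for (count, ok) in labeled.values() if count >= min_count]
    total_correct = sum(ok for (_, ok) in labeled.values())
    precision = sum(kept) / len(kept) if kept else 0.0
    recall = sum(kept) / total_correct if total_correct else 0.0
    return precision, recall


# Invented example: four correct triples (two of them rare), three wrong.
labeled = {
    ("drink", "OBJ", "tea"): (5, True),
    ("eat", "OBJ", "apple"): (4, True),
    ("read", "OBJ", "book"): (1, True),
    ("sip", "OBJ", "cocoa"): (1, True),
    ("drink", "OBJ", "table"): (1, False),
    ("eat", "OBJ", "quickly"): (1, False),
    ("read", "OBJ", "slowly"): (2, False),
}

before = precision_recall(labeled, min_count=1)  # no filtering
after = precision_recall(labeled, min_count=2)   # drop singletons
```

In this invented data set, raising the threshold from 1 to 2 lifts precision from 4/7 to 2/3 while recall falls from 1.0 to 0.5, since two of the four correct structures occur only once.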
Another disadvantage of the prior technique of handling noisy data involves data sparseness. The parsed data is often sparse to begin with. Eliminating a large number of dependency structures simply because they occur relatively infrequently exacerbates the data sparseness problem.
Yet another disadvantage of eliminating low frequency dependency structures is that many of them are correctly parsed dependency structures. Therefore, not only does filtering out the low frequency dependency structures eliminate a large amount of data, it in fact eliminates a large amount of correct data.