1. Field of the Invention
The invention generally relates to a method and a computer system for disambiguating a phrase in a linguistic system, and in particular to part-of-speech tagging.
2. Description of the Related Art
Several techniques have been developed for part-of-speech (POS) tagging. The function of a part-of-speech tagger is to associate each word, or corresponding sub-unit, in a text with an abstract morpho-syntactic category that is represented by a tag. POS-tagged text is used in a variety of text manipulation processes, for example in a parser or syntactic analyzer that allows the recognition, extraction and normalization of semantic structures in the text. These structures may be used for text mining, indexing, understanding, and dialog systems.
In the following, part-of-speech tags are for brevity also referred to as tags or POS-tags. The abstraction to general categories in a POS-tagger allows the creation of effective multilingual parsers, since text analysis rules can be described using a limited number of categories rather than specific rules for each language.
Typically a POS-tagger performs three functions:
1) Tokenization: breaking a stream of text characters into tokens,
2) Lexical lookup: providing all potential part-of-speech tags for each token, and
3) Disambiguation: assigning a single part-of-speech tag to each token.
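The three functions listed above can be illustrated by the following minimal sketch. It is not taken from any particular POS-tagger: the lexicon, the tag names and the function names are hypothetical, and the disambiguation step is a deliberately naive placeholder that picks the first tag listed in the lexicon, whereas an actual tagger would use statistical or rule-based context information.

```python
import re

# Hypothetical toy lexicon: each entry lists every tag the word may carry.
# The order of the tags is an assumption made for this illustration only.
LEXICON = {
    "close": ["VERB", "NOUN", "ADJ"],
    "the": ["DET"],
    "door": ["NOUN"],
    "train": ["NOUN", "VERB"],
    "schedules": ["NOUN", "VERB"],
}


def tokenize(text):
    """Function 1: break a stream of characters into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lexical_lookup(tokens):
    """Function 2: provide all potential tags for each token."""
    return [(tok, LEXICON.get(tok.lower(), ["UNKNOWN"])) for tok in tokens]


def disambiguate(candidates):
    """Function 3: assign a single tag to each token.

    Placeholder strategy (an assumption, not a real method): take the first
    tag listed in the lexicon and ignore the surrounding context entirely.
    """
    return [(tok, tags[0]) for tok, tags in candidates]


def pos_tag(text):
    """Run the three functions in sequence."""
    return disambiguate(lexical_lookup(tokenize(text)))
```

Calling `lexical_lookup(tokenize("Train Schedules"))` in this sketch returns two candidate tags for each of the two tokens, which is exactly the ambiguity that the disambiguation function has to resolve.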
In experimental settings, POS-taggers can assign POS-tags correctly with an accuracy of more than 95%, but these tests are usually performed on text comprising complete sentences. In real-world applications, however, documents often contain text composed of incomplete sentences, e.g. titles, lists of items and subheadings. Such phrases are often incorrectly tagged by POS-taggers.
Technical manuals typically comprise lists of instructions having words like “press”, “open” or “hold” as first tokens. These words are ambiguous since they exist in the lexicon as both nouns and verbs. If the phrase is short, e.g. “close the door”, and the POS-tagger is not trained for grammatical structures beginning with a verb, the POS-tagger will not be able to disambiguate the phrase. Another example is the phrase “Train Schedules”, meaning “time tables for trains”: common POS-taggers would identify one of the two words as a verb.
Common disambiguation methods usually lead to partially inaccurate results for short phrases. Manual POS-tagging, in which a user imitates the POS-tagger, therefore often has to be performed.