1. Field of the Invention
The present invention relates to language translation systems. More particularly, the present invention relates to a method for reducing lexical ambiguity.
2. Background Information
With the continuing growth of multinational business dealings, where the global economy brings together business people of all nationalities, and with the ease and frequency of today's travel between countries, there is a compelling need for a machine-aided interpersonal communication system that provides accurate, near real-time language translation, whether in spoken or written form. Such a system would relieve users of the need to possess specialized linguistic or translational knowledge.
A typical language translation system functions by using natural language processing. Natural language processing is generally concerned with the attempt to recognize a large pattern or sentence by decomposing it into small subpatterns according to linguistic rules. A natural language processing system uses considerable knowledge about the structure of the language, including what the words are, how words combine to form sentences, what the words mean, and how word meanings contribute to sentence meanings. However, linguistic behavior cannot be completely accounted for without also taking into account another aspect of what makes humans intelligent: their general world knowledge and their reasoning abilities. For example, to answer questions, to participate in a conversation, or to create and understand written language, a person not only must have knowledge about the structure of the language being used, but also must know about the world in general and the conversational setting in particular. Specifically, phonetic and phonological knowledge concerns how words are related to sounds that realize them. Morphological knowledge concerns how words are constructed from more basic units called morphemes. Syntactic knowledge concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Typical syntactic representations of language are based on the notion of context-free grammars, which represent sentence structure in terms of what phrases are subparts of other phrases. This syntactic information is often presented in a tree form. Semantic knowledge concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used.
The representation of the context-independent meaning of a sentence is called its logical form. The logical form encodes possible word senses and identifies the semantic relationships between the words and phrases.
Natural language processing systems further include interpretation processes that map from one representation to the other. For instance, the process that maps a sentence to its syntactic structure and logical form is called parsing, and it is performed by a component called a parser. The parser uses knowledge about words and word meanings (the lexicon) and a set of rules defining the legal structures (the grammar) in order to assign a syntactic structure and a logical form to an input sentence.
Formally, a context-free grammar of a language is a four-tuple comprising a nonterminal vocabulary, a terminal vocabulary, a finite set of production rules, and a starting symbol for all productions. The nonterminal and terminal vocabularies are disjoint. The set of terminal symbols is called the vocabulary of the language. Pragmatic knowledge concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
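The four-tuple definition above can be made concrete with a short Python sketch. The class name `Grammar`, the toy rule set, and the helper `derive` are illustrative assumptions rather than part of any described system; the sketch only checks the disjointness condition and performs a leftmost derivation for a grammar that has one rule per nonterminal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """A context-free grammar as a four-tuple (hypothetical container)."""
    nonterminals: frozenset
    terminals: frozenset
    productions: tuple  # (lhs, rhs) pairs; rhs is a tuple of symbols
    start: str

    def __post_init__(self):
        # The nonterminal and terminal vocabularies must be disjoint.
        if self.nonterminals & self.terminals:
            raise ValueError("nonterminal and terminal vocabularies overlap")
        if self.start not in self.nonterminals:
            raise ValueError("starting symbol must be a nonterminal")


# Toy grammar (an assumption) that generates "the table is ready".
TOY = Grammar(
    nonterminals=frozenset({"S", "NP", "VP", "Det", "N", "V", "Adj"}),
    terminals=frozenset({"the", "table", "is", "ready"}),
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("Det", "N")),
        ("VP", ("V", "Adj")),
        ("Det", ("the",)),
        ("N", ("table",)),
        ("V", ("is",)),
        ("Adj", ("ready",)),
    ),
    start="S",
)


def derive(grammar):
    """Leftmost derivation using the first rule listed for each nonterminal."""
    rules = {}
    for lhs, rhs in grammar.productions:
        rules.setdefault(lhs, rhs)  # keep only the first rule per nonterminal
    sentence = [grammar.start]
    while any(sym in grammar.nonterminals for sym in sentence):
        i = next(k for k, sym in enumerate(sentence)
                 if sym in grammar.nonterminals)
        sentence[i:i + 1] = rules[sentence[i]]  # expand leftmost nonterminal
    return sentence


print(derive(TOY))
```

A realistic grammar has several rules per nonterminal, which is exactly where a parser has to search among alternatives instead of following a single derivation.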
A natural language processor receives an input sentence, lexically separates the words in the sentence, syntactically determines the types of words, semantically understands the words, pragmatically determines the type of response to generate, and generates the response. The natural language processor employs many types of knowledge and stores different types of knowledge in different knowledge structures that separate the knowledge into organized types.
The complexity of natural language processing is increased by lexical ambiguity in input sentences. Cases of lexical ambiguity may hinge on the fact that a particular word has more than one meaning. For example, the word bank can be used to denote either a place where monetary exchange and handling takes place or the land close to a river, the bank of the river. A word or a small group of words may also have two or more related meanings. For instance, the adjective bright may be used as a synonym for "shining" (e.g., "The stars are bright tonight") or as a synonym for "smart" (e.g., "She must be very bright if she made an 'A' on the test"). In the field of spoken language translation, the problem is compounded by words that are not necessarily spelled the same but are pronounced the same and have different meanings. For example, the words night and knight are pronounced exactly the same although they are spelled differently, and they have very different meanings.
Factors causing lexical ambiguity vary from one language to another. In character-based languages, e.g., Japanese, extracting information from an input sentence creates a serious problem because Japanese sentences do not have spaces between words. Part-of-speech (POS) tags are another factor causing lexical ambiguity. In many languages, including both word-based and character-based natural languages, one word may have more than one POS tag depending on the context of the word within the sentence. The word table, for example, can be a verb in some contexts (e.g., "He will table the motion") and a noun in others (e.g., "The table is ready"). The existence of multiword expressions in many languages, including the English language, is yet another factor contributing to lexical ambiguity. That is, depending on the context, a group of words, such as "white house", can be treated as a multiword expression (e.g., "I want to visit the White House") or as separate words (e.g., "He lives in a white house across the street").
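The two English ambiguity sources named above, multiple POS tags and multiword expressions, can be seen in a small lookup sketch. The dictionary `LEXICON`, its tag names, and the function `readings` are hypothetical and exist only for illustration. The function lists every one-word or two-word reading that the toy lexicon licenses for a sentence.

```python
# Hypothetical toy lexicon: surface form -> set of possible POS tags.
LEXICON = {
    "a": {"determiner"},
    "white": {"adjective"},
    "house": {"noun", "verb"},
    "white house": {"proper noun"},  # multiword expression
    "table": {"noun", "verb"},
}


def readings(words):
    """Return (start, end, text, tag) for every reading found in LEXICON.

    Spans of one or two words are looked up, so a multiword expression
    and its component words are reported side by side.
    """
    found = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + 2, len(words)) + 1):
            text = " ".join(words[i:j]).lower()
            for tag in sorted(LEXICON.get(text, ())):
                found.append((i, j, text, tag))
    return found


for reading in readings(["a", "white", "house"]):
    print(reading)
```

For "a white house" this yields five readings: the determiner, the adjective, the noun and verb readings of "house", and the two-word proper noun. Choosing among them requires context that the lexicon alone does not supply.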
One current approach that deals with lexical ambiguity in a Japanese input sentence involves treating each Japanese character as a word and letting the parser group the characters using the parsing grammar. After the parser defines the words, the parser must try all POS tags found for each word and rule out the impossible tags. As a result, the parsing program is time consuming and requires a large amount of space for its operation. If a long or complicated sentence is involved, such a parser may not be able to perform the parsing at all.
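The cost of trying every POS tag for every word can be estimated with a short sketch. The tag inventory `TAGS` below is an assumption made for illustration. The number of tag sequences is the product of the per-word tag counts, so it grows exponentially with sentence length.

```python
from itertools import product
from math import prod

# Hypothetical tag inventory for one short sentence.
TAGS = {
    "he": ["pronoun"],
    "will": ["modal", "noun", "verb"],
    "table": ["noun", "verb"],
    "the": ["determiner"],
    "motion": ["noun", "verb"],
}


def count_tag_sequences(words):
    """Number of tag assignments an exhaustive parser would have to consider."""
    return prod(len(TAGS[w]) for w in words)


def all_tag_sequences(words):
    """Enumerate every tag assignment explicitly."""
    return list(product(*(TAGS[w] for w in words)))


sentence = ["he", "will", "table", "the", "motion"]
print(count_tag_sequences(sentence))  # 1 * 3 * 2 * 1 * 2 = 12
```

Twelve candidates is manageable, but a twenty-word sentence in which each word has only two possible tags already gives 2**20 = 1,048,576 candidate sequences.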
Another current approach to deal with lexical ambiguity recognizes all the possible words in a Japanese sentence and then finds possible connections between adjacent words. The recognition of all the words is done using a morpheme dictionary. The morpheme dictionary defines Japanese morphemes with the names of POS tags. The connectivity is defined using a connection-pair grammar. The connection-pair grammar defines pairs of sets of morphemes that may occur adjacently in a sentence. Various costs are then applied to the morphemes to compare all possible segmentations of the input sentence. These various costs correspond to the likelihood of observing a word as a certain part of speech and to the likelihood of observing two words in adjacent positions. In this approach, the segmentation that has the lowest corresponding cost is selected from all the possible segmentations of the input sentence for further processing. However, the segmentation selected based upon the lowest costs may not correspond to the correct meaning of the input sentence. Since the syntactic parser is better equipped to recognize the correct meaning of the input sentence, making a selection before the parsing operation may result in loss of pertinent information. Consequently, this approach may lead to inaccurate results in producing a response to an input sentence, especially in producing a response to a longer or more complicated sentence.
The techniques currently used to deal with lexical ambiguity in an English sentence have problems similar to those identified above. Unlike Japanese sentences, English sentences do not need to be segmented as the individual words form the segments. However, multiple POS tags of a word present the same problem for English sentences as they do for Japanese sentences. As described above, one approach taken to deal with this problem requires the parser to try all POS tags found for each word and rule out the impossible tags. In this approach, the parsing program is very time consuming and requires a large amount of space for its operation. In addition, this approach may not be able to handle long and complicated sentences.
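The lowest-cost segmentation approach described earlier in this passage can be sketched as a dynamic program. Unspaced English text stands in for Japanese here, and the dictionary `MORPHEMES`, the table `CONNECT`, and every cost value are invented for illustration. Word costs model the likelihood of a word as a given part of speech, and connection costs model the likelihood of two tags being adjacent.

```python
# Hypothetical morpheme dictionary: surface form -> (POS tag, word cost).
MORPHEMES = {
    "no": ("det", 2.0),
    "now": ("adv", 1.0),
    "where": ("adv", 2.0),
    "here": ("adv", 1.0),
    "nowhere": ("adv", 3.5),
}

# Hypothetical connection-pair costs; missing pairs may not be adjacent.
CONNECT = {
    ("<s>", "det"): 0.5,
    ("<s>", "adv"): 0.5,
    ("det", "adv"): 3.0,
    ("adv", "adv"): 1.0,
}


def segment(text):
    """Return (cost, words) for the cheapest segmentation, or None."""
    # best[(position, tag of last word)] = (cost so far, words so far)
    best = {(0, "<s>"): (0.0, [])}
    for start in range(len(text)):
        states = [(tag, val) for (pos, tag), val in best.items()
                  if pos == start]
        for prev_tag, (cost, words) in states:
            for end in range(start + 1, len(text) + 1):
                entry = MORPHEMES.get(text[start:end])
                if entry is None:
                    continue  # not a known morpheme
                tag, word_cost = entry
                link = CONNECT.get((prev_tag, tag))
                if link is None:
                    continue  # this tag pair is not allowed
                total = cost + link + word_cost
                key = (end, tag)
                if key not in best or total < best[key][0]:
                    best[key] = (total, words + [text[start:end]])
    finals = [val for (pos, _), val in best.items() if pos == len(text)]
    return min(finals, key=lambda val: val[0]) if finals else None


print(segment("nowhere"))
```

With these invented costs the cheapest segmentation is "now" followed by "here", and the single-word reading "nowhere" is discarded before any parser sees it. That is the loss of information the passage warns about when one segmentation is selected ahead of parsing.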
Another approach analyzes all POS tags for each word in an English input sentence and finds the most likely POS tag for each word using lexical and statistical probabilities. However, some probabilities may be hard to estimate. No matter how much text is analyzed for the estimation, there will always be a large volume of words that appear only a few times. Thus, relying strictly on probabilities may not result in an accurate interpretation, especially in dealing with a long or complex sentence in which a word's meaning is dependent upon the context of the word within the sentence. As explained earlier, since the syntactic parser is better equipped to recognize the correct meaning of the input sentence, making a selection before the parsing operation may result in loss of pertinent information.
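A minimal sketch of the most-likely-tag idea, assuming a tiny hand-made tagged corpus (`TAGGED`) and hypothetical helpers `train` and `most_likely_tags`, shows both weaknesses mentioned above. The most frequent tag is applied regardless of context, and words absent from the training text have no estimate at all.

```python
from collections import Counter, defaultdict

# Hypothetical tagged training text (word, tag); far too small to be realistic.
TAGGED = [
    ("the", "det"), ("table", "noun"), ("is", "verb"), ("ready", "adj"),
    ("a", "det"), ("table", "noun"),
    ("he", "pron"), ("will", "modal"), ("table", "verb"),
    ("the", "det"), ("motion", "noun"),
]


def train(pairs):
    """Count how often each word was observed with each tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return counts


def most_likely_tags(words, counts):
    """Pick the most frequent tag per word; None for unseen words."""
    return [counts[w].most_common(1)[0][0] if w in counts else None
            for w in words]


counts = train(TAGGED)
print(most_likely_tags(["he", "will", "table", "the", "motion"], counts))
print(most_likely_tags(["the", "knight"], counts))
```

In the first sentence "table" is tagged as a noun because that tag was seen more often, although it is a verb in that context. In the second, "knight" receives no tag because it never occurred in the training text.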
Therefore, what is required is an efficient way of reducing lexical ambiguity which will provide an accurate interpretation of an input sentence without unreasonably burdening the operation of the syntactic parser.
A method and system for reducing lexical ambiguity in an input stream are described. In one embodiment, the input stream is broken into tokens. The tokens are used to create a connection graph comprising a number of paths. Each of the paths is assigned a cost. At least one best path is defined based upon a corresponding cost to generate an output graph. The generated output graph is provided to reduce lexical ambiguity.
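The flow summarized above (tokens, a connection graph of paths, a cost per path, and an output graph built from the best paths) might be sketched as follows. This is only an illustration under stated assumptions: the edge list `EDGES`, the rule that a path cost is the sum of its edge costs, and the choice to keep two paths are all invented here and are not taken from the described embodiment.

```python
import heapq

# Hypothetical connection graph over token boundaries 0..3 for
# "the white house": (start, end, label, cost).
EDGES = [
    (0, 1, "the/det", 1.0),
    (1, 2, "white/adj", 1.0),
    (2, 3, "house/noun", 1.0),
    (2, 3, "house/verb", 4.0),
    (1, 3, "white house/proper noun", 2.5),
]


def all_paths(edges, node, goal):
    """Yield every path (list of edges) from node to goal.

    Edges always move forward, so the recursion terminates.
    """
    if node == goal:
        yield []
        return
    for edge in edges:
        if edge[0] == node:
            for rest in all_paths(edges, edge[1], goal):
                yield [edge] + rest


def best_paths(edges, start, goal, keep=2):
    """Score each path by summed edge cost (an assumption); keep the cheapest."""
    scored = [(sum(e[3] for e in path), path)
              for path in all_paths(edges, start, goal)]
    return heapq.nsmallest(keep, scored, key=lambda item: item[0])


def output_graph(paths):
    """Union of the edges that occur on any kept path, in first-seen order."""
    kept = []
    for _, path in paths:
        for edge in path:
            if edge not in kept:
                kept.append(edge)
    return kept


best = best_paths(EDGES, 0, 3, keep=2)
for edge in output_graph(best):
    print(edge)
```

Keeping more than one path leaves both the multiword reading and the adjective-plus-noun reading available to a later parsing stage, while the costly verb reading of "house" is pruned.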