This invention relates to speech or voice translation systems. More particularly, this invention relates to a spoken language translation system that performs speech-to-speech translation.
Speech is the predominant mode of human communication because it is very efficient and convenient. Certainly, written language is very important, and much of the knowledge that is passed from generation to generation is in written form, but speech is a preferred mode for everyday interaction. Consequently, spoken language is typically the most natural, most efficient, and most expressive means of communicating information, intentions, and wishes. Speakers of different languages, however, face a formidable problem in that they cannot effectively communicate in the face of their language barrier. This poses a real problem in today's world because of the ease and frequency of travel between countries. Furthermore, the global economy brings together business people of all nationalities in the execution of multinational business dealings, a forum requiring efficient and accurate communication. As a result, a need has developed for a machine-aided interpersonal communication system that accepts natural fluent speech input in one language and provides an accurate near real-time output comprising natural fluent speech in another language. This system would relieve users of the need to possess specialized linguistic or translational knowledge. Furthermore, there is a need for the machine-aided interpersonal communication system to be portable so that the user can easily transport it.
A typical language translation system functions by using natural language processing. Natural language processing is generally concerned with the attempt to recognize a large pattern or sentence by decomposing it into small subpatterns according to linguistic rules. Until recently, however, natural language processing systems have not been accurate or fast enough to support useful applications in the field of language translation, particularly in the field of spoken language translation.
While the same basic techniques for parsing, semantic interpretation, and contextual interpretation may be used for spoken or written language, there are some significant differences that affect system design. For instance, with spoken input the system has to deal with uncertainty. In written language the system knows exactly what words are to be processed. With spoken language it only has a guess at what was said. In addition, spoken language is structurally quite different than written language. In fact, sometimes a transcript of perfectly understandable speech is not comprehensible when read. Spoken language occurs a phrase at a time, and contains considerable intonational information that is not captured in written form. It also contains many repairs, in which the speaker corrects or rephrases something that was just said. In addition, spoken dialogue has a rich interaction of acknowledgment and confirmation that maintains the conversation, which does not appear in written forms.
The basic architecture of a typical spoken language translation or natural language processing system processes sounds produced by a speaker by converting them into digital form using an analog-to-digital converter. This signal is then processed to extract various features, such as the intensity of sound at different frequencies and the change in intensity over time. These features serve as the input to a speech recognition system, which generally uses Hidden Markov Model (HMM) techniques to identify the most likely sequence of words that could have produced the speech signal. The speech recognizer then outputs the most likely sequence of words to serve as input to a natural language processing system. When the natural language processing system needs to generate an utterance, it passes a sentence to a module that translates the words into phonemic sequence and determines an intonational contour, and then passes this information on to a speech synthesis system, which produces the spoken output.
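The recognition step described above, in which an HMM is used to find the most likely sequence that could have produced the observed features, is commonly carried out with Viterbi decoding. The following is a minimal sketch under stated assumptions and does not describe any particular recognizer: the two word states, the quantized feature labels "a" and "b", and all probabilities are invented for illustration.

```python
import math

# Hypothetical toy model: hidden states are words, observations are
# quantized acoustic feature labels. All numbers are illustrative only.
STATES = ["hello", "world"]
START_P = {"hello": 0.8, "world": 0.2}
TRANS_P = {"hello": {"hello": 0.3, "world": 0.7},
           "world": {"hello": 0.4, "world": 0.6}}
EMIT_P = {"hello": {"a": 0.9, "b": 0.1},
          "world": {"a": 0.2, "b": 0.8}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # Each column maps state -> (log-probability of best path, predecessor).
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


decoded = viterbi(["a", "b", "b"], STATES, START_P, TRANS_P, EMIT_P)
```

Log-probabilities are used so that long observation sequences do not underflow; the sketch assumes every probability in the model is non-zero.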
A natural language processing system uses considerable knowledge about the structure of the language, including what the words are, how words combine to form sentences, what the words mean, and how word meanings contribute to sentence meanings. However, linguistic behavior cannot be completely accounted for without also taking into account another aspect of what makes humans intelligent: their general world knowledge and their reasoning abilities. For example, to answer questions or to participate in a conversation, a person not only must have knowledge about the structure of the language being used, but also must know about the world in general and the conversational setting in particular.
The different forms of knowledge relevant for natural language processing comprise phonetic and phonological knowledge, morphological knowledge, syntactic knowledge, semantic knowledge, and pragmatic knowledge. Phonetic and phonological knowledge concerns how words are related to the sounds that realize them. Such knowledge is crucial for speech-based systems. Morphological knowledge concerns how words are constructed from more basic units called morphemes. A morpheme is the primitive unit of meaning in a language. For example, the meaning of the word friendly is derivable from the meaning of the noun friend and the suffix -ly, which transforms a noun into an adjective.
Syntactic knowledge concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Typical syntactic representations of language are based on the notion of context-free grammars, which represent sentence structure in terms of what phrases are subparts of other phrases. This syntactic information is often presented in a tree form.
Semantic knowledge concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used. The representation of the context-independent meaning of a sentence is called its logical form. The logical form encodes possible word senses and identifies the semantic relationships between the words and phrases.
Natural language processing systems further comprise interpretation processes that map from one representation to the other. For instance, the process that maps a sentence to its syntactic structure and logical form is called parsing, and it is performed by a component called a parser. The parser uses knowledge about words and word meanings (the lexicon) and a set of rules defining the legal structures (the grammar) in order to assign a syntactic structure and a logical form to an input sentence. Formally, a context-free grammar of a language is a four-tuple comprising a nonterminal vocabulary, a terminal vocabulary, a finite set of production rules, and a starting symbol for all productions. The nonterminal and terminal vocabularies are disjoint. The set of terminal symbols is called the vocabulary of the language. Pragmatic knowledge concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
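The four-tuple definition of a context-free grammar given above can be sketched as a small data structure together with a recognizer. This is an illustrative sketch only: the `Grammar` class, the toy rules, and the use of CYK recognition (which assumes every rule has the form A -> B C or A -> word) are assumptions made for the example and are not drawn from any system discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset
    terminals: frozenset
    productions: tuple  # pairs of (left-hand side, right-hand side tuple)
    start: str

    def __post_init__(self):
        # The nonterminal and terminal vocabularies must be disjoint.
        assert not (self.nonterminals & self.terminals)


def recognizes(grammar, words):
    """CYK recognition; assumes rules are A -> B C or A -> word only."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        for lhs, rhs in grammar.productions:
            if rhs == (word,):
                table[i][i].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two subphrases
                for lhs, rhs in grammar.productions:
                    if (len(rhs) == 2 and rhs[0] in table[i][k]
                            and rhs[1] in table[k + 1][j]):
                        table[i][j].add(lhs)
    return grammar.start in table[0][n - 1]


# Hypothetical toy grammar for illustration.
TOY = Grammar(
    nonterminals=frozenset({"S", "NP", "VP", "Det", "N", "V"}),
    terminals=frozenset({"the", "dog", "cat", "sees"}),
    productions=(
        ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
        ("Det", ("the",)), ("N", ("dog",)), ("N", ("cat",)), ("V", ("sees",)),
    ),
    start="S",
)
```

With this toy grammar, `recognizes(TOY, "the dog sees the cat".split())` accepts the sentence, while a scrambled word order is rejected. A recognizer only answers yes or no; a parser would additionally record which rule filled each table cell in order to build the syntactic structure.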
The typical natural language processor, however, has realized only limited success because these processors operate only within a narrow framework. A natural language processor receives an input sentence, lexically separates the words in the sentence, syntactically determines the types of words, semantically understands the words, pragmatically determines the type of response to generate, and generates the response. The natural language processor employs many types of knowledge and stores different types of knowledge in different knowledge structures that separate the knowledge into organized types. A typical natural language processor also uses very complex capabilities. The knowledge and capabilities of the typical natural language processor must be reduced in complexity and refined to make the natural language processor manageable and useful because a natural language processor must have more than a reasonably correct response to an input sentence.
Identified problems with previous approaches to natural language processing are numerous and involve many components of the typical speech translation system. Regarding the spoken language translation system, one previous approach combines the syntactic rules for analysis together with the transfer patterns or transfer rules. As a result, the syntactic rules and the transfer rules become inter-dependent, and the system becomes less modular and difficult to extend in coverage or apply to a new translation domain.
Another previous approach to natural language processing combines the syntactic analysis rules with domain-specific semantic analysis rules and also adds examples as annotations to those rules. During analysis using this system, the example annotations assist in the selection of the analysis rule that should be applied. This approach suffers from the same lack of modularity and inter-dependence as the previous approach.
Still another previous approach to natural language translation performs a dependency analysis first, and then performs an example-based transfer. This approach improves upon modularity, but dependency analysis is not powerful enough to handle a wide range of linguistic expressions, as dependency analysis merely takes the words in the input and arranges them in a dependency graph in order to show which word linguistically depends on another word. This previous approach does not perform analysis and generation that is in-depth enough and detailed enough for high-quality translation across a wide range of spoken expressions that occur in natural dialogue.
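To make concrete what a dependency graph records, the following minimal sketch stores only head-dependent arcs between words. The sentence and its arcs are assigned by hand purely for illustration; no dependency analyzer is implemented, and the function names are hypothetical.

```python
# Hand-assigned arcs for "the dog chased a cat": (head index, dependent index).
WORDS = ["the", "dog", "chased", "a", "cat"]
ARCS = [(1, 0), (2, 1), (4, 3), (2, 4)]


def root(words, arcs):
    """Return the index of the one word that depends on no other word."""
    dependents = {d for _, d in arcs}
    roots = [i for i in range(len(words)) if i not in dependents]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def dependents_of(head, arcs):
    """Return the sorted indices of the words that depend directly on head."""
    return sorted(d for h, d in arcs if h == head)
```

Note that the structure contains nothing but word-to-word links: there are no phrase nodes, categories, or feature values, which reflects the limitation described above.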
Problems are also prevalent in previous approaches to performing syntactic analysis in example-based translation systems. One previous approach performs dependency analysis to obtain surface word dependency graphs for the input and the examples of the example database. The problem, however, with this approach is that dependency grammar lacks the expressiveness required for many common spoken language constructions.
Another previous approach to performing syntactic analysis in example-based translation systems used in a transfer-based machine translation system performs constituent transfer using a combined syntactic-semantic grammar that is annotated with examples. Similarly, a pattern-based machine translation system uses a context-free grammar that combines syntactic rules with translation patterns.
Combined syntactic-semantic grammars such as used in transfer-based machine translation systems and the pattern-based machine translation systems make knowledge acquisition and maintenance very difficult, since syntactic analysis and analogical transfer rules become heavily interdependent. Furthermore, even a context-free grammar with feature constraints is not expressive enough. Moreover, some light-verb and copula constructions cannot be handled without the power to exchange feature values between the verb and its object.
Still another previous approach to performing syntactic analysis in example-based translation systems is to separate syntactic analysis from example-based transfer, and perform dependency analysis on both the input string and the example data. This separation helps keep knowledge acquisition and maintenance simple, but dependency analysis is far less powerful for taking advantage of syntactic regularities found in natural language.
Example-based translation is a method for translation that uses bilingual example pairs to encode translation correspondences or translation knowledge. An example-based translation system uses an example database, a stored set of corresponding words, phrases, expressions, or sentences in the source and target languages. The typical example-based system performs the following steps: accepts input in the source language; matches the input to the source expressions of the example pairs in the example database, and finds the most appropriate example or examples; takes the target expressions from the best-matching examples and constructs an expression in the target language; and outputs the target language translation.
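The steps of a typical example-based system listed above can be sketched as follows. This is a deliberately simplified illustration: the three English-Spanish example pairs are invented, word-level similarity from Python's `difflib.SequenceMatcher` stands in for whatever matching measure a real system would use, and the target expression of the best example is returned unchanged rather than being constructed from several examples.

```python
from difflib import SequenceMatcher

# Hypothetical example database of (source, target) pairs.
EXAMPLES = [
    ("thank you very much", "muchas gracias"),
    ("good night", "buenas noches"),
    ("good afternoon", "buenas tardes"),
]


def similarity(a, b):
    """Word-level similarity between two expressions, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def translate(source, examples=EXAMPLES):
    """Match the input to the example sources and return the best target."""
    best_source, best_target = max(
        examples, key=lambda pair: similarity(source, pair[0])
    )
    return best_target
```

For instance, `translate("good night everyone")` selects the "good night" example because it shares the most words with the input. The sketch always returns some example, even for unrelated input; a practical system would need a minimum similarity threshold.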
A previous approach to solving the problem of performing example-based translation with examples having different degrees of specificity performs the following steps: performs dependency analysis on the example pairs in the example database; performs dependency analysis on the input expression; selects a set of example fragments that completely covers the input; constructs the target expression using the target fragments corresponding to the selected source fragments; and outputs the target language translation.
There are a number of problems with this previous approach. First, dependency analysis is not detailed enough to account for many natural language expressions as the matching is essentially performed on the words in the input. Second, this approach is limited to using examples that all have the same degree of linguistic specificity. That is, there is no way to use translation knowledge that ranges from the very general and abstract to the very precise and specific. The third problem with this approach is that for a match to be found, all arcs in the dependency tree are required to be matched. This means that it is not possible to delete or insert words. This kind of precise match is not useful for translating spoken language. The translation component in a spoken language translation system has to be able to handle input that has incorrectly added/deleted/substituted words because of mistakes in the speech recognizer. In addition, natural speech of people is not perfectly complete and grammatical; it also includes repeated words, omissions, and incomplete sentences.
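One familiar way to tolerate added, deleted, and substituted words when comparing a recognized input with a stored expression is word-level edit distance. The sketch below is offered only to illustrate the kind of inexact matching that an exact arc-by-arc match rules out; it is a standard Levenshtein computation over words, not the matching method of any system described here, and the function name is hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the hypothesis string into the reference string."""
    hyp, ref = hyp.split(), ref.split()
    # prev[j] is the distance between the hyp words seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # drop an extra word from the input
                cur[j - 1] + 1,          # supply a word missing from the input
                prev[j - 1] + (h != r),  # keep or substitute a word
            ))
        prev = cur
    return prev[-1]
```

A repeated word, as in "i want want a ticket" compared with "i want a ticket", costs a single edit, so the stored expression remains a close match even though the input is not word-for-word identical.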
English morphology is a relatively well understood linguistic phenomenon, but its computational treatment in natural language processing and the design and integration of a morphological analyzer with other components of a system can be performed using one of two previous approaches. The approach used depends on the envisioned application and efficiency considerations. The previous alternatives include not performing morphological analysis, and using two-level morphological analysis.
If no morphological analyzer is used in natural language processing applications, the only alternative for handling morphology is via a full-form dictionary, or a dictionary that contains each and every word inflection that can constitute an input as a separate dictionary entry (e.g. "walk"; "walks"; "walked"; "walking" . . . all have to be listed). The problem with this approach is that the system is required to have a large amount of memory to accommodate the dictionary and, because of the access time required, the language processing is inefficient.
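The storage cost of a full-form dictionary can be seen by contrasting it with a stem lexicon plus a small suffix table. The sketch below is a naive illustration only: the stems, suffixes, and feature labels are invented, spelling changes (as in "stopped") are ignored, and it represents neither two-level morphology nor the analyzer of the present invention.

```python
# Full-form dictionary: every inflection of every word is a separate entry.
FULL_FORM = {"walk", "walks", "walked", "walking",
             "talk", "talks", "talked", "talking"}

# Alternative: one entry per stem plus a shared table of suffixes.
STEMS = {"walk", "talk"}
SUFFIXES = {"": "base", "s": "3sg", "ed": "past", "ing": "prog"}


def analyze(word):
    """Split a word into (stem, feature), or return None if unknown."""
    for suffix, feature in SUFFIXES.items():
        stem = word[: len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and stem in STEMS:
            return stem, feature
    return None
```

Here the full-form dictionary needs eight entries for two verbs, whereas the stem lexicon and suffix table together need six and grow by only one entry per additional regular verb.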
Typical two-level morphological analyzers apply an array of morphological rules in parallel, with the rules being compiled into a Finite-State Transducer (FST) that relates the two levels. The problem with this analysis is that, while it allows for descriptions of a range of languages with more complicated morphology than English, it has the disadvantages of two-level morphology, notably slow processing speed, notational complexity, and the problem that correct analysis is possible only if the FST makes its way to the end.
A Generalized Left-to-Right (Generalized LR or GLR) parsing algorithm was developed as an extension of the Left-to-Right (LR) parsing algorithm to provide for efficient parsing of natural language. The graph-structured stack was also introduced for handling ambiguities in natural language. All the possible parse trees are stored in a data structure called the packed parse forest. The run-time parser is driven by a table that is pregenerated by a compiler that accepts context-free grammars.
One previous GLR parser supports grammatical specifications that consist of context-free grammar rules bundled with feature structure constraints. Feature structure manipulation is performed during parsing, and the result of parsing an input sentence consists of both a context-free parse tree and feature structure representations associated with the nodes in the parse tree. The problem with this parser is that it is implemented in List Processing (LISP), which is not efficient for practical use. Furthermore, its feature structure manipulations allow only unique slot-names, which is not suitable for shallow syntactic analysis where multiple slots are routinely needed. In addition, its local ambiguity packing procedure may cause incorrect results when implemented with feature structure manipulation.
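Feature structure constraints of the kind mentioned above are typically checked by unification. The following minimal sketch assumes feature structures are nested dictionaries with atomic leaf values; it omits shared (reentrant) substructures and is not a description of the parser discussed here. The feature names "agr", "num", and "per" are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures; return None if their values clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # copy so that neither input is modified
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # incompatible values under the same slot
                result[key] = merged
            else:
                result[key] = value
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None
```

For example, unifying `{"agr": {"num": "sg"}}` with `{"agr": {"per": 3}}` merges the two agreement features, while unifying singular with plural fails. Because each slot name is a dictionary key, this representation also allows only unique slot names, which mirrors the restriction noted above.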
Another previous GLR parser accepts arbitrary context-free grammar rules and semantic actions. It uses the GLR algorithm as its parsing engine, but handles semantic actions by separating them into two sets: a first set, intended for simple disambiguation instructions, which is executed during the parsing process; and a second set, intended for structure-building, which is executed after a complete first-stage parse has been found. The problem with this parser is that its two-stage design is impractical for large-scale natural language parsing because most actions must be duplicated in the second instruction set.
A method and an apparatus for performing spoken language translation are provided. A speech input is received comprising at least one source language. The speech input comprises words, sentences, and phrases in a natural spoken language. Source expressions are recognized in the source language. Misrecognitions of the source expressions resulting from factors comprising noise and speaker variation are minimized by the generation of intermediate data structures that encode at least one recognition hypothesis. Furthermore, misrecognitions are minimized by the generation of candidate recognized source expressions by processing the intermediate data structures using models comprising a general language model and a domain model. A recognized source expression is selected and confirmed by a user through a user interface. The recognized source expressions are translated from the source language to a target language, and a speech output is synthesized from the translated target language source expressions. Moreover, a meaning of the speech input is detected, and the meaning is rendered in the synthesized translated output.
The translation comprises performing morphological analysis of the recognized source expression in order to generate a sequence of analyzed morphemes. Syntactic source language analysis is performed using grammar rule-based processing and example-based processing in order to generate a source language syntactic representation. Source language to target language transfer is then performed using an example database. At least one target language syntactic representation is then generated, and target language syntactic generation is performed using a set of target language syntactic generation rules. A sequence of target language morpheme specifications is generated, and target language morphological generation is performed.
These and other features, aspects, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description and appended claims which follow.