1. Field of the Invention
The present invention is directed generally to speech-to-speech translation systems for cross-lingual communication, and more particularly, to a method and apparatus for field maintenance that enables users to add new vocabulary items and to improve and modify the content and usage of their system in the field, without requiring linguistic or technical knowledge or expertise.
2. Description of the Invention Background
Automatic speech recognition (ASR) and machine translation (MT) technologies have matured to the point where it has become feasible to develop practical speech translation systems on laptops or mobile devices for limited and unlimited domains. Domain-limited speech-to-speech systems, in particular, have been developed in research laboratories for a variety of application domains, including tourism, medical deployment and military applications. Examples of such systems are described in A. Waibel, C. Fugen, “Spoken language translation” in Signal Processing Magazine, IEEE May 2008; 25(3):70-79, In Proc. HLT, 2003; and Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Thilo Köhler, Sebastian Stüker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Stephan Vogel, Tanja Schultz and Alan W. Black, “The CMU TransTac 2007 eyes-free and hands-free two-way speech-to-speech translation system,” In Proc. of the IWSLT, Trento, Italy, October 2007. Such systems are limited, however, in that they operate with a fixed vocabulary that is defined in advance by the developers of the system and is determined by the application domain and the location where it is envisioned the system will be used. Thus, vocabularies and language usage are determined largely by example scenarios and by data that is collected or presumed in such scenarios.
In field situations, however, actual words and language usage deviate from the anticipated scenario of the laboratory. Even in simple domains such as tourism, language usage will vary dramatically in the field as a user travels to different locations, interacts with different people and pursues different goals and needs. Thus, new words and new expressions will always arise. Such new words, known in speech recognition parlance as “out-of-vocabulary” (OOV) words, will be misrecognized as in-vocabulary words and then translated incorrectly. The user may attempt a paraphrase, but if a critical word or concept (such as a person or a city name) cannot be entered or communicated, the absence of the word or expression may lead to a communication breakdown.
Despite the need for user-modifiable speech-to-speech translation systems, an actual solution has so far not been proposed. While adding a word to the system may seem easy, making such modifications proves to be extraordinarily difficult. Appropriate modifications must be made to many component modules throughout the entire system, and most modules would have to be retrained to restore the balance and integrated functioning of the components. Indeed, about 20 different modules would have to be modified or re-optimized to learn a new word. Such modifications require expertise and experience with the components of a speech translation system. As a result, to the inventor's understanding, such modifications have so far been done only in the laboratory by experts, at the expense of human expertise, time and cost.
For example, suppose a system designed for users in Europe does not contain the name “Hong Kong” in its vocabulary. When a speaker says the sentence “Let's go to Hong Kong”, the system will recognize the closest-sounding words in the dictionary and produce: “Let's go to home call”. At this point it is not obvious whether the error was the result of a recognition error or of the absence of this word from the entire speech-to-speech translation system. The user therefore proceeds to correct the system. This can be done by one of several correction techniques. The simplest might be re-speaking or typing, but it can alternatively be done more effectively by cross-modal error correction techniques as described by other disclosures and prior art (Waibel, et al., U.S. Pat. No. 5,855,000). Once the correct spelling of the desired word sequence has been established (“Let's go to Hong Kong”), the system performs a translation. If “Hong Kong” is in the dictionary, the system would proceed from there normally, performing translation and synthesis. If, however, it is absent from the recognition and translation dictionary, the system would need to establish whether this word is a named entity or not. Finally, and most importantly, even if a name or word can be translated properly to the output languages by user intervention, the system, without learning it, would fail again the next time the user speaks the same word.
Unfortunately, learning a new word cannot be accomplished simply by typing the new word into a word list; it requires changes at about 20 different points and at all levels of a speech translation system. Presently it also involves manual tagging and editing of entries, collection of extensive databases involving the required word, retraining of language model and translation model probabilities, and re-optimization of the entire system, so as to re-establish the consistency between all the components and the components' dictionaries and to restore the statistical balance between the words, phrases and concepts in the system (probabilities have to add up to 1, and thus all words would be affected by a single word addition).
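The remark that probabilities have to add up to 1 can be illustrated with a minimal sketch. It assumes a toy unigram language model stored as a Python dictionary; the function name add_word, the example words and all probability values are hypothetical and chosen only for illustration. It does not represent the method of any actual speech translation system, which would involve many more components than a single word list.

```python
def add_word(unigram, new_word, new_prob):
    """Return a new unigram model containing new_word.

    Existing probabilities are rescaled by (1 - new_prob) so that the
    distribution still sums to 1. Every existing word is therefore
    affected by the single addition.
    """
    if new_word in unigram:
        raise ValueError(f"{new_word!r} is already in the vocabulary")
    if not 0.0 < new_prob < 1.0:
        raise ValueError("new_prob must lie strictly between 0 and 1")
    scale = 1.0 - new_prob
    updated = {word: prob * scale for word, prob in unigram.items()}
    updated[new_word] = new_prob
    return updated


# Hypothetical toy model; the values are illustrative only.
lm = {"go": 0.5, "home": 0.3, "call": 0.2}
new_lm = add_word(lm, "hong_kong", 0.1)

for word, prob in new_lm.items():
    print(f"{word}: {prob:.3f}")
print(f"total: {sum(new_lm.values()):.3f}")
```

Even in this simplified setting, assigning probability mass to one new word lowers the probability of every other word, which suggests why a single vocabulary addition can ripple through the trained models of a full system.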
As a result, even small modifications of existing speech translation systems have generally required the use of advanced computing tools and linguistic resources found in research labs. For actual field use, however, it is unacceptable to require every modification to be done at the lab, since doing so takes too much time, effort and cost. Instead, a learning and customization module is needed that hides all the complexity from the user, performs all the critical operations and language processing steps semi-autonomously or autonomously behind the scenes, and interacts with the human user in the least disruptive manner possible by way of a simple intuitive interface, thereby eliminating the need for linguistic or technical expertise in the field altogether. The present invention provides a detailed description of a learning and customization module that satisfies these needs.
Unfortunately, translation systems are often so complex that giving users access to modify them is not practicable. Thus, there is a need for systems and methods that use machine translation techniques and enable user modification capabilities to provide cross-lingual communication without requiring linguistic or technical knowledge or expertise, making it possible to overcome language barriers and bring people closer together.