1. Field of the Invention
The present invention generally relates to automated speech processing. More particularly, the invention relates to methods and systems that convert speech to text, translate that text from one language to another language, and then display the translated text.
2. Background Art
Automated speech processing is used in many contexts, including automatically generated closed captions of broadcasts. Such broadcasts are now considered routine; they use both automatic speech recognition to create a transcription of a speaker's words, and automatic machine translation to translate the transcription from a source language into a target language. For example, in the TALES system, an Arabic, Chinese, or Spanish-language broadcast is automatically captioned with English text, and the meaning of the spoken part of the broadcast is made apparent to viewers who do not speak these languages.
A number of procedures are currently available for speech recognition, that is, converting speech to text. The procedures differ in accuracy, security, speed, tolerance of poor audio quality, and price. Court reporters or stenographers, for example, provide verbatim transcription but at a high price and with a time delay. Computer-based speech recognition is much less accurate but is less expensive and can be done substantially in real time. Transcription of stored messages, such as voice mail, is more difficult for computer-based speech recognition technology to perform accurately due to poor audio quality.
Machine translation, in general, makes use of computers to automate some or all of the process of translating text from one language to another. Originally, many machine translation systems used a word-based approach. Words were treated as the basic translation element; and, with some exceptions, each source language word was translated into a target language word to form the translation. Recently, significant advances have been made that use a phrase-based approach, which enables better handling of differences in linguistic typology, phrase recognition, and translation of idioms.
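The difference between the word-based and phrase-based approaches can be illustrated with a minimal sketch. The lexicon entries, the function names, and the greedy longest-match strategy below are assumptions chosen for illustration only; they do not describe any particular existing translation system.

```python
# Toy lexicons. Every entry is a hypothetical illustration,
# not data taken from any real translation system.
WORD_TABLE = {"he": "il", "kicked": "a frappé", "the": "le", "bucket": "seau"}
PHRASE_TABLE = {("kicked", "the", "bucket"): "est mort"}


def translate_word_based(tokens):
    """Translate each source word independently of its neighbors."""
    return " ".join(WORD_TABLE.get(tok, tok) for tok in tokens)


def translate_phrase_based(tokens, max_len=3):
    """Greedy longest-match lookup of contiguous phrases,
    falling back to the word table for a single token."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASE_TABLE:
                out.append(PHRASE_TABLE[key])
                i += n
                break
        else:
            # No phrase matched at this position: translate one word.
            out.append(WORD_TABLE.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)


sentence = "he kicked the bucket".split()
word_result = translate_word_based(sentence)      # literal, loses the idiom
phrase_result = translate_phrase_based(sentence)  # idiom handled as one unit
```

In this sketch the word-based routine renders the idiom literally, while the phrase-based routine translates the three-word idiom as a single unit. Only contiguous phrases are matched here.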
Many existing phrase-based translation systems still suffer from several disadvantages. For example, although they may robustly perform translations that are localized to a few consecutive words that have been recognized in training, most existing systems do not account for long-distance word dependency. Learning non-contiguous phrases, e.g., English-French pairs as simple as “not”→“ne . . . pas”, can still be difficult in current phrasal systems.
Both the automatic speech recognition (speech-to-text, STT) and machine translation (MT) components make mistakes. Furthermore, when STT and MT are used together, these mistakes may be compounded, because the erroneous output of the speech recognition component is used as the input to the machine translation component, which itself introduces further errors. Additionally, machine translation may substantially reorder the concepts in a sentence, often in ways that conflict with scene changes in a video that accompanies the speaker. As a result, the viewer may be left quite confused about the speaker's intended meaning, since superficially the erroneous parts of the transcript look very similar to the accurate parts. It would be desirable to provide additional visual cues to help the viewer focus on the parts of the transcript that are likely to be the most accurate and meaningful. Such a system may also convey other characteristics (metadata) of the speech recognition and machine translation system that are informative to the viewer.
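The compounding of STT and MT errors, and the idea of a visual cue for less reliable text, can be sketched as follows. This is a simplified illustration under stated assumptions: each component is assumed to report a confidence score between 0 and 1 for each segment, the two scores are assumed to be independent so that they can be multiplied, and the bracket marking and the 0.5 threshold are hypothetical choices rather than features of any described system.

```python
def combined_confidence(stt_conf, mt_conf):
    """Confidence of the cascaded pipeline for one segment.

    Assumes the two scores are independent probabilities, so the
    combined score can never exceed the weaker of the two components.
    """
    return stt_conf * mt_conf


def annotate(segments, threshold=0.5):
    """Mark segments whose combined confidence falls below a threshold.

    `segments` is a list of (translated_text, stt_conf, mt_conf) tuples.
    Low-confidence segments are wrapped in brackets with a question mark,
    a stand-in for whatever visual cue a display might actually use.
    """
    parts = []
    for text, stt_conf, mt_conf in segments:
        if combined_confidence(stt_conf, mt_conf) < threshold:
            parts.append("[" + text + "?]")
        else:
            parts.append(text)
    return " ".join(parts)


# Hypothetical segments and scores, for illustration only.
caption = annotate([
    ("the president said", 0.95, 0.90),
    ("taxes will rise", 0.60, 0.70),
])
```

Under these assumptions the second segment, whose individual scores each look moderate, falls below the threshold once the scores are multiplied, and it is the only segment that is marked.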