An automatic speech-to-speech (S2S) interpreter is an electronic interpreter that enables two people who speak different natural languages to communicate with each other.
The interpreter consists of a computer, which has a graphical and/or verbal interface; one or more audio input devices to detect input speech signals, such as a receiver or microphone; and one or more audio output devices such as a speaker. The core of the interpreter is the software, which comprises three components: a speech recognizer, an interpretation engine, and an output processor.
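As a rough illustration of how the three software components might be chained, the sketch below wires a speech recognizer, an interpretation engine, and an output processor into one pipeline. The class name `S2SInterpreter`, its field names, and the toy stand-in functions are assumptions made for this example only; they are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class S2SInterpreter:
    """Hypothetical three-stage pipeline; all names are illustrative."""
    recognize: Callable[[bytes], str]   # speech recognizer: audio -> source text
    interpret: Callable[[str], str]     # interpretation engine: source -> target text
    render: Callable[[str], bytes]      # output processor: target text -> audio

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.interpret(source_text)
        return self.render(target_text)


# Toy stand-ins: "audio" is just UTF-8 text, so the example stays self-contained.
toy = S2SInterpreter(
    recognize=lambda audio: audio.decode("utf-8"),
    interpret=lambda text: {"hello": "hola"}.get(text, text),
    render=lambda text: text.encode("utf-8"),
)

result = toy.run(b"hello")
```

Keeping each stage behind a plain callable means any one component can be swapped out without touching the other two.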
Automatic speech recognition (ASR) can be defined as the conversion of an input speech signal into text. The text may be a “one best” recognition, an “n best” recognition, or a word-recognition lattice, each with associated recognition confidences. In general, the broader the domain that an ASR engine is trained to recognize, the lower its recognition accuracy. This trade-off between recognition coverage and precision is a recurring theme in the field of pattern recognition and is fundamental to the assessment of each component's performance.
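The relationship between an “n best” list and a “one best” result can be sketched as follows. The `Hypothesis` type, the `one_best` function, the `min_confidence` parameter, and the confidence values are all hypothetical and chosen only for illustration. Raising the threshold rejects more inputs (less coverage) in exchange for fewer wrong answers (more precision), which is one simple way the coverage and precision trade-off can show up in practice.

```python
from typing import List, NamedTuple, Optional


class Hypothesis(NamedTuple):
    text: str
    confidence: float  # recognizer's confidence for this hypothesis


def one_best(n_best: List[Hypothesis],
             min_confidence: float = 0.0) -> Optional[Hypothesis]:
    """Return the most confident hypothesis, or None if it falls below
    the threshold (the input is then treated as not recognized)."""
    if not n_best:
        return None
    best = max(n_best, key=lambda h: h.confidence)
    return best if best.confidence >= min_confidence else None


# Made-up n-best list for a single utterance.
n_best = [
    Hypothesis("where is the station", 0.62),
    Hypothesis("wear is the station", 0.21),
    Hypothesis("where is the nation", 0.17),
]

accepted = one_best(n_best)                      # permissive: always answers
rejected = one_best(n_best, min_confidence=0.9)  # strict: rejects this input
```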
Interpretation is the task of mapping a representation in one language to a representation in another language. This can be done through a classifier, that is, by treating interpretation as the classification of speech input into one of many bins (see U.S. patent application Ser. No. 11/965,711), or through automatic machine translation (MT). MT is the task of translating text in one natural language into another language. Machine translation approaches generally fall into one or more of the following broad categories: rule-based machine translation (RBMT), template-based machine translation (TBMT), and statistical machine translation (SMT). A combination of these engines may be used to perform interpretation.
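To make the idea of interpretation by classification concrete, the following minimal sketch assigns recognized text to one of a few bins and returns the target-language phrase attached to that bin. It is not the method of the cited application: the bin names, example phrases, the Jaccard word-overlap score, and the `threshold` parameter are all assumptions made for illustration.

```python
from typing import Optional, Set, Tuple


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


# Hypothetical bins: name -> (example source phrases, target-language output).
BINS = {
    "ask_station": (["where is the station", "how do i get to the station"],
                    "¿Dónde está la estación?"),
    "greeting": (["hello", "good morning"], "Hola"),
}


def classify(text: str) -> Tuple[Optional[str], float]:
    """Pick the bin whose example phrase has the highest Jaccard word overlap."""
    tokens = tokenize(text)
    best_bin, best_score = None, 0.0
    for name, (examples, _target) in BINS.items():
        for example in examples:
            ex_tokens = tokenize(example)
            score = len(tokens & ex_tokens) / len(tokens | ex_tokens)
            if score > best_score:
                best_bin, best_score = name, score
    return best_bin, best_score


def interpret(text: str, threshold: float = 0.5) -> Optional[str]:
    """Return the bin's target phrase, or None when no bin is a good match."""
    name, score = classify(text)
    if name is None or score < threshold:
        return None
    return BINS[name][1]
```

Because the output is a fixed phrase per bin, small recognition errors (a dropped "the", for instance) can still land in the right bin, while out-of-domain input is rejected rather than mistranslated.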
Speech synthesis is often accomplished using a text-to-speech (TTS) processor, which converts the interpreted text into sound. Such systems are trained on recorded speech in the target language. Phone or word sequences are sampled and stitched together to derive the output signal.
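The stitching step can be pictured with a toy word-level example. The unit inventory `UNITS`, the sample values, and the `pause` parameter below are invented for illustration; a real TTS processor would select units from a large recorded corpus and smooth the joins.

```python
from typing import Dict, List

# Hypothetical inventory: each word maps to a short list of audio samples.
UNITS: Dict[str, List[float]] = {
    "hola": [0.1, 0.3, -0.2],
    "amigo": [0.0, 0.4],
}


def synthesize(text: str, units: Dict[str, List[float]],
               pause: int = 1) -> List[float]:
    """Concatenate the stored samples for each word, inserting `pause`
    silent samples between consecutive words."""
    signal: List[float] = []
    for i, word in enumerate(text.lower().split()):
        if word not in units:
            raise KeyError(f"no recorded unit for {word!r}")
        if i > 0:
            signal.extend([0.0] * pause)
        signal.extend(units[word])
    return signal


signal = synthesize("hola amigo", UNITS)
```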
S2S interpretation systems are subject to propagation of error. The quality of the input signal affects the quality of the speech recognition. Similarly, the quality of the recognized text directly affects the quality of the interpretation and thereby also the output that the system produces via the TTS processor. Additionally, each component contributes its own error. A robust S2S system minimizes these errors and improves the output of any one component by applying constraints from the succeeding component, so that errors introduced at one stage are less likely to propagate to the next.
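One way to picture a constraint from the succeeding component is to rescore the recognizer's candidate hypotheses by how well the interpretation engine can handle each one. The sketch below is a simplified assumption, not a description of any specific system: the vocabulary `INTERPRETER_VOCAB`, the coverage measure, the multiplicative score combination, and the confidence values are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical vocabulary that the interpretation engine can handle.
INTERPRETER_VOCAB = {"where", "is", "the", "station", "hello"}


def coverage(text: str) -> float:
    """Fraction of words in the hypothesis known to the interpreter."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in INTERPRETER_VOCAB for w in words) / len(words)


def rescore(n_best: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Pick the hypothesis maximizing ASR confidence times interpreter coverage."""
    return max(n_best, key=lambda h: h[1] * coverage(h[0]))


# Made-up recognizer output: the top ASR hypothesis contains an error.
n_best = [
    ("wear is the station", 0.45),
    ("where is the station", 0.40),
]

asr_only = max(n_best, key=lambda h: h[1])  # ignores the next stage
with_constraint = rescore(n_best)           # uses the next stage's constraint
```

Here the recognizer alone prefers the erroneous hypothesis, while the combined score (0.45 × 0.75 versus 0.40 × 1.0) favors the one the interpretation engine fully covers.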