1. Technical Field
The present invention relates to speech recognition and more particularly to systems and methods that employ bridging models to improve interaction between separately optimized speech translation components.
2. Description of the Related Art
State-of-the-art speech-to-speech (S2S) translation is usually implemented as a cascaded system connecting different modules, including automatic speech recognition (ASR), machine translation (MT) and text-to-speech (TTS) modules. Simply cascading these modules sequentially is far from optimal, because the modules are typically built independently and optimized separately, whereas robust and efficient end-to-end system performance is needed.
To illustrate, problems arise when MT simply takes ASR output as its input. ASR is not perfect, especially for speech with accents or under noisy conditions. Errors in the ASR output present clear challenges to MT engines, which are usually very sensitive to disfluency and recognition errors. For example, imagine “what incident occurred” was misrecognized as “white incident occurred”. In extreme cases, it takes only one or two misrecognized function/common words to break long phrases that otherwise could have been translated correctly. Consequently, the message carried over the speech-to-speech communication channel can be misinterpreted completely.
Another issue in connecting ASR and MT is the mismatch between the style of the translation model training data and that of the ASR hypotheses. While translation models are usually estimated from well-structured parallel corpora, ASR hypotheses for speech translation are usually in a spontaneous, informal spoken form. There will be mismatches between their respective vocabularies as well. For example, there are at least five alternative spelling variations for the common name “Muhammad”. It is quite possible that the set of alternative spellings used in ASR is not a subset of, or has no overlap with, the set used in MT.
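The vocabulary mismatch described above can be made concrete with a minimal sketch. The two spelling sets below are hypothetical; which variants a given ASR or MT component actually contains is an assumption made only for illustration, and the function name is not taken from any existing system.

```python
# Hypothetical vocabularies: the spellings assigned to each component are assumed.
ASR_SPELLINGS = {"Mohammed", "Mohamed", "Muhammed"}
MT_SPELLINGS = {"Muhammad", "Mohammad", "Mohamed"}


def vocabulary_mismatch(asr_forms, mt_forms):
    """Report how the ASR spelling set relates to the MT spelling set."""
    asr_forms, mt_forms = set(asr_forms), set(mt_forms)
    return {
        "shared": asr_forms & mt_forms,            # spellings both components know
        "asr_only": asr_forms - mt_forms,          # ASR output that MT cannot match
        "is_subset": asr_forms <= mt_forms,        # True only if MT covers every ASR form
        "disjoint": asr_forms.isdisjoint(mt_forms) # True if there is no overlap at all
    }


report = vocabulary_mismatch(ASR_SPELLINGS, MT_SPELLINGS)
```

Any spelling reported under `asr_only` would reach the MT component as an unknown word, even though both components model the same name.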
Similarly, a mismatch of vocabulary and training corpora can also occur between MT and TTS. For example, punctuation can provide important clues for generating prosody information. However, the MT output in speech translation usually carries no word duration or punctuation information.
To improve system robustness, tighter integrations between ASR and MT have been suggested. One approach is to translate the top N-best ASR hypotheses rather than only the single best hypothesis. In another approach, the machine translation component directly takes the word lattice generated by the ASR module as input. Similarly, the ASR module may produce a word confusion network and send it to the MT component for translation. The N-best list, word lattice or confusion network provides more information than the single best hypothesis. However, these approaches have been shown to be ineffective in improving translation system performance and robustness. The variations in the N-best list, word lattice and confusion network are limited by the ASR module. These approaches also significantly increase the MT computation cost.
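The N-best approach can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation of any particular prior system: the score combination (a weighted sum of log-scores), the function names, and the toy translation table with its placeholder target strings are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: an ASR hypothesis is (text, asr_log_score).
Hypothesis = Tuple[str, float]


def translate_nbest(
    nbest: List[Hypothesis],
    translate: Callable[[str], Tuple[str, float]],
    asr_weight: float = 1.0,
    mt_weight: float = 1.0,
) -> Tuple[str, str, float]:
    """Translate every hypothesis; return (source, translation, combined score)
    for the hypothesis with the highest weighted sum of ASR and MT scores."""
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    best = None
    for text, asr_score in nbest:
        translation, mt_score = translate(text)  # one MT call per hypothesis
        combined = asr_weight * asr_score + mt_weight * mt_score
        if best is None or combined > best[2]:
            best = (text, translation, combined)
    return best


# Toy MT stand-in (assumed): known phrases score 0.0, unknown input scores -5.0.
TOY_TABLE = {"what incident occurred": "<target: what incident occurred>"}


def toy_translate(text: str) -> Tuple[str, float]:
    if text in TOY_TABLE:
        return TOY_TABLE[text], 0.0
    return "<unknown>", -5.0


nbest_list = [("white incident occurred", -1.0), ("what incident occurred", -1.2)]
source, target, score = translate_nbest(nbest_list, toy_translate)
```

In this toy case the second-ranked hypothesis wins because it translates cleanly. The sketch also makes both stated limitations visible: MT is invoked once per hypothesis, so cost grows with N, and a correct transcription that is absent from the N-best list can never be selected.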
In building a speech translation system, the ASR and MT modules can interact. For example, speech recognition receives feedback from the MT module and then adapts an acoustic model to improve recognition robustness. However, such feedback for model training/adaptation can be carried out between ASR and MT only in offline model training.
An alternative approach to improving system robustness performs a kind of normalization or transformation on the ASR output before sending the output for translation. Speech reconstruction using parsing algorithms was proposed, in which disfluencies such as short repetitions in ASR hypotheses are detected and repaired to generate more grammatically correct output. The goal was to make the ASR output more readable and accessible to human beings and other downstream applications. However, the applicability of parsing techniques to reconstruction is limited since spoken language can be quite informal. Moreover, the parsing techniques usually ignore phonetic clues and do not model translatability directly.
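The kind of repair targeted by such reconstruction can be illustrated with a minimal sketch. This is not the parsing-based method itself; it is a much simpler surface heuristic, assumed here only for illustration, that drops immediately repeated word sequences of up to `max_len` words. The function name and the `max_len` parameter are hypothetical.

```python
from typing import List


def remove_short_repetitions(words: List[str], max_len: int = 2) -> List[str]:
    """Drop any word sequence of length <= max_len that immediately repeats
    the words just emitted (e.g. 'i want i want' becomes 'i want')."""
    out: List[str] = []
    i = 0
    while i < len(words):
        skipped = False
        # Try the longest repetition first so 'i want i want' collapses as a unit.
        for n in range(max_len, 0, -1):
            if len(out) >= n and i + n <= len(words) and out[-n:] == words[i:i + n]:
                i += n  # skip the repeated span
                skipped = True
                break
        if not skipped:
            out.append(words[i])
            i += 1
    return out


repaired = " ".join(remove_short_repetitions("i want i want to to go".split()))
```

A purely surface rule like this also removes intentional repetitions such as "very very", and it looks at neither phonetic clues nor translatability.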
Another approach to transforming ASR hypotheses is called canonicalization, in which the ASR output is analyzed and canonicalized into one of many predefined, semantically structured formats for which human translations are memorized. The usefulness of this method is limited in free-form speech translation, since the method can handle only limited variations of a finite set of templates.
Similar to canonicalization, speech translation systems may have a database of sentence lists, e.g., frequently spoken sentences/phrases, for which human translations are memorized. Given an ASR hypothesis, an information retrieval method is applied to identify those sentences in the database that are similar to the ASR output. Users are then directed to select among the ASR output and the retrieved similar sentences. Like canonicalization, this method is also limited by the database.
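The retrieval step can be sketched as follows. This is a minimal illustration that assumes character-level similarity from Python's standard `difflib` module as a stand-in for whatever information retrieval method such a system actually uses. The database contents, the placeholder translations, and the `retrieve_similar` name are hypothetical.

```python
import difflib
from typing import Dict, List, Tuple

# Hypothetical database: frequently spoken sentence -> memorized human translation.
SENTENCE_DB: Dict[str, str] = {
    "what incident occurred": "<stored translation 1>",
    "where is the hospital": "<stored translation 2>",
}


def retrieve_similar(
    asr_hypothesis: str,
    database: Dict[str, str],
    top_k: int = 3,
    cutoff: float = 0.6,
) -> List[Tuple[str, str]]:
    """Return up to top_k (sentence, translation) pairs whose sentence is
    similar to the ASR hypothesis, most similar first."""
    matches = difflib.get_close_matches(
        asr_hypothesis, list(database), n=top_k, cutoff=cutoff
    )
    return [(sentence, database[sentence]) for sentence in matches]


# Candidates that would be shown to the user next to the raw ASR output.
candidates = retrieve_similar("white incident occurred", SENTENCE_DB)
```

An utterance with no sufficiently similar entry yields an empty candidate list, which is the database limitation noted above.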