Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Speech processing systems such as text-to-speech (TTS) systems and automatic speech recognition (ASR) systems may be employed, respectively, to generate synthetic speech from text and to generate text from audio utterances of speech.
A first example TTS system may concatenate one or more recorded speech units to generate synthetic speech. A second example TTS system may concatenate one or more statistical models of speech to generate synthetic speech. A third example TTS system may concatenate recorded speech units with statistical models of speech to generate synthetic speech. In this regard, the third example TTS system may be referred to as a hybrid TTS system.
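By way of illustration only, the hybrid arrangement described above may be sketched as follows. The sketch assumes a toy inventory of recorded speech units keyed by phonetic label and a stand-in statistical model that generates samples for any label lacking a recording. The names `Segment`, `synthesize_hybrid`, `concatenate`, and `model_generate` are hypothetical and are not drawn from any particular TTS system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    unit: str             # phonetic label for this segment
    source: str           # "recorded" or "model"
    samples: List[float]  # toy waveform samples


def synthesize_hybrid(
    units: List[str],
    recorded: Dict[str, List[float]],
    model_generate: Callable[[str], List[float]],
) -> List[Segment]:
    """Pick a recorded unit when one exists; otherwise fall back to the model."""
    segments = []
    for unit in units:
        if unit in recorded:
            segments.append(Segment(unit, "recorded", list(recorded[unit])))
        else:
            # Assumption: the statistical model can generate samples for any unit.
            segments.append(Segment(unit, "model", model_generate(unit)))
    return segments


def concatenate(segments: List[Segment]) -> List[float]:
    """Join the per-unit samples end to end into a single waveform."""
    waveform: List[float] = []
    for segment in segments:
        waveform.extend(segment.samples)
    return waveform
```

In this sketch, a system using only the `recorded` branch would correspond to the first example TTS system, and a system using only the `model_generate` branch would correspond to the second. A practical system may also smooth the joins between adjacent segments, which is omitted here.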
Some ASR systems use “training,” in which an individual speaker reads sections of text into the speech recognition system. These systems analyze the voice of that speaker and use the analysis to fine-tune recognition of that speaker's speech, resulting in more accurate transcription. Systems that do not use training may be referred to as “Speaker Independent” systems. Systems that use training may be referred to as “Speaker Dependent” systems.
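As a simplified illustration of the distinction, the following sketch assumes a toy recognizer that maps a single numeric acoustic feature to the word whose reference value is nearest. Training is modeled as shifting each reference value toward the average of the features the speaker produced while reading known text. The functions `recognize` and `adapt`, the `weight` parameter, and the one-dimensional feature are all assumptions made for illustration; practical ASR systems use far richer acoustic models.

```python
from typing import Dict, List


def recognize(feature: float, means: Dict[str, float]) -> str:
    """Return the word whose reference value lies closest to the feature."""
    return min(means, key=lambda word: abs(feature - means[word]))


def adapt(
    means: Dict[str, float],
    enrollment: Dict[str, List[float]],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Shift each reference value toward the speaker's own average.

    `enrollment` maps each word the speaker read aloud to the features
    observed for it. `weight` (hypothetical) controls how strongly the
    speaker's data overrides the speaker-independent value.
    """
    adapted = dict(means)
    for word, features in enrollment.items():
        if word in adapted and features:
            speaker_mean = sum(features) / len(features)
            adapted[word] = (1 - weight) * adapted[word] + weight * speaker_mean
    return adapted
```

Here the unmodified `means` plays the role of a speaker-independent system, and the dictionary returned by `adapt` plays the role of a speaker-dependent one. A feature that falls nearer the wrong word under the former may be recognized correctly under the latter.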
Such speech processing systems may operate in a single language such as a system language or native language. In one example, a TTS system may generate synthetic English language speech that corresponds to English language text input to the TTS system. In another example, an ASR system may map audio utterances of speech by an English language speaker to English language text.