1. Field of Invention
The techniques described herein are directed generally to the field of speech synthesis, and more particularly to techniques for providing speech output for speech-enabled applications.
2. Description of the Related Art
Speech-enabled software applications exist that are capable of providing output to a human user in the form of speech. For example, in an interactive voice response (IVR) application, a user typically interacts with the software application using speech as a mode of both input and output. Speech-enabled applications are used in many different contexts, such as telephone call centers for airline flight information, banking information and the like, global positioning system (GPS) devices for driving directions, e-mail, text messaging and web browsing applications, handheld device command and control, and many others. When a user communicates with a speech-enabled application by speaking, automatic speech recognition is typically used to determine the content of the user's utterance and map it to an appropriate action to be taken by the speech-enabled application. This action may include outputting to the user an appropriate response, which is rendered as audio speech output through some form of speech synthesis (i.e., machine rendering of speech). Speech-enabled applications may also be programmed to output speech prompts to deliver information or instructions to the user, whether in response to a user input or to other triggering events recognized by the running application.
Techniques for synthesizing output speech prompts to be played to a user as part of an IVR dialog or other speech-enabled application have conventionally been of two general forms: concatenated prompt recording and text to speech synthesis. Concatenated prompt recording (CPR) techniques require a developer of the speech-enabled application to specify the set of speech prompts that the application will be capable of outputting, and to code these prompts into the application. Typically, a voice talent (i.e., a particular human speaker) is engaged during development of the speech-enabled application to speak various word sequences or phrases that will be used in the output speech prompts of the running application. These spoken word sequences are recorded and stored as audio recording files, each referenced by a particular filename. When specifying an output speech prompt to be used by the speech-enabled application, the developer designates a particular sequence of audio prompt recording files to be concatenated (e.g., played consecutively) to form the speech output.
FIG. 1A illustrates steps involved in a conventional CPR process to synthesize an example desired speech output 110. In this example, the desired speech output 110 is, “Arriving at 221 Baker St. Please enjoy your visit.” Desired speech output 110 could represent, for example, an output prompt to be played to a user of a GPS device upon arrival at a destination with address 221 Baker St. To specify that such an output prompt should be synthesized through CPR in response to the detection of such a triggering event by the speech-enabled application, a developer would enter the output prompt into the application software code. An example of the substance of such code is given in FIG. 1A as example input code 120.
Input code 120 illustrates example pieces of code that a developer of a speech-enabled application would enter to instruct the application to form desired speech output 110 through conventional CPR techniques. Through input code 120, the developer directly specifies which pre-recorded audio files should be used to render each portion of desired speech output 110. In this example, the beginning portion of the speech output, “Arriving at”, corresponds to an audio file named “i.arrive.wav”, which contains pre-recorded audio of a voice talent speaking the word sequence “Arriving at” at the beginning of a sentence. Similarly, an audio file named “m.address.hundreds2.wav” contains pre-recorded audio of the voice talent speaking the number “two” in a manner appropriate for the hundreds digit of an address in the middle of a sentence, and an audio file named “m.address.units21.wav” contains pre-recorded audio of the voice talent speaking “twenty-one” in a manner appropriate for the units of an address in the middle of a sentence. These audio files are selected and ordered as a sequence of audio segments 130, which are ultimately concatenated to form the speech output of the speech-enabled application. To specify that these particular audio files be selected for the various portions of the desired speech output 110, the developer of the speech-enabled application enters their filenames (i.e., “i.arrive.wav”, “m.address.hundreds2.wav”, etc.) into input code 120 in the proper sequence.
For some specific types of desired speech output portions (generally conveying numeric information), such as the address number “221” in desired speech output 110, an application using conventional CPR techniques can also issue a call-out to a separate library of function calls for mapping those specific word types to audio recording filenames. For example, for the “221” portion of desired speech output 110, input code 120 could contain code that calls the name of a specific function for mapping address numbers in English to sequences of audio filenames and passes the number “221” to that function as input. Such a function would then apply a hard coded set of language-specific rules for address numbers in English, such as a rule indicating that the hundreds place of an address in English maps to a filename in the form of “m.address.hundreds_.wav” and a rule indicating that the tens and units places of an address in English map to a filename in the form of “m.address.units_.wav”. To make use of such function calls, a developer of a speech-enabled application would be required to supply audio recordings of the specific words in the specific contexts referenced by the function calls, and to name those audio recording files using the specific filename formats referenced by the function calls.
In the example of FIG. 1A, the “Baker” portion of desired speech output 110 does not correspond to any available audio recordings pre-recorded by the voice talent. For example, in many instances it can be impractical to engage the voice talent to pre-record speech audio for every possible street name that a GPS application may eventually need to include in an output speech prompt. For such desired speech output portions that do not match any pre-recorded audio, speech-enabled applications relying primarily on CPR techniques are typically programmed to issue call-outs (in a program code form similar to that described above for calling out to a function library) to a separate text to speech (TTS) synthesis engine, as represented in portion 122 of example input code 120. The TTS engine then renders that portion of the desired speech output as a sequence of separate subword units such as phonemes, as represented in portion 132 of the example sequence of audio segments 130, rather than a single audio recording as produced naturally by a voice talent.
Text to speech (TTS) synthesis techniques allow any desired speech output to be synthesized from a text transcription (i.e., a spelling out, or orthography, of the sequence of words) of the desired speech output. Thus, a developer of a speech-enabled application need only specify plain text transcriptions of output speech prompts to be used by the application, if they are to be synthesized by TTS. The application may then be programmed to access a separate TTS engine to synthesize the speech output. Conventional TTS engines most commonly produce output audio using concatenative text to speech synthesis, whereby the input text transcription of the desired speech output is analyzed and mapped to a sequence of subword units such as phonemes. The concatenative TTS engine typically has access to a database of small audio files, each audio file containing a single subword unit (e.g., a phoneme or a portion of a phoneme) excised from many hours of speech pre-recorded by a voice talent. Complex statistical models are applied to select preferred subword units from this large database to be concatenated to form the particular sequence of subword units of the speech output.
Other techniques for TTS synthesis exist that do not involve recording any speech from a voice talent. Such TTS synthesis techniques include formant synthesis and articulatory synthesis, among others. In formant synthesis, an artificial sound waveform is generated and shaped to model the acoustics of human speech. A signal with a harmonic spectrum, similar to that produced by human vocal folds, is generated and filtered using resonator models to impose spectral peaks, known as formants, on the harmonic spectrum. Parameters such as periodic voicing, fundamental frequency, turbulence noise levels, formant frequencies and bandwidths, spectral tilt and the like are varied over time to generate the sound waveform emulating a sequence of speech sounds. In articulatory synthesis, an artificial glottal source signal, similar to that produced by human vocal folds, is filtered using computational models of the human vocal tract and of the articulatory processes that change the shape of the vocal tract to make speech sounds. Each of these TTS synthesis techniques typically involves representing the input text as a sequence of phonemes, and applying complex models (acoustic and/or articulatory) to generate output sound for each phoneme in its specific context within the sequence.
In addition to sometimes being used to fill in small gaps in CPR speech output, as illustrated in FIG. 1A, TTS synthesis is sometimes used to implement a system for synthesizing speech output that does not employ CPR at all, but rather uses only TTS to synthesize entire speech output prompts, as illustrated in FIG. 1B. FIG. 1B illustrates steps involved in conventional full concatenative TTS synthesis of the same desired speech output 110 that was synthesized using CPR techniques in FIG. 1A. In the TTS example of FIG. 1B, a developer of a speech-enabled application specifies the output prompt by programming the application to submit plain text input to a TTS engine. The example text input 150 is a plain text transcription of desired speech output 110, submitted to the TTS engine as, “Arriving at 221 Baker St. Please enjoy your visit.” The TTS engine typically applies language models to determine a sequence of phonemes corresponding to the text input, such as phoneme sequence 160. The TTS engine then applies further statistical models to select small audio files from a database, each small audio file corresponding to one of the phonemes (or a portion of a phoneme, such as a demiphone, or half-phone) in the sequence, and concatenates the resulting sequence of audio segments 170 in the proper sequence to form the speech output. The database typically contains a large number of phoneme audio files excised from long recordings of the speech of a voice talent. Each phoneme is typically represented by multiple audio files excised from different times the phoneme was uttered by the voice talent in different contexts (e.g., the phoneme /t/ could be represented by an audio file excised from the beginning of a particular utterance of the word “tall”, an audio file excised from the middle of an utterance of the word “battle”, an audio file excised from the end of an utterance of the word “pat”, two audio files excised from an utterance of the word “stutter”, and many others). Statistical models are used by the TTS engine to select the best match from the multiple audio files for each phoneme given the context of the particular phoneme sequence to be synthesized. The long recordings from which the phoneme audio files in the database are excised are typically made with the voice talent reading a generic script, unrelated to any particular speech-enabled application in which the TTS engine will eventually be employed.