Text-to-speech (TTS) synthesis technology gives machines the ability to convert machine-readable text into audible speech. TTS technology is useful when a computer application needs to communicate with a person. Although recorded voice prompts often meet this need, this approach provides limited flexibility and can be very costly in high-volume applications. Thus, TTS is particularly helpful in telephone services, such as providing general business information (for example, stock quotes) and sports information, and reading e-mail or Web pages from the Internet over a telephone.
Speech synthesis is technically demanding since TTS systems must model generic and phonetic features that make speech intelligible, as well as idiosyncratic and acoustic features that make it sound human. Although written text includes phonetic information, vocal qualities that represent emotional states, moods, and variations in emphasis or attitude are largely unrepresented. For instance, the elements of prosody, which include register, accentuation, intonation, and speed of delivery, are rarely represented in written text. However, without these features, synthesized speech sounds unnatural and monotonous.
Generating speech from written text essentially involves two tasks: textual and linguistic analysis, followed by synthesis. The first task converts the text into a linguistic representation, which includes phonemes and their duration, the location of phrase boundaries, as well as pitch (fundamental-frequency) contours for each phrase. The second task, synthesis, generates an acoustic waveform or speech signal from the information provided by linguistic analysis.
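The linguistic representation described above can be pictured as a small data structure. The following is a minimal sketch only, not an actual TTS front end: the class names, the toy lexicon, the fixed 80 ms phoneme duration, and the falling 120 Hz to 100 Hz contour are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    symbol: str        # phoneme label
    duration_ms: int   # how long the phoneme is held


@dataclass
class Phrase:
    phonemes: List[Phoneme]
    pitch_contour_hz: List[float]  # one pitch target per phoneme


# Hypothetical toy lexicon; a real system would use a full
# pronunciation dictionary plus letter-to-sound rules.
TOY_LEXICON = {
    "what": ["W", "AH", "T"],
    "city": ["S", "IH", "T", "IY"],
    "please": ["P", "L", "IY", "Z"],
}


def analyze(text: str, default_ms: int = 80) -> List[Phrase]:
    """Convert text into phrases of phonemes with durations and pitch."""
    phrases = []
    # Treat punctuation as phrase boundaries (an illustrative simplification).
    normalized = text.lower().replace("?", ",").replace(".", ",")
    for chunk in normalized.split(","):
        words = chunk.split()
        if not words:
            continue
        # Words missing from the toy lexicon are silently skipped.
        phonemes = [
            Phoneme(symbol, default_ms)
            for word in words
            for symbol in TOY_LEXICON.get(word, [])
        ]
        n = len(phonemes)
        # Assumed contour: a straight-line fall from 120 Hz to 100 Hz.
        contour = [120.0 - 20.0 * i / max(n - 1, 1) for i in range(n)]
        phrases.append(Phrase(phonemes, contour))
    return phrases
```

Calling `analyze("What city, please?")` yields two phrases, one per phrase boundary. A synthesis stage would then consume the phonemes, durations, and pitch targets to produce a waveform.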
A block diagram of a conventional customer-care system 10 involving both speech recognition and generation within a telecommunication application is shown in FIG. 1. A user 12 typically inputs a voice signal 22 to the automated customer-care system 10. The voice signal 22 is analyzed by an automatic speech recognition (ASR) subsystem 14. The ASR subsystem 14 decodes the words spoken and feeds these into a spoken language understanding (SLU) subsystem 16.
The task of the SLU subsystem 16 is to extract the meaning of the words. For instance, the words “I need the telephone number for John Adams” imply that the user 12 wants operator assistance. A dialog management subsystem 18 then preferably determines the next action that the customer-care system 10 should take, such as determining the city and state of the person to be called, and instructs a TTS subsystem 20 to synthesize the question “What city and state please?” This question is then output from the TTS subsystem 20 as a speech signal 24 to the user 12.
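The flow of a single dialog turn through the four subsystems can be sketched as a chain of functions. This is an illustrative stand-in only: every function name is hypothetical, the voice signal is represented by its transcript, the keyword test replaces real language understanding, and the synthesized speech is represented by a tagged string rather than audio.

```python
def recognize(voice_signal: str) -> str:
    """Stand-in for the ASR subsystem 14; the 'signal' is already text."""
    return voice_signal.lower().strip()


def understand(words: str) -> str:
    """Stand-in for the SLU subsystem 16: map words to an intent label."""
    if "telephone number" in words:
        return "operator_assistance"
    return "unknown"


def next_prompt(intent: str) -> str:
    """Stand-in for the dialog management subsystem 18."""
    if intent == "operator_assistance":
        return "What city and state please?"
    return "Sorry, I did not understand."


def synthesize(prompt: str) -> str:
    """Stand-in for the TTS subsystem 20; returns a placeholder, not audio."""
    return "<speech>" + prompt + "</speech>"


def handle_turn(voice_signal: str) -> str:
    """Run one user turn through ASR, SLU, dialog management, and TTS."""
    return synthesize(next_prompt(understand(recognize(voice_signal))))
```

For the utterance used in the text, `handle_turn("I need the telephone number for John Adams")` returns the placeholder for the question "What city and state please?".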
There are several different methods to synthesize speech, but each method can be categorized as one of three types: articulatory synthesis, formant synthesis, or concatenative synthesis. Articulatory synthesis uses computational biomechanical models of speech production, such as models of a glottis, which generate periodic and aspiration excitation, and a moving vocal tract. Articulatory synthesizers are typically controlled by simulated muscle actions of the articulators, such as the tongue, lips, and glottis. The articulatory synthesizer also solves time-dependent three-dimensional differential equations to compute the synthetic speech output. However, in addition to high computational requirements, articulatory synthesis does not result in natural-sounding fluent speech.
Formant synthesis uses a set of rules for controlling a highly simplified source-filter model that assumes that the source or glottis is independent from the filter or vocal tract. The filter is determined by control parameters, such as formant frequencies and bandwidths. Formants are associated with a particular resonance, which is characterized as a peak in a filter characteristic of the vocal tract. The source generates either stylized glottal or other pulses for periodic sounds, or noise for aspiration. Formant synthesis generates intelligible, but not completely natural-sounding speech, and has the advantages of low memory and moderate computational requirements.
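The source-filter idea can be illustrated with a short sketch in which an impulse train stands in for the stylized glottal pulses and a cascade of two-pole digital resonators stands in for the vocal-tract filter. The function names, the 8 kHz sample rate, and the formant frequencies and bandwidths in the usage note are assumptions for illustration, and the coefficient formulas are one common resonator form rather than a method taken from the text.

```python
import math
from typing import List, Tuple


def impulse_train(f0_hz: float, sample_rate: int, n_samples: int) -> List[float]:
    """Source: one unit pulse per pitch period (a stylized glottal source)."""
    period = int(round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x: List[float], freq_hz: float, bw_hz: float,
              sample_rate: int) -> List[float]:
    """Filter: two-pole resonator, y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at zero frequency
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz: float, formants: List[Tuple[float, float]],
                     sample_rate: int = 8000, n_samples: int = 800) -> List[float]:
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, sample_rate, n_samples)
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz, sample_rate)
    return signal
```

As a usage example, `synthesize_vowel(100.0, [(700.0, 100.0), (1200.0, 100.0)])` produces 800 samples of a vowel-like buzz with resonances near 700 Hz and 1200 Hz. Because the source and the filter are computed separately, changing the pitch leaves the formants untouched, which is the independence assumption named above.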
Concatenative synthesis uses portions of recorded speech that are cut from recordings and stored in an inventory or voice database, either as uncoded waveforms, or encoded by a suitable speech coding method. Elementary units or speech segments are, for example, phones, which are vowels or consonants, or diphones, which are phone-to-phone transitions that encompass a second half of one phone and a first half of the next phone. Diphones can thus be thought of as transitions between adjacent sounds, such as vowel-to-consonant transitions.
Concatenative synthesizers often use demi-syllables, which are half-syllables or syllable-to-syllable transitions, and apply the diphone method to the time scale of syllables. The corresponding synthesis process then joins units selected from the voice database, and, after optional decoding, outputs the resulting speech signal. Since concatenative systems use portions of pre-recorded speech, this method is most likely to sound natural.
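The joining step can be sketched for the diphone case as follows. This is a simplified illustration under stated assumptions: the function names, the `_` silence symbol, and the naming of units as `"a-b"` strings are hypothetical, the inventory is assumed to hold uncoded sample lists, and no smoothing is applied at the joins.

```python
from typing import Dict, List


def to_diphones(phones: List[str]) -> List[str]:
    """Turn a phone sequence into diphone names, padded with silence '_'."""
    padded = ["_"] + list(phones) + ["_"]
    return [left + "-" + right for left, right in zip(padded, padded[1:])]


def concatenate(phones: List[str], inventory: Dict[str, List[float]]) -> List[float]:
    """Join the stored units for each diphone, in order, into one signal."""
    signal: List[float] = []
    for name in to_diphones(phones):
        # Raises KeyError if the voice database lacks a required unit.
        signal.extend(inventory[name])
    return signal
```

For example, `to_diphones(["k", "ae", "t"])` gives `["_-k", "k-ae", "ae-t", "t-_"]`, so a three-phone word needs four units from the voice database.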
Each of the portions of original speech has an associated prosody contour, which includes the pitch and duration with which it was uttered by the speaker. However, when small portions of natural speech arising from different utterances in the database are concatenated, the resulting synthetic speech may still differ substantially from natural-sounding prosody, which is instrumental in the perception of intonation and stress in a word.
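One way to see why concatenated units can sound unnatural is to measure the pitch jump at each join. The sketch below assumes each unit carries a pitch contour as a list of values in hertz; the function name and the sample values are hypothetical.

```python
from typing import List


def join_pitch_jumps(unit_contours: List[List[float]]) -> List[float]:
    """Absolute pitch difference (Hz) across each boundary between units."""
    return [
        abs(right[0] - left[-1])
        for left, right in zip(unit_contours, unit_contours[1:])
    ]
```

With three units taken from different utterances, such as `[[110.0, 115.0], [140.0, 138.0], [112.0, 108.0]]`, the pitch jumps by 25 Hz and 26 Hz at the two joins even though each unit is smooth on its own.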
Despite the existence of these differences, the speech signal 24 output from the conventional TTS subsystem 20 shown in FIG. 1 is readily recognizable by speech recognition systems. Although this may at first appear to be an advantage, it actually results in a significant drawback that may lead to security breaches, misappropriation of information, and loss of data integrity.
For instance, assume that the customer-care system 10 shown in FIG. 1 is an automated banking system 11 as shown in FIG. 2, and that the user 12 has been replaced by an automated interactive voice response (IVR) system 13, which utilizes speech recognition to interface with the TTS subsystem 20 and synthesized speech generation to interface with the speech recognition subsystem 14. Speaker-dependent recognition systems require a training period to adjust to variations between individual speakers. However, all speech signals 24 output from the TTS subsystem 20 are typically in the same voice, and thus appear to the IVR system 13 to be uttered by the same person, which further facilitates its recognition process.
By integrating the IVR system 13 with an algorithm to collect and/or modify information obtained from the automated banking system 11, potential security breaches, credit fraud, misappropriation of funds, unauthorized modification of information, and the like could easily be carried out on a grand scale. In view of the foregoing considerations, a method and system are called for to address the growing demand for securing access to information available from TTS systems.