The invention relates generally to an arrangement which provides speech output and more particularly to an arrangement that combines recorded speech prompts with speech that is produced by a synthesizing technique.
Current applications requiring speech output, depending on the task, may use announcements or interactive prompts that are either recorded or generated by text-to-speech synthesis (TTS). Unit selection TTS techniques, such as those described in “Unit Selection in a Concatenative Speech Synthesis System Using Large Speech Database” by Hunt et al., Proc. IEEE Intl. Conf. Acoustic, Speech, Signal Processing, pp. 373-376, 1996, yield what is considered high-quality synthesis, but the results are nevertheless significantly less intelligible and natural than recorded speech. Recorded prompts are often preferred in situations where (a) a limited number of basically fixed prompts is required for the application and/or (b) the speech is required to be of very high quality. An example might be the initial welcoming prompt of an Interactive Voice Response (IVR) system, which introduces the system. TTS is used in situations where the vocabulary of an application is too large to be covered by recorded speech or where an IVR system needs to be able to respond in a very flexible way. One example might be a reverse telephone directory for name and address information.
The advantage of TTS lies in the almost infinite range of responses possible, the low cost, high efficiency, and flexibility of being able to experiment with a wide range of utterances (especially for rapid prototyping of a service). The main disadvantage is that quality is currently lower than that of recorded speech.
While recorded speech has the advantage of higher speech quality, its disadvantages are lack of flexibility, both short term and long term, low scalability, high storage requirements for recorded speech files, and the high cost of recording a high quality voice, especially if additional material may be required later.
Depending on the application requirements, the appropriateness of one or the other type of speech output will vary. Many applications attempt to compromise, or benefit from the best aspects of both, some by combining TTS with recorded prompts, some by adopting one of the following methods.
Limited domain synthesis is a technique for achieving high quality synthesis by specializing and carefully designing the recorded database. An example of a limited domain application might be weather report reading for a restricted geographical region. The system may also rely on constraining the structure of the output in order to achieve the quality gains desired. The approach is automated, and the quality gains are a function of the choice of domain and of the database.
Another method, on which much work has been done, is customization of automatic text-to-speech. This technique comes under the general heading of adding control or escape sequences to the text input, more recently called markup. Diphone synthesis systems frequently allow the user to insert special character sequences into the text to influence the way that things get spoken (often including an escape character, hence the name). The most obvious example of this would be where a different pronunciation of a word is desired compared with the system's default pronunciation. Markup can also be used to influence or override the prosodic treatment of sentences to be synthesized, e.g., to add emphasis to a word. Such systems basically fall into three categories: (a) system-specific escape or control sequences, which nearly all systems have; (b) standardized markup for synthesis, e.g., SSML (See SSML: A speech synthesis markup language, Speech Communication, Vol. 21, pp. 123-133, 1997, the entirety of which is incorporated herein by reference); and (c) more generally a kind of mode based on the type of a document or dialog schema, such as SALT (See SALT: a spoken language interface for web-based multimodal dialog system, Intl. Conf. on Spoken Language processing ICSLP 2002, pp. 2241-2244), which subsumes SSML.
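By way of illustration only, the two uses of markup mentioned above (adding emphasis to a word and overriding how a written form is spoken) might be sketched as follows. The sketch assumes the element names of the W3C Speech Synthesis Markup Language (emphasis with a level attribute, sub with an alias attribute), which may differ from the markup of the 1997 SSML reference cited above. The function name mark_up and the sample sentence are hypothetical, and the version and namespace attributes that a complete W3C SSML document requires are omitted.

```python
import xml.etree.ElementTree as ET


def mark_up(prefix: str, stressed: str, middle: str, written: str, spoken: str) -> str:
    """Build a small marked-up utterance (hypothetical helper, not from the source).

    `stressed` is wrapped in an emphasis element, and `written` is wrapped in a
    sub element whose alias tells the synthesizer what to say instead.
    """
    speak = ET.Element("speak")  # root element of the marked-up text
    speak.text = prefix

    # Prosodic override: request strong emphasis on one word.
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = stressed
    emphasis.tail = middle  # plain text that follows the emphasized word

    # Pronunciation override: speak `spoken` in place of the written form.
    substitute = ET.SubElement(speak, "sub", alias=spoken)
    substitute.text = written

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(mark_up("Turn ", "left", " at Elm ", "Dr.", "Drive"))
```

The text content of the result is still the plain sentence “Turn left at Elm Dr.”; the markup only annotates how the synthesizer is asked to render it.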
A block diagram of a typical concatenative TTS system is shown in FIG. 1. The first block 101 is the message text analysis module that takes ASCII message text and converts it to a series of phonetic symbols and prosody (fundamental frequency, duration, and amplitude) targets. The text analysis module actually consists of a series of modules with separate, but in many cases intertwined, functions. Input text is first analyzed, and non-alphabetic symbols and abbreviations are expanded to full words. For example, in the sentence “Dr. Smith lives at 4305 Elm Dr.”, the first “Dr.” is transcribed as “Doctor”, while the second one is transcribed as “Drive”. Next, “4305” is expanded to “forty three oh five”. Then, a syntactic parser (recognizing the part of speech for each word in the sentence) is used to label the text. One of the functions of syntax is to disambiguate the sentence constituent pieces in order to generate the correct string of phones, with the help of a pronunciation dictionary. Thus, for the above sentence, the verb “lives” is disambiguated from the (potential) noun “lives” (plural of “life”). If the dictionary look-up fails, general letter-to-sound rules are used (Dictionary rules module 103). Finally, with punctuated text and syntactic and phonological information available, a prosody module predicts sentence phrasing and word accents and, from those, generates targets, for example, for fundamental frequency, phoneme duration, and amplitude. The second block 110 in FIG. 1 assembles the units according to the list of targets set by the front-end. It is this block that is responsible for the innovation towards more natural sounding synthetic speech with reference to a store of sounds. Then the selected units are fed into a back-end speech synthesizer 120 that generates the speech waveform for presentation to the listener.
This known arrangement does not readily accommodate an arrangement in which TTS is combined with recorded prompts.
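By way of illustration only, the abbreviation and number expansion performed by the text analysis module described with reference to FIG. 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the function names, the rule that “Dr.” followed by a capitalized word is read as “Doctor” and otherwise as “Drive”, and the pairwise reading of four-digit numbers are all hypothetical simplifications. A practical front-end would rely on much larger tables and on the syntactic analysis described above.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(pair: str) -> str:
    """Read a two-digit string the way street numbers are usually spoken."""
    tens, ones = int(pair[0]), int(pair[1])
    if tens == 0:
        # "05" -> "oh five"; "00" -> "hundred" (assumed convention)
        return "oh " + DIGITS[ones] if ones else "hundred"
    if tens == 1:
        return TEENS[ones]
    return TENS[tens] + (" " + DIGITS[ones] if ones else "")


def expand_number(token: str) -> str:
    """Expand a digit string; four-digit numbers are read as two pairs."""
    if len(token) == 4 and token[0] != "0":
        return two_digits(token[:2]) + " " + two_digits(token[2:])
    # Fallback (assumption): read any other digit string digit by digit.
    return " ".join(DIGITS[int(d)] for d in token)


def normalize(text: str) -> str:
    """Expand "Dr." and digit strings in whitespace-separated text."""
    tokens = text.split()
    words = []
    for i, token in enumerate(tokens):
        if token == "Dr.":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Hypothetical heuristic: a title precedes a capitalized name.
            words.append("Doctor" if following[:1].isupper() else "Drive")
        elif token.isdigit():
            words.append(expand_number(token))
        else:
            words.append(token)
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 4305 Elm Dr."))
```

In this sketch the sentence-final period is consumed together with the abbreviation, so the sentence boundary would have to be recorded separately before the expanded words are passed on to the parsing and prosody stages.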