Modern applications often require generation of voice messages containing a sequence of words. The voice messages may be generated from stored speech signal segments. Each signal segment corresponds to one of a plurality of individual words or phrases in a defined dictionary. Typically, the signal segments are digitally sampled versions of spoken words, stored in a computer readable memory. The segments are concatenated to form the complete voice message. The dictionary varies from application to application, but typically contains words or phrases that may be combined with other words or phrases in the dictionary to produce a large variety of meaningful voice messages. For example, the dictionary may contain the spoken numerals 0-9; the letters of the alphabet A-Z; common words; or any combination of these.
Systems used in providing telephony services generate voice messages containing spoken telephone numbers in response to a caller directory inquiry. Similar systems may be used to generate voice messages containing spoken versions of zip or postal codes; spelled names or words; monetary amounts (for example "two dollars and eight cents"); or the like. Telephone "caller identification" devices may use such systems to speak the phone number of a caller. Voice mail systems likewise generate messages composed of system-produced voice messages and user-recorded messages.
Present systems that generate voice messages typically do so by producing a signal formed by sequentially reproducing stored signal segments corresponding to each individual word or phrase in a dictionary. The stored segments are typically independent and are formed by sampling unrelated recordings of the words and phrases in the dictionary. Each reproduced signal segment is spaced from the next by a signal segment corresponding to a gap of silence or a pause. In a generated voice message, the pauses allow a listener to perceive a connection between the end of one word and the beginning of the next. However, the use of pauses, combined with the use of signal segments corresponding to unrelated spoken words, causes the generated voice message to sound staccato and unnatural.
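The conventional approach described above can be illustrated with a minimal sketch. It is not taken from any particular known system: the function names, the 8000 samples-per-second rate, and the 100 ms pause length are all assumptions chosen for illustration, and the signal segments are represented simply as lists of integer samples.

```python
# Illustrative sketch only: names, sample rate, and pause length are assumed.
SAMPLE_RATE = 8000  # samples per second (assumed)
PAUSE_MS = 100      # fixed pause between words, in milliseconds (assumed)


def make_silence(duration_ms, sample_rate=SAMPLE_RATE):
    """Return a signal segment of zero-valued samples lasting duration_ms."""
    return [0] * (sample_rate * duration_ms // 1000)


def build_message(words, dictionary, pause_ms=PAUSE_MS):
    """Concatenate the stored segment for each word, separated by fixed pauses.

    `dictionary` maps each word or phrase to its stored list of samples.
    """
    pause = make_silence(pause_ms)
    message = []
    for i, word in enumerate(words):
        if i > 0:
            # Each segment is spaced from the next by a gap of silence.
            message.extend(pause)
        message.extend(dictionary[word])
    return message
```

In such a sketch, every word contributes the same stored samples regardless of its neighbours, and every word boundary adds the same fixed gap, which is the source of the staccato effect noted above.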
One solution to address the problem of staccato speech has involved storing signal segments corresponding to several versions of each word or phrase in a dictionary. Each version has a different intonation. In one implementation, for example, an automated directory assistance service uses signal segments corresponding to three versions of each numeral from 0-9 to generate voice messages containing spoken digits of telephone numbers. Signal segments corresponding to versions of each digit having a rising, falling, and level intonation are stored. Depending on whether a digit is generated at the beginning, end or middle of a sequence of digits, signal segments corresponding to the version of the digit having rising, falling or level intonation, respectively, are used. A resulting voice message containing a sequence of digits sounds more natural to the listening ear. The listener perceives the unrelated digits as being related by their relative intonation. However, such a system, like other known systems, produces the sequence of words from signal segments corresponding to individual, substantially unrelated, words. Again, fixed pauses are generated between words.
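The position-dependent selection described above might be sketched as follows. This is a hedged illustration rather than the method of any actual directory assistance service: the function names are hypothetical, and the handling of a single-digit sequence (treated here as the end of the sequence) is an assumption, since the description does not address that case.

```python
# Illustrative sketch only: function names and the single-digit rule are assumed.

def pick_intonation(index, length):
    """Choose which stored version of a digit to use from its position."""
    if index == length - 1:
        return "falling"  # end of the sequence (also a lone digit, by assumption)
    if index == 0:
        return "rising"   # beginning of the sequence
    return "level"        # middle of the sequence


def plan_sequence(digits):
    """Return (digit, intonation) pairs identifying the segments to reproduce."""
    return [(d, pick_intonation(i, len(digits))) for i, d in enumerate(digits)]
```

For example, `plan_sequence("4167")` selects the rising version of "4", the level versions of "1" and "6", and the falling version of "7". Each selected segment is still an independent recording, so the pairs would be reproduced with fixed pauses between them, as before.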
These known systems ignore the natural interrelation between adjacent words in a sequence of spoken words. As noted, the speech produced by these systems sounds somewhat unnatural. Moreover, because gaps of silence of fixed duration are typically generated between individual words, the produced voice message is somewhat longer than a naturally spoken sequence of words. Even if the gaps are extremely short, the transitions to and from the gaps are time consuming and contribute to the unnatural sounding speech.
The present invention attempts to overcome some of the disadvantages of known systems.