The present invention relates generally to text-to-speech synthesis, and more particularly to text-to-speech synthesis in a communication system using native speech coding.
Radio communication devices, such as cellular phones, are no longer viewed as voice-only devices. With the advent of data-based wireless services available to consumers, some serious problems arise for conventional cellular phones. For example, cellular phones are currently only capable of presenting data services in text format on a small screen. This requires screen scrolling or other user manipulation in order to get the data or message. Also, compared to landline systems, a wireless system has a much higher data error rate and faces spectrum constraints, which make providing real-time streaming audio, i.e. real-audio, to cellular users impractical. One way to deal with these problems is text-to-speech encoding.
The process of converting text to speech is generally broken down into two major blocks: text analysis and speech synthesis. Text analysis is the process by which text is converted into a linguistic description that can be synthesized. This linguistic description generally consists of the pronunciation of the speech to be synthesized along with other properties that determine the prosody of the speech. These other properties can include (1) syllable, word, phrase, and clause boundaries; (2) syllable stress; (3) part-of-speech information; and (4) explicit representations of prosody such as are provided by the ToBI labeling system, as known in the art, and further described in 2nd International Conference on Spoken Language Processing (ICSLP92): TOBI: "A Standard for Labeling English Prosody", Silverman et al, (October 1992).
The pronunciation of speech included in the linguistic description is described as a sequence of phonetic units. These phonetic units are generally phones or phonics, which are particular physical speech sounds, or allophones, which are particular ways in which a phoneme may be expressed. (A phoneme is a speech sound perceived by the speakers of a language.) For example, the English phoneme "t" may be expressed as a closure followed by a burst, as a glottal stop, or as a flap. Each of these represents a different allophone of "t". Different sounds that may be produced when "t" is expressed as a flap represent different phonics. Other phonetic units that are sometimes used are demisyllables and diphones. Demisyllables are half-syllables and diphones are sequences of two phonics.
Speech can be synthesized from phonics using a rule-based system. For example, the phonetic component has target phoneme acoustic parameters (such as duration and intonation) for each segment type, and has rules for smoothing the parameter transitions between the segments. In a typical concatenation system, the phonetic component has a parametric representation of segments occurring in natural speech and concatenates these recorded segments, smoothing the boundaries between segments using predefined rules. The speech is then processed through a vocoder for transmission. Voice coders, such as vector-sum or code excited linear prediction (CELP) vocoders, are in general use in digital cellular communication devices. For example, U.S. Pat. No. 4,817,157, which is hereby incorporated by reference, describes such a vocoder implementation as used for the Global System for Mobile (GSM) communication system among others.
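By way of illustration only, the boundary smoothing mentioned above can be sketched as a linear interpolation of one acoustic parameter across the join between two concatenated segments. The function name smooth_boundary, the width parameter, and the choice of linear interpolation are assumptions made for this sketch; the smoothing rules of any particular prior-art system are not specified here.

```python
from typing import List


def smooth_boundary(left: List[float], right: List[float], width: int = 2) -> List[float]:
    """Concatenate two parameter tracks and smooth the join.

    Hypothetical rule: the last `width` values of `left` and the first
    `width` values of `right` are replaced by a straight line running
    between the two values at the edges of that window.
    """
    if width < 1 or width > min(len(left), len(right)):
        raise ValueError("width must be between 1 and the shorter segment length")

    track = list(left) + list(right)
    first = len(left) - width          # first index of the smoothing window
    last = len(left) + width - 1       # last index of the smoothing window
    start, end = track[first], track[last]
    steps = last - first

    for k in range(steps + 1):
        # Linear interpolation from `start` to `end` across the window.
        track[first + k] = start + (end - start) * k / steps
    return track


if __name__ == "__main__":
    # Two flat pitch-like tracks with an abrupt jump at the join.
    print(smooth_boundary([0.0, 0.0, 0.0], [30.0, 30.0, 30.0], width=2))
```

A wider window gives a more gradual transition at the cost of altering more of each recorded segment.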
Unfortunately, the text-to-speech process as described above is computationally complex and extensive. For example, in existing digital communication devices, vocoder technology already operates at the limits of the computational power of a device in order to maintain voice quality at its highest possible level. However, the text-to-speech process described above requires further signal processing in addition to the vocoder processing. In other words, the process of converting text to phonics, applying acoustic parameter rules for each phonic, concatenating to provide a voiced signal, and voice coding requires more processing power than voice coding alone.
Accordingly, there is a need for an improved text-to-speech coding system that reduces the amount of signal processing required to provide a voiced output. In particular, it would be of benefit to be able to use the existing native speech coding incorporated into a communication device. It would also be advantageous if current low-cost technology could be used without the requirement for customized hardware.
The present invention finds use in communication devices, such as radiotelephones, that have audio capabilities and can take advantage of text-to-speech conversion of text messages.
One aspect of the present invention uses an existing vocoder with a stored code table containing coded speech parameters for use in text-to-speech conversion. These native speech parameters in a communication device can be used without the need to create and store new speech parameters. Instead, the native parameters can be modified if and when needed, for example to provide more natural-sounding language.
Another aspect of the present invention involves dividing the text messages into phonics, spaces, and special characters, wherein white noise is used to emulate the spaces between words of text. This saves time and code processing for non-phonics, which do not contain any speech information.
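As a minimal sketch of the division described above, the following example separates a message into phonic candidates, spaces, and special characters, and produces a short white-noise frame for each space. The names classify_text and white_noise, the frame length, and the simplification of treating each run of letters as a single phonic candidate are assumptions for illustration; actual letter-to-sound conversion is outside the scope of this sketch.

```python
import random
from typing import List, Tuple


def classify_text(message: str) -> List[Tuple[str, str]]:
    """Split a message into (kind, value) tokens.

    Assumption: each run of letters is kept as one 'phonic' candidate;
    a real system would break it further into phonetic units.
    """
    tokens: List[Tuple[str, str]] = []
    run = ""
    for ch in message:
        if ch.isalpha():
            run += ch.lower()
            continue
        if run:                       # a letter run just ended
            tokens.append(("phonic", run))
            run = ""
        kind = "space" if ch.isspace() else "special"
        tokens.append((kind, ch))
    if run:
        tokens.append(("phonic", run))
    return tokens


def white_noise(num_samples: int, seed: int = 0) -> List[float]:
    """Return uniformly distributed noise samples in [-1.0, 1.0]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


if __name__ == "__main__":
    for kind, value in classify_text("hi, you"):
        if kind == "space":
            # Hypothetical frame length of 160 samples for each space.
            print(kind, len(white_noise(160)), "noise samples")
        else:
            print(kind, repr(value))
```

Because spaces are routed directly to noise generation, they never enter the phonic mapping stage.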
Another aspect of the present invention involves the division of text into phonics which can be mapped against native coded speech parameters used in existing communication systems. For example, each distinct phonic can be mapped with a memory location index of predefined phonics in a look-up table to point to a digitized wave file defining equivalent native coded speech parameters from the code table.
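The look-up described above can be sketched as two small tables: one mapping each phonic to a memory location index, and one holding the coded speech parameters stored at each index. All names and numeric values below (PHONIC_INDEX, CODE_TABLE, the three-value frames, and the use of index 0 as a fallback) are invented for illustration and do not represent the parameters of any actual vocoder.

```python
from typing import Dict, List

# Hypothetical code table: memory location index -> coded speech parameters.
# The values are placeholders, not real vocoder parameters.
CODE_TABLE: List[List[int]] = [
    [0, 0, 0],     # index 0: fallback entry (assumed, e.g. for unknown phonics)
    [12, 40, 7],   # index 1
    [33, 5, 19],   # index 2
]

# Hypothetical look-up table: phonic label -> memory location index.
PHONIC_INDEX: Dict[str, int] = {"h": 1, "ay": 2}


def lookup_parameters(phonics: List[str]) -> List[List[int]]:
    """Map each phonic to the coded speech parameters stored at its index.

    Phonics without an entry fall back to index 0 in this sketch.
    """
    return [CODE_TABLE[PHONIC_INDEX.get(p, 0)] for p in phonics]


if __name__ == "__main__":
    print(lookup_parameters(["h", "ay"]))
```

Since the parameters are already in the vocoder's native coded form, the result of the look-up can be passed to the decoder without a separate synthesis and encoding step.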