The present invention relates to speech systems and more particularly to a system for encoding speech signals into a compact representation that includes speech segments and prosodic parameters that permits accurate and natural sounding playback.
Speech analysis systems include speech recognition systems and speech synthesis systems. Automatic speech recognition systems, also known as speech-to-text systems, include a computer (hardware and software) that analyzes a speech signal and produces a textual representation of the speech signal. FIG. 1 illustrates a functional block diagram of a prior art automatic speech recognition system. An automatic speech recognition system can include an analog-to-digital (A/D) converter 10 for digitizing the analog speech signal, a speech analyzer 12 and a language analyzer 14. Initially, the system stores a dictionary including a pattern (i.e., digitized waveform) and textual representation for each of a plurality of speech segments (i.e., vocabulary). These speech segments may include words, syllables, diphones, etc. The speech analyzer divides the speech into a plurality of segments, and compares the patterns of each input segment to the segment patterns in the known vocabulary using pattern recognition or pattern matching in attempt to identify each segment.
Language analyzer 14 uses a language model, which is a set of principles describing language use, to construct a textual representation of the received speech segments. In other words, the speech recognition system uses a combination of pattern recognition and sophisticated guessing based on some linguistic and contextual knowledge. For example, certain word sequences are much more likely to occur than others. The language analyzer may work with the speech analyzer to identify words or resolve ambiguities between different words or word spellings. However, due to a limited vocabulary and other system limitations, a speech recognition system can guess incorrectly. For example, a speech recognition system receiving a speech signal having an unfamiliar accent or unfamiliar words may incorrectly guess several words, resulting in a textual output which can be unintelligible.
One proposed speech recognition system is disclosed in Alex Waibel, "Prosody and Speech Recognition, Research Notes In Artificial Intelligence," Morgan Kaufman Publishers, 1988 (ISBN 0-934613-70-2).
Waibel discloses a speech-to-text system (such as an automatic dictation machine) that extracts prosodic information or parameters from the speech signal to improve the accuracy of text generation. Prosodic parameters associated with each speech segment may include, for example, the pitch (fundamental frequency F.sub.0) of the segment, duration of the segment, and amplitude (or stress or volume) of the segment. Waibel's speech recognition system is limited to the generation of an accurate textual representation of the speech signal. After generating the textual representation of the speech signal, any prosodic information that was extracted from the speech signal is discarded. Therefore, a person or system receiving the textual representation output by a speech-to-text system will know what was said, but will not know how it was said (i.e., pitch, duration, rhythm, intonation, stress).
Similarly, as illustrated in FIG. 2, speech synthesis systems exist for converting text to synthesized speech, and can include, for example, a language synthesizer 16, a speech synthesizer 18 and a digital-to-analog (D/A) converter 20. Speech synthesizers use a plurality of stored speech segments and their associated representation (i.e., vocabulary) to generate speech by, for example, concatenating the stored speech segments. However, because no information is provided with the text as to how the speech should be generated (i.e., pitch, duration, rhythm, intonation, stress), the result is typically an unnatural or robot sounding speech. As a result, automatic speech recognition (speech-to-text) systems and speech synthesis (text-to-speech) systems may not be effectively used for the encoding, storing and transmission of natural sounding speech signals. Moreover, the areas of speech recognition and speech synthesis are separate disciplines. Speech recognition systems and speech synthesis systems are not typically used together to provide for a complete system that includes both encoding an analog signal into a digital representation and then decoding the digital representation to reconstruct the speech signal. Rather, speech recognition systems and speech synthesis are employed independently of one another, and therefore, do not typically share the same vocabulary and language model.
A functional block diagram of a prior art system which may be used for encoding, storage and transmission of audio signals is illustrated in FIG. 3. An audio signal, which may include a speech signal, is digitized by an A/D converter 22. A compressor/decompressor (codec) 24 compresses the digitized audio signal by, for example, removing superfluous or unnecessary information. The digitized audio may be transmitted over a transmission medium 26. At the receiving end, the signal is decompressed by a codec 28 and converted to an analog signal by a D/A converter 30 for output to a speaker 32. Even though the system of FIG. 3 can provide excellent speech rendering, this technique requires a relatively high bit rate (bandwidth) for transmission and a very large storage capacity for storing the digitized speech information, and provides no flexibility.
Therefore, a need has arisen for a speech system that provides a compact representation of a speech signal for efficient transmission, storage, etc., and which permits accurate (i.e., what was said) and natural sounding (i.e., how it was said) reconstruction of the speech signal.