In the traditional segment-based speech coding, the messages of prosody corresponding to speech segments are usually directly encoded with quantitative methods over prosodic features, without considering the use of prosodic model with linguistic meanings for performing parameterized prosody coding. Some methods of the mentioned traditional speech coding are performed with the corresponding duration and speech pitch contour of the phonemes in the syllables. The coding is to use pre-stored representative duration and grouping templates of pitch contour of the phonemes in the syllables as the duration and the pitch contour of the phonemes in the syllables, but not consider the prosody generating model. The coded speech with the mentioned method is hard to be applied to prosodic transformation thereto.
Coding to pitch contour is to use the linear segments of the pitch contour to represent the values thereof. The messages of the pitch contour are represented with the slope as well as endpoint values of those linear segments. Representative linear segment templates are stored in a codebook, which is used for the coding to pitch contour. The method is simple, but without considering the prosody generating model. The coded speech with the mentioned method is hard to be applied to prosodic transformation thereto.
There is a method of scalar quantization to the pitch contour of phoneme, which is to use the average pitch and the slope of the phoneme to represent the pitch contour of the phoneme, and to perform scalar quantization to the average pitch and the slope of the phrase, which does not consider the prosody generating model. The coded speech with the mentioned method of scalar quantization is hard to be applied to prosodic transformation thereto.
Another method is to normalize the duration and the average pitch of phoneme by subtracting the average duration and average pitch contour of the corresponding phoneme type from observed value of the duration and the pitch contour and finally performing scalar quantization to the normalized phoneme duration and the pitch contour. Such a method may reduce the transmission data rate. Doing without considering the prosody generating model, the coded speech with the mentioned method is hard to be applied to prosodic transformation thereto.
One another method is to segment the speech into segments of different number of frames, each of which has a pitch contour represented by the average pitch of the frame, while an energy contour is represented with vector quantization, without considering the prosody generating model. The coded speech with the mentioned method is hard to be applied to prosodic transformation thereto.
There is also a method of piecewise linear approximation (PLA) for use to represent the pitch. The PLA information includes the pitch value and time information of the endpoints of the segment and the pitch value and time information of the critical points. Some articles introduce scalar quantization for representing those messages, while use vector quantization for representing the PLA information. Some articles introduce traditional method of frame-based speech coder, which performs quantization to the pitch information of each frame and may accurately indicate the pitch information, but suffers high data rate.
Some articles introduce the method of quantizing the pitch contour of a segment with pitch contour templates stored in the codebook and encoding the templates. The method may encode the pitch information with very low data rate, but with higher distortion.
The encoding process of the prior arts can be summarized as below: (1) segmentation of the speech into segments; and (2) encoding of the spectrum and the prosodic information of the segments. Usually, for one segment, the corresponding phoneme, syllable or the acoustic unit defined by the system can be obtained. The segmentation can be performed by automatic speech recognition or can be done by forced alignment given known phoneme, syllable or the acoustic unit defined by the system. Then, each segment is encoded with the spectrum information and prosodic message thereof.
On the other hand, the reconstruction of the encoded speech by the segment-based speech encoder includes the following steps: (1) decoding and reconstruction of the spectrum and prosodic information; and (2) speech synthesis.
Most of the prior art technologies pay more attention on the encoding of spectrum information, but less on the aspect of the encoding of prosodic information. The prior art often encodes the prosodic information by means of quantization, without considering the model behind the prosodic information, and therefore hard to obtained lower encoding data rate and to perform speech transformation for the encoded speech by systematic methods.
In order to overcome the drawbacks in the prior art, a speech-synthesizing device, and more particularly to a streaming encoder, prosody information encoding device, prosody-analyzing device and device and method for speech synthesizing is provided. The novel design in the present invention not only solves the problems described above, but also is easy to be implemented. Thus, the present invention has the utility for the industry.