Speech coding is an important field of speech technology. The original speech signal is analog. The transmission of original speech signal takes a huge bandwidth and it is error prone. For several decades, coding methods and systems have been developed, to compress the speech signal to a low-bit-rate digital signal for transmission. The current status of the technology is summarized in a number of monographs, for example, Part C of “Springer Handbook of Speech Processing”, Springer Verlag 2007; and “Digital Speech”, Second Edition, by A. M. Kondoz, Wiley, 2004. There are several hundreds of patents and patent applications with “speech coding” in the title. The system of speech coding has two components. The encoder converts speech signal to a compressed digital signal. The decoder converts the compressed digital signal back into analog speech signal. The current technology for low bit rate speech coding is based on the following principles:
For encoding, first, speech signal is segmented into frames with a fixed duration. Second, a program determines whether a frame is voiced or unvoiced. Third, for voiced frames, find the pitch period in the frame. Fourth, extract the linear predictive code (LPC) of each frame. The voicedness index (voice or unvoiced), the pitch period, and LPC coefficients are then quantized to a limited number of bits, to become the encoded speech signal for transmission. In the decoding process, the voiced segments and the unvoiced segments are treated differently. For voiced segments, a string of pulses are generated according to the pitch period, and then filtered by the LPC based spectrum to generate the voiced sound. For unvoiced segments, a noise signal is generated, and then filtered by the LPC based spectrum to generate an unvoiced consonant. Because pitch period is a property of the frame, each frame must be longer than the maximum pitch period of human voice, which is typically 25 msec. The frame must be multiplied with a window function, typically a Hamming window function, to make the ends approximately matching. To ensure that no information is neglected, each frame must overlap with the previous frame and the following frame, with a typical frame shift of 10 msec.
The quality of LPC-based speech coding is limited by the intrinsic properties of the LPC coefficients, which is pitch-asynchronous, and has a rather small number of parameters because of non-converging behavior when the number of coefficients is increased. The usual limit is 10 to 16 coefficients. The quality of the LPC-based speech coding is always compared with the 8-kHz sample rate 8 bit voice signal, the so-called legacy telephone standard, toll quality speech signal, or narrow-band speech signal. Coming to the 21th century, all voice recording device and voice production device can provide CD-quality speech signal, with at least 32 kHz sample rate and 16 bit resolution. Toll-quality speech signal is considered poor. Speech coding should be able to generate quality comparable to the CD-quality speech signal.
It is well known that the voiced speech signal is pseudo-periodic, and the LPC coefficients become inaccurate at the onset time of a pitch period. To improve the quality of speech coding, pitch-synchronous speech coding has been proposed, researched and patented. See for example, R. Taori et al, “Speech Compression Using Pitch Synchronous Interpolation”, Proceedings of ICASSP-1995, vol. 1, pages 512-515; H. Yang et al., “Pitch Synchronous Multi-Band (PSMB) Speech Coding”, Proceedings of ICASSP-1995, vol. 1, page 516-519; C. Sturt et al., “LSF Quantization for Pitch Synchronous Speech Coders”, Proceedings of ICASSP-2003, vol. 2, pages 165-168; and U.S. Pat. No. 5,864,797 by M. Fujimoto, “Pitch-synchronous Speech Coding by Applying Multiple Analysis to Select and Align a Plurality of Types of Code Vectors”, Jan. 26, 1999. They showed that by using pitch-synchronous LPC coefficients or using pitch-synchronous multi-band coding, the quality can be improved.
In the two previous patents by the current applicant (U.S. Pat. No. 8,719,030 entitled “System and Method for Speech Synthesis”, U.S. Pat. No. 8,942,977 entitled “System and Method for Speech Recognition Using Pitch-Synchronous Spectral Parameters”), a pitch-synchronous segmentation scheme and a new mathematical representation, timbre vectors, are proposed, as an alternative to the fixed-window-size segmentation and LPC coefficients. The new methods enable the parameterization and reproduction of wide-band speech signal with high fidelity, thus provide a new method of speech coding, especially for CD-quality speech signals. The current patent application discloses systems and methods of speech coding using timbre vectors.