The invention relates to electronic devices, and, more particularly, to speech coding, transmission, storage, and synthesis circuitry and methods.
The performance of digital speech systems at low bit rates has become increasingly important for current and foreseeable digital communications. One digital speech method, linear prediction (LP), models the vocal tract as a filter driven by an excitation to mimic human speech. In this approach only the parameters of the filter and of its excitation are transmitted across the communication channel (or stored), and a synthesizer regenerates speech with the same perceptual characteristics as the input speech. Periodic updating of the parameters requires fewer bits than direct representation of the speech signal, so a reasonable LP vocoder can operate at bit rates as low as 2–3 Kb/s (kilobits per second), whereas the public telephone system uses 64 Kb/s (8-bit PCM codewords at 8,000 samples per second). See, for example, McCree et al., A 2.4 Kbit/s MELP Coder Candidate for the New U.S. Federal Standard, Proc. IEEE ICASSP 200 (1996), and U.S. Pat. No. 5,699,477.
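The filter-plus-excitation model and the bit-rate comparison above can be illustrated with a minimal Python sketch. The function names, the sign convention for the predictor coefficients `a`, and the zero initial filter state are assumptions made for illustration; they are not taken from the cited references.

```python
def pcm_bit_rate(bits_per_sample, samples_per_second):
    """Bit rate, in bits per second, of a direct PCM representation."""
    return bits_per_sample * samples_per_second


def lp_residual(speech, a):
    """LP analysis (assumed convention): e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    residual = []
    for n, s in enumerate(speech):
        pred = sum(a[k] * speech[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(s - pred)
    return residual


def lp_synthesize(excitation, a):
    """LP synthesis, the inverse of lp_residual: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a[k] * out[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(e + pred)
    return out


telephone_rate = pcm_bit_rate(8, 8000)   # 8-bit codewords at 8,000 samples/s
vocoder_rate = 2400                      # 2.4 Kb/s, within the 2-3 Kb/s range
rate_ratio = telephone_rate / vocoder_rate
```

Driving `lp_synthesize` with the unquantized residual from `lp_residual` reproduces the input up to rounding; a vocoder saves bits by transmitting only a compact description of the coefficients and of the excitation instead.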
The speech signal can be roughly divided into voiced and unvoiced regions. Voiced speech is periodic, with a varying level of periodicity. Unvoiced speech does not display any apparent periodicity and has a noisy character. Transitions between voiced and unvoiced regions, as well as temporary sound outbursts (e.g., plosives like “p” or “t”), are neither periodic nor clearly noise-like. In low-bit-rate speech coding, applying different techniques to different speech regions can increase efficiency and yield a perceptually more accurate signal representation. In coders which use linear prediction, the linear LP-synthesis filter is used to generate output speech. The excitation of the LP-synthesis filter models the LP-analysis residual, which maintains the speech characteristics: it is periodic for voiced speech, noise-like for unvoiced segments, and neither for transitions or plosives. In the Code Excited Linear Prediction (CELP) coder, the LP excitation is generated as a sum of a pitch synthesis-filter output (sometimes implemented as an entry in an adaptive codebook) and an innovation sequence. The pitch filter (adaptive codebook) models the periodicity of voiced speech. The unvoiced segments are generated from a fixed codebook which contains stochastic vectors. The codebook entries are selected based on the error between the input (target) signal and the synthesized speech, making CELP a waveform coder. T. Moriya and M. Honda, “Speech Coder Using Phase Equalization and Vector Quantization”, Proc. IEEE ICASSP 1701 (1986), describe phase equalization filtering that takes advantage of perceptual redundancy in slowly varying phase characteristics and thereby reduces the number of bits required for coding.
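The CELP excitation structure described above (adaptive-codebook contribution plus a fixed-codebook innovation, selected by comparing synthesized speech against the target) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the search is sequential with an unquantized least-squares gain per stage, the synthesis filter starts from zero state, and the perceptual weighting used in practical CELP coders is omitted.

```python
def synthesize(excitation, a):
    """All-pole LP synthesis from zero state: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a[k] * out[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(e + pred)
    return out


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def best_gain(target, candidate):
    """Least-squares gain that scales candidate toward target."""
    energy = dot(candidate, candidate)
    return dot(target, candidate) / energy if energy > 0.0 else 0.0


def celp_encode_subframe(target, adaptive_vec, fixed_codebook, a):
    """Return (fixed index, pitch gain, innovation gain, excitation)."""
    # Stage 1: adaptive-codebook (pitch) contribution.
    y_p = synthesize(adaptive_vec, a)
    g_p = best_gain(target, y_p)
    remaining = [t - g_p * y for t, y in zip(target, y_p)]

    # Stage 2: pick the fixed-codebook vector whose synthesized output
    # leaves the smallest squared error against the remaining target.
    best = None
    for index, vec in enumerate(fixed_codebook):
        y_c = synthesize(vec, a)
        g_c = best_gain(remaining, y_c)
        err = sum((r - g_c * y) ** 2 for r, y in zip(remaining, y_c))
        if best is None or err < best[0]:
            best = (err, index, g_c)

    _, index, g_c = best
    excitation = [g_p * p + g_c * c
                  for p, c in zip(adaptive_vec, fixed_codebook[index])]
    return index, g_p, g_c, excitation
```

Because the error is measured on the synthesized waveform rather than on the parameters, the selected excitation tracks the target sample by sample, which is what makes this a waveform-matching scheme.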
Sub-frame pitch and multistage vector quantization is described in A. McCree and J. DeMartin, “A 1.7 kb/s MELP Coder with Improved Analysis and Quantization”, Proc. IEEE ICASSP 593–596 (1998).
In the Mixed Excitation Linear Prediction (MELP) coder, the LP excitation is encoded as a superposition of periodic and non-periodic components. The periodic part is generated from waveforms, each representing a pitch period, encoded in the frequency domain. The non-periodic part consists of noise generated based on signal correlations in individual frequency bands. The MELP-generated voiced excitation contains both the periodic and the non-periodic components, while the unvoiced excitation is limited to the non-periodic component. The coder parameters are encoded based on an error between parameters extracted from the input speech and parameters used to synthesize the output speech, making MELP a parametric coder. The MELP coder, like other parametric coders, is very good at reconstructing the strong periodicity of steady voiced regions. It arrives quickly at a good representation of a strongly periodic signal and adjusts well to small variations present in the signal. It is, however, less effective at modeling non-periodic speech segments such as transitions, plosive sounds, and unvoiced regions. The CELP coder, on the other hand, by matching the target waveform directly, seems to do better than MELP at representing irregular features of speech. It is capable of maintaining strong signal periodicity but, at low bit rates, it takes CELP longer to “build up” a good representation of periodic speech. The CELP coder is also less effective at matching small variations of strongly periodic signals.
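The superposition of periodic and non-periodic components can be pictured with a deliberately simplified time-domain sketch. It is not the MELP algorithm: the periodic part here is a plain impulse train rather than frequency-domain pitch-period waveforms, the noise is full-band rather than shaped per frequency band, and the single `voicing` strength, the `seed` parameter, and the function names are assumptions introduced for illustration.

```python
import random


def pulse_train(length, pitch_period):
    """One unit pulse every pitch_period samples (the periodic component)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def mixed_excitation(length, pitch_period, voicing, seed=0):
    """Blend periodic pulses and noise.

    voicing is a hypothetical strength in [0, 1]; pitch_period=None marks an
    unvoiced frame, whose excitation is the non-periodic component only.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    if pitch_period is None:
        return noise
    pulses = pulse_train(length, pitch_period)
    return [voicing * p + (1.0 - voicing) * w for p, w in zip(pulses, noise)]


voiced = mixed_excitation(160, 40, voicing=0.8)       # pulses plus some noise
unvoiced = mixed_excitation(160, None, voicing=0.0)   # noise only
```

Only a handful of values (pitch period, voicing strength, gain) describe such an excitation, which illustrates why a parametric coder spends few bits on steady voiced speech but has no means to reproduce the exact waveform of a transition or plosive.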
These observations suggest that using both CELP and MELP (waveform and parametric) coders to represent a speech signal would provide many benefits, as each coder seems to be better at representing different speech regions. The MELP coder might be most effectively used in periodic regions, and the CELP coder might be best for unvoiced regions, transitions, and other non-periodic segments of speech. For example, D. L. Thomson and D. P. Prezas, “Selective Modeling of the LPC Residual During Unvoiced Frames; White Noise or Pulse Excitation,” Proc. IEEE ICASSP, (Tokyo), 3087–3090 (1986) describes an LPC vocoder with a multipulse waveform coder; W. B. Kleijn, “Encoding Speech Using Prototype Waveforms,” 1 IEEE Trans. Speech and Audio Proc., 386–399 (1993) describes a CELP coder with the Prototype Waveform Interpolation coder; and E. Shlomot, V. Cuperman, and A. Gersho, “Combined Harmonic and Waveform Coding of Speech at Low Bit Rates,” Proc. IEEE ICASSP (Seattle), 585–588 (1998) describes a CELP coder with a sinusoidal coder.
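One way to picture the division of labor suggested above is a per-frame switch driven by a periodicity measure. The sketch below uses the peak of the normalized autocorrelation over a range of candidate pitch lags; the measure, the 0.7 threshold, the lag range, and the function names are all assumptions chosen for illustration and are not drawn from the cited hybrid coders.

```python
import math


def normalized_autocorr(frame, lag):
    """Correlation between the frame and itself delayed by lag samples."""
    x = frame[lag:]
    y = frame[:-lag]
    num = sum(p * q for p, q in zip(x, y))
    den = math.sqrt(sum(p * p for p in x) * sum(q * q for q in y))
    return num / den if den > 0.0 else 0.0


def choose_coder(frame, min_lag, max_lag, threshold=0.7):
    """Route strongly periodic frames to the parametric coder, others to CELP."""
    peak = max(normalized_autocorr(frame, lag)
               for lag in range(min_lag, max_lag + 1))
    return "MELP" if peak >= threshold else "CELP"


# A steady tone with a 20-sample period stands in for strongly voiced speech.
periodic_frame = [math.sin(2.0 * math.pi * n / 20.0) for n in range(160)]
# A single burst stands in for a plosive-like, non-periodic frame.
burst_frame = [1.0] + [0.0] * 159
```

With these inputs, `choose_coder(periodic_frame, 16, 40)` selects the parametric coder and `choose_coder(burst_frame, 16, 40)` selects the waveform coder.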
Combining a parametric coder with a waveform coder raises the problem of making the two work together. In known methods, the initial phase (time-shift) of the parametric coder is estimated based on past samples of the synthesized signal. When the waveform coder is to be used, its target vector is shifted based on the drift between synthesized and input speech. This solution works well for some types of input, but it is not robust: it may easily break when the system attempts to switch frequently between coders, particularly in voiced regions.
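The target-vector shift used by the known methods can be sketched as a two-step operation: estimate the drift between input and synthesized speech, then read the waveform coder's target from the input at the drifted position. The unnormalized cross-correlation search, the `max_shift` bound, zero padding outside the signal, and the function names are assumptions made for illustration.

```python
def estimate_drift(input_speech, synthesized, max_shift):
    """Return the shift d that maximizes sum_n input[n + d] * synthesized[n]."""
    best_shift, best_score = 0, float("-inf")
    for d in range(-max_shift, max_shift + 1):
        score = 0.0
        for n, s in enumerate(synthesized):
            m = n + d
            if 0 <= m < len(input_speech):
                score += input_speech[m] * s
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift


def shift_target(input_speech, start, length, drift):
    """Take the target vector from the input, displaced by the estimated drift."""
    begin = start + drift
    return [input_speech[i] if 0 <= i < len(input_speech) else 0.0
            for i in range(begin, begin + length)]


# Toy signals: the synthesized pulse sits 3 samples earlier than the input pulse.
input_speech = [0.0] * 32
input_speech[10] = 1.0
synthesized = [0.0] * 32
synthesized[7] = 1.0

drift = estimate_drift(input_speech, synthesized, max_shift=5)
target = shift_target(input_speech, start=7, length=4, drift=drift)
```

A single correlation peak of this kind is easy to find for a clean pulse. The fragility noted above arises when the estimate must be refreshed at every switch between coders and the correlation in voiced speech has several near-equal peaks one pitch period apart.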
In short, the speech output from such hybrid vocoders at about 4 kb/s is not yet an acceptable substitute for toll-quality speech in many applications.