The present invention relates to electronic communications, and more particularly to digital speech coding methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. One digital speech method, linear prediction (LP), models the vocal track as a filter with an excitation to mimic human speech. In this approach only the parameters of the filter and the excitation of the filter are transmitted across the communication channel (or stored), and a synthesizer regenerates the speech with the same perceptual characteristics as the input speech. Periodic updating of the parameters requires fewer bits than direct representation of the speech signal, so a reasonable LP vocoder can operate at bits rates as low as 2-3 Kb/s (kilobits per second), whereas the public telephone system uses 64 Kb/s (8-bit PCM codewords at 8,000 samples per second).
A speech signal can be roughly divided into voiced and unvoiced regions. The voiced speech is periodic with a varying level of periodicity, but the unvoiced speech does not display any apparent periodicity and has a noisy character. Transitions between voiced and unvoiced regions as well as temporary sound outbursts (e.g., plosives like “p” or “t”) are neither periodic nor clearly noise-like. In low-bit rate speech coding, applying different techniques to various speech regions can result in increased efficiency and perceptually more accurate signal representation. In coders which use linear prediction, the linear LP-synthesis filter is used to generate output speech. The excitation for the LP-synthesis filter models the LP-analysis residual to maintain speech characteristics: it is periodic for voiced speech, noise-like for unvoiced segments, and neither for transitions or plosives.
In a Code Excited Linear Prediction (CELP) coder, the LP excitation is generated as a sum of a pitch synthesis-filter output (sometimes implemented as an entry in an adaptive codebook) and a innovation sequence. The pitch-filter (adaptive codebook) models the periodicity of voiced speech. Sparse codebooks can efficiently encode pulses in tracks for excitation synthesis; see Peng et al, U.S. Pat. No. 6,236,960. The unvoiced segments are generated from a fixed codebook which contains stochastic vectors. The codebook entries are selected based on the error between input (target) signal and synthesized speech making CELP a waveform coder. T. Moriya and M. Honda, Speech Coder Using Phase Equalization and Vector Quantization, Proc. IEEE ICASSP 1701 (1986) and U.S. Pat. No. 4,850,022, describe a phase equalization filtering to take advantage of perceptual redundancy in slowly varying phase characteristics and thereby reduce the number of bits required for coding.
In Mixed Excitation Linear Prediction (MELP) coder, the LP excitation is encoded as a superposition of periodic and non-periodic components. The periodic part is generated from waveforms, each representing a pitch period, encoded in the frequency domain. The non-periodic part consists of noise generated based on signal correlations in individual frequency bands. The MELP-generated voiced excitation contains both (periodic and non-periodic) components while the unvoiced excitation is limited to the non-periodic component. The coder parameters are encoded based on an error between parameters extracted from input speech and parameters used to synthesize output speech making MELP a parametric coder. The MELP coder, like other parametric coders, is very good at reconstructing the strong periodicity of steady voiced regions. It is able to arrive at a good representation of a strongly periodic signal quickly and adjusts well to small variations present in the signal. It is, however, less effective at modeling non-periodic speech segments like transitions, plosive sounds, and unvoiced regions. The CELP coder, on the other hand, by matching the target waveform directly, seems to do better than MELP at representing irregular features of speech. CELP is capable of maintaining strong signal periodicity but, at low bit-rates it takes longer to “build up” a good representation of periodic speech. A CELP coder is also less effective at matching small variations of strongly periodic signals.
Combining a parametric coder with a waveform coder generates problems of making the two work together. In known methods, the initial phase (time-shift) of the parametric coder is estimated based on past samples of the synthesized signal. When the waveform coder is to be used, its target-vector is shifted based on the drift between synthesized and input speech. The solution works well for some types of input but it is not robust: it may easily break when the system attempts to switch frequently between coders, particularly in voiced regions.
Gersho et al, U.S. Pat. No. 6,233,550, provide a hybrid coder with three speech classifications and coding models: steady-state voiced (harmonic), stationary unvoiced (noise-like), and “transitory” or “transition” speech.
However, the speech output from such hybrid coders at about 4 kb/s is not yet an acceptable substitute for toll-quality speech in many applications