The present invention relates to systems for encoding and decoding human speech. In particular, the present invention relates to voice messaging systems. Even more specifically, the present invention relates to integrated voice/data communication/storage systems, wherein reasonably high bandwidth (e.g. 4800 or 9600 baud) digital channels are available.
In voice messaging systems, a transmitter and a receiver are separated in space, time, or both. That is, a speech message is coded at a transmitter station, and the bits corresponding to the encoded speech can then be stored in the transmitter or in a peripheral of the transmitter, to be recalled and regenerated into synthetic speech later, or can be transmitted to a remote receiver location, to be regenerated into human speech immediately or at a later time. In other words, the present invention applies to systems wherein a transmitter station and a receiver station are connected by a data channel, regardless of whether the transmitter and receiver are separated in space, in time, or both.
A typical linear predictive coding (LPC) baseband speech coding system is outlined in FIG. 1. The present invention teaches a significant modification and improvement of such a system. After LPC spectral parameters (such as the reflection coefficients k.sub.i or the inverse filter coefficients a.sub.k) have been extracted from a speech input, the speech input is filtered by the LPC analysis filter to generate the residual error signal. That is, the LPC model, as usually simplified, models each input sample as a linear combination of previous input samples with some excitation function: $$s_n = \sum_{k=1}^{N} a_k s_{n-k} + u_n$$ where u.sub.n is the excitation function. While the average value of the series u.sub.n will be approximately 0, the time series u.sub.n contains important information. That is, the linear predictive coding model is not a perfect model, and significant useful information is not completely modeled by the LPC parameters, and therefore remains in the residual signal u.sub.n. The model order N places some limitation on the fit of the LPC model, but, in any useful speech application, some information remains in the residual signal u.sub.n rather than in the LPC parameters.
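The relation between the input samples, the LPC parameters, and the residual can be illustrated with a minimal sketch. It assumes the common sign convention in which each sample equals the weighted sum of the N previous samples plus the excitation u.sub.n, and it treats samples before the start of the signal as zero. The function names `lpc_residual` and `lpc_synthesize` are hypothetical and are chosen only for this illustration.

```python
def lpc_residual(samples, coeffs):
    """Analysis (inverse) filter: u_n = s_n - sum_k a_k * s_{n-k}.

    coeffs[0] is a_1, coeffs[1] is a_2, and so on (assumed convention).
    Samples before n = 0 are treated as zero.
    """
    residual = []
    for n, s_n in enumerate(samples):
        prediction = 0.0
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                prediction += a_k * samples[n - k]
        residual.append(s_n - prediction)
    return residual


def lpc_synthesize(residual, coeffs):
    """Synthesis filter: s_n = sum_k a_k * s_{n-k} + u_n."""
    out = []
    for n, u_n in enumerate(residual):
        prediction = 0.0
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                prediction += a_k * out[n - k]
        out.append(prediction + u_n)
    return out


# A first-order signal driven by a single impulse leaves only that
# impulse in the residual.
signal = [1.0, 0.5, 0.25, 0.125]
print(lpc_residual(signal, [0.5]))  # [1.0, 0.0, 0.0, 0.0]
```

Feeding the residual back through `lpc_synthesize` with the same coefficients returns the original samples, which is why the residual, however it is approximated, can serve as the excitation of the synthesis filter at the receiver.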
The LPC model can intuitively be thought of as modeling the actual function of the human voice. That is, the human voice can be considered as an excitation function (either an impulse train generated by the larynx, or white noise generated during unvoiced speech) applied to a passive acoustic filter, corresponding to the acoustic characteristics of the vocal tract. In general, the characteristics of the passive acoustic filter (i.e. the resonance and damping characteristics of mouth, chest, etc.) will be modeled by the LPC parameters, while the characteristics of the excitation function will generally appear in the residual time series u.sub.n.
The phonemic characteristics of speech will typically change at a very slow rate, and the acoustic frequency-domain characteristics will change nearly as slowly. Thus, a frame rate is normally chosen to track the acoustic changes in speech over relatively long periods. For example, the frame rate is typically chosen to be somewhere in the neighborhood of 100 Hz, and the acoustic frequency-domain characteristics of the speech signal can be treated as essentially constant over the width of any one frame. By contrast, the speech must be sampled at a Nyquist rate corresponding to the acoustic bandwidth which must be measured. Thus, a typical sampling rate would be 8 kilohertz, so that eighty samples would be found in each frame. The crucial advantage of LPC models is that while the input time series changes once every sample, the LPC parameters change once every frame. The residual series u.sub.n also changes once per sample, but it contains less information than the input time series s.sub.n, and can usually be efficiently modeled at some reduced data rate.
The residual time series u.sub.n can be crudely described using the following information: RMS energy; a voicing bit, to indicate whether the current frame is voiced or unvoiced; and a pitch period, to define the spacing of a train of impulses during periods of voiced speech. During periods of unvoiced speech, the excitation function shows very broad frequency characteristics, and can be fairly well modeled as white noise.
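As a rough illustration of this crude description, the sketch below builds one frame of excitation from the three frame-rate parameters named above: RMS energy, a voicing bit, and a pitch period in samples. The function name `crude_excitation`, the unit-height impulses, and the use of Gaussian noise for unvoiced frames are assumptions made for the example and are not taken from the source.

```python
import math
import random


def crude_excitation(n_samples, rms, voiced, pitch_period, seed=0):
    """Build one frame of excitation from frame-rate parameters only.

    voiced=True  -> impulse train with one impulse every pitch_period samples
    voiced=False -> white (Gaussian) noise, seeded for repeatability
    The frame is then scaled so that its RMS value equals `rms`.
    """
    if voiced:
        frame = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        rng = random.Random(seed)
        frame = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    current_rms = math.sqrt(sum(x * x for x in frame) / n_samples)
    if current_rms == 0.0:
        return frame  # nothing to scale
    scale = rms / current_rms
    return [x * scale for x in frame]


# 80 samples per frame (8 kHz sampling, 100 Hz frame rate) and a pitch
# period of 20 samples, i.e. a 400 Hz pitch.
frame = crude_excitation(80, rms=100.0, voiced=True, pitch_period=20)
print(sum(1 for x in frame if x != 0.0))  # 4
```

Only three numbers per frame describe the excitation here, which is the source of both the compactness and the limited quality of this approach.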
This approximation to the residual time series u.sub.n is very compact, since now all features of the sample-rate input signal s.sub.n have been converted to frame-rate parameters. This provides good data compaction, which is highly desirable for any speech encoding system.
However, this simple speech encoding scheme is not adequate for voice messaging systems. In voice messaging systems, a large number of applications are highly sensitive to speech quality. For example, it has been frequently remarked in the literature, for many years, that introduction of voice mail systems into office environments would provide major improvements in white-collar productivity. However, user acceptance of voice messaging systems is very sensitive to quality, since no businessman is likely to routinely use a system which makes his voice sound ludicrous to the person who receives his message. Prior art systems have had many difficulties in satisfying this quality requirement. The other horn of the dilemma is economic, since two factors must be conserved: processor load and data efficiency. If voice encoding is to be performed by microcomputer-based systems in ordinary offices, the processor load for encoding and decoding must be reasonably small. Similarly, if voice messages are to be easily stored and transmitted, their data efficiency (seconds of speech per kilobyte) must be high.
Thus it is an object of the present invention to provide a voice messaging system wherein the quality of speech reproduced is high.
It is a further object of the present invention to provide a voice messaging system wherein a small processor load is imposed.
It is a further object of the present invention to provide a voice messaging system wherein quality of speech is high and a small processor load is imposed.
It is a further object of the present invention to provide a voice messaging system wherein the data efficiency is high.
It is a further object of the invention to provide a voice messaging system wherein the data efficiency is high and the quality of speech produced is very good.
It is a further object of the present invention to provide a voice messaging system wherein the processor load is low, the data efficiency is high, and the quality of speech reproduced is very good.
To achieve high-quality speech, it is necessary to include more information from the residual time series u.sub.n than simply the pitch, energy, and voicing. A Fourier transform of the residual time series u.sub.n is quite adequate. However, this provides more information than is required. It has been found in the prior art that good quality speech can be reproduced by encoding only a fraction of the full bandwidth of the residual signal u.sub.n, and then expanding this fractional bandwidth signal (which is known as the baseband) to provide a full-bandwidth excitation signal at the receiver. In baseband coding methods, the residual signal u.sub.n is transformed to the frequency domain by taking its FFT (fast Fourier transform). A certain number of low frequency samples of the FFT, called the baseband, are selected. This baseband information is encoded and transmitted to the receiver along with pitch, gain, voicing and the LPC parameters. Since only a small segment of the residual frequency spectrum is transmitted to the receiver, the receiver must first construct a reasonable approximation to the full band residual signal. This approximate residual signal u.sub.n can then be used as the excitation function for the LPC synthesis filter. The process of generating the missing higher frequencies in the excitation function at the receiver is usually referred to as high frequency regeneration.
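The baseband selection step can be sketched as follows. The example assumes an 80-sample frame at 8 kHz, so that the transform bins are spaced 100 Hz apart, and it uses a direct discrete Fourier transform written with the standard library in place of a true FFT. The function names `dft` and `extract_baseband` are hypothetical, and the sketch does not cover how the selected samples are quantized or encoded.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform, a stand-in for an FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def extract_baseband(residual, sample_rate_hz, baseband_hz):
    """Return the low-frequency transform samples up to baseband_hz.

    Also returns the bin spacing in Hz so the caller can map bins to
    frequencies.
    """
    spectrum = dft(residual)
    bin_hz = sample_rate_hz / len(residual)
    last_bin = int(baseband_hz // bin_hz)
    return spectrum[: last_bin + 1], bin_hz


# A residual frame containing a single 300 Hz component.
frame = [math.cos(2 * math.pi * 300 * t / 8000) for t in range(80)]
baseband, bin_hz = extract_baseband(frame, 8000, 1000)
peak = max(range(len(baseband)), key=lambda k: abs(baseband[k]))
print(len(baseband), bin_hz, peak)  # 11 100.0 3
```

Only 11 of the 41 non-redundant transform samples of the frame are retained in this example, which is the bandwidth saving that high frequency regeneration must then compensate for at the receiver.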
There are several techniques for high frequency regeneration. One of the simplest techniques is to "copy up" the baseband to higher frequency bands. That is, for example, where a 1000 Hz baseband is used, each signal frequency f.sub.k in the baseband would also be copied up to provide the same signal strength at frequencies f.sub.k +1000, f.sub.k +2000, etc., to regenerate an excitation signal at the receiver. The present invention teaches an improvement in such copying-up methods of high frequency regeneration in baseband speech coding.
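A minimal sketch of copying up, operating on a list of frequency-domain samples, is given below. It assumes a bin spacing of 100 Hz, so that a 1000 Hz baseband occupies bins 0 through 9 and each bin is repeated 10 bins higher. The function name `copy_up` is hypothetical, and phase handling and spectral shaping are deliberately ignored.

```python
def copy_up(baseband, n_bins):
    """Regenerate n_bins spectral samples by repeating the baseband.

    The value at baseband bin k is reused at bins k + W, k + 2W, ...,
    where W is the number of baseband bins.
    """
    width = len(baseband)
    return [baseband[k % width] for k in range(n_bins)]


# Baseband of 10 bins (0 to 900 Hz) with energy only at bin 3 (300 Hz).
baseband = [0.0] * 10
baseband[3] = 1.0
full = copy_up(baseband, 40)  # 40 bins cover 0 to 3900 Hz
print([k * 100 for k, v in enumerate(full) if v])  # [300, 1300, 2300, 3300]
```

The 300 Hz component reappears at 1300, 2300, and 3300 Hz, which matches the f.sub.k +1000, f.sub.k +2000 pattern described above.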
See Vishwanathan et al, "Design of a Robust Baseband LPC Coder for Speech Transmission Over a 9.6Kb/sec Noisy Channel", IEEE Transactions on Communications, vol. 30, page 663 (1982), and Kang et al, "Multirate Processor", Naval Research Laboratory Report, September 1978; both of which are hereby incorporated by reference.
The prior art high frequency regeneration process produces undesirable characteristics in the synthesized speech. When the available harmonics at the low frequencies are copied up and substituted for the higher harmonics which were originally present in the excitation, the translated harmonics will not always be located at integer multiples of the fundamental pitch frequency. Also, there will typically be phase offset errors between the various copied-up bands. This results in an inappropriate harmonic relation between the strong frequencies in the regenerated high frequency residual portions and the baseband residual portion. This effect, usually called pitch incongruence or harmonic offset, is perceived as annoying background pitches superimposed on the voice message being processed. It is most pronounced for high-pitched speakers, and is unacceptable in an office-quality voice messaging system.
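The harmonic offset can be made concrete with a short arithmetic sketch. It assumes a fixed 1000 Hz copy-up shift and compares a 300 Hz pitch with a 250 Hz pitch; these values are illustrative and are not taken from the source. The function names `copied_harmonics` and `offset_from_harmonic` are hypothetical.

```python
def copied_harmonics(pitch_hz, shift_hz, top_hz):
    """Frequencies at which baseband pitch harmonics land after copy-up.

    Only harmonics strictly below shift_hz are treated as part of the
    baseband; they are translated by shift_hz, 2 * shift_hz, and so on,
    up to top_hz.
    """
    base = list(range(pitch_hz, shift_hz, pitch_hz))
    copied = []
    shift = shift_hz
    while shift < top_hz:
        copied.extend(f + shift for f in base if f + shift <= top_hz)
        shift += shift_hz
    return copied


def offset_from_harmonic(freq_hz, pitch_hz):
    """Distance in Hz from freq_hz to the nearest multiple of pitch_hz."""
    remainder = freq_hz % pitch_hz
    return min(remainder, pitch_hz - remainder)


for pitch in (300, 250):
    landed = copied_harmonics(pitch, 1000, 2000)
    print(pitch, landed, [offset_from_harmonic(f, pitch) for f in landed])
# 300 [1300, 1600, 1900] [100, 100, 100]
# 250 [1250, 1500, 1750] [0, 0, 0]
```

With a 300 Hz pitch every translated harmonic lies 100 Hz away from a true harmonic, whereas with a 250 Hz pitch, which divides the 1000 Hz shift evenly, the translated harmonics fall exactly on multiples of the pitch.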
Thus, it is an object of the present invention to provide a system which can perform baseband speech encoding and decoding without pitch incongruence.
It is a further object of the present invention to provide a speech coding system which can regenerate high-quality speech without pitch incongruence, and which requires only minimal bandwidth for encoding of the residual signal.
It is a further object of the present invention to provide an economical speech coding system without pitch incongruence.
The present invention teaches a variable width baseband coding scheme. During each frame of the input speech, an estimate of the pitch of the input speech is obtained in addition to the LPC parameters. Using this pitch information, the actual width of the baseband for each frame is determined to be the width (as close as possible to the nominal baseband width) which contains an integral number of multiples of the fundamental pitch frequency.
In addition, the bottom edge of the baseband (the first transmitted FFT sample) is chosen to be the FFT sample closest to the fundamental pitch. By this means, subharmonic pitches, spurious pitches, and low-frequency broadband noise cannot exercise an undue influence on the copying up process.
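The per-frame choice of baseband width and bottom edge can be sketched as below. The sketch assumes that the width is obtained by rounding the ratio of nominal width to pitch to the nearest integer, with at least one harmonic retained, and that the bottom edge is the transform bin nearest the pitch frequency. The source does not state a rounding or tie-breaking rule, so these details, along with the function name `variable_baseband`, are assumptions.

```python
def variable_baseband(pitch_hz, nominal_width_hz, bin_hz):
    """Choose a baseband width that is a whole number of pitch multiples.

    Returns (n_harmonics, width_hz, first_bin), where
      n_harmonics is the number of pitch multiples spanned by the baseband,
      width_hz    is n_harmonics * pitch_hz, the multiple nearest the nominal width,
      first_bin   is the transform bin nearest the fundamental pitch.
    Python's round() sends exact halves to the even integer; the source
    does not say how such ties should be resolved.
    """
    n_harmonics = max(1, round(nominal_width_hz / pitch_hz))
    width_hz = n_harmonics * pitch_hz
    first_bin = round(pitch_hz / bin_hz)
    return n_harmonics, width_hz, first_bin


print(variable_baseband(300, 1000, 100))  # (3, 900, 3)
print(variable_baseband(220, 1000, 100))  # (5, 1100, 2)
```

For a 300 Hz pitch the shift becomes 900 Hz instead of the nominal 1000 Hz, so a harmonic at 300 Hz is translated to 1200 Hz, and one at 600 Hz to 1500 Hz, both of which are again multiples of the pitch.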
The present invention requires that the pitch of the speech signal be tracked. This can be done in a variety of ways, as will be discussed below.
According to the present invention, there is provided:
A system for encoding an input speech signal, comprising:
an LPC analysis filter, said analysis filter extracting linear predictive coding (LPC) parameters and a corresponding residual signal from said input speech signal;
a pitch estimator, said pitch estimator extracting a pitch frequency from said speech signal;
means for filtering said residual signal to discard frequencies in said residual signal above a baseband frequency, said baseband frequency being selected to be an integral multiple of said pitch frequency; and
an encoder, said encoder encoding information corresponding to said LPC parameters, and to said filtered residual signal.