The present invention relates to voice messaging systems, wherein pitch and LPC parameters (and usually other excitation information too) are encoded for transmission and/or storage, and are decoded to provide a close replication of the original speech input.
The present invention also relates to speech recognition and encoding systems, and to any other system wherein it is necessary to estimate the pitch of the human voice.
The present invention is particularly related to linear predictive coding (LPC) systems for (and methods of) analyzing or encoding human speech signals. In LPC modeling generally, each sample in a series of samples is modeled (in the simplified model) as a linear combination of preceding samples, plus an excitation function:

$$ s_k = \sum_{j=1}^{N} a_j s_{k-j} + u_k $$

where s.sub.k is the k-th sample, the a.sub.j are the LPC predictor coefficients, N is the model order, and u.sub.k is the LPC residual signal. That is, u.sub.k represents the residual information in the input speech signal which is not predicted by the LPC model. Note that only N prior samples are used for prediction. The model order (typically around 10) can be increased to give better prediction, but some information will always remain in the residual signal u.sub.k for any normal speech modeling application.
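As an illustration only, the residual of the LPC model described above can be obtained by subtracting the predicted sample from the actual sample. The following Python sketch assumes the predictor coefficients are already known. The function name lpc_residual and the choice to treat samples before the start of the series as zero are assumptions of this example and are not part of the described system.

```python
def lpc_residual(samples, coeffs):
    """Return u_k = s_k - sum_j a_j * s_(k-j) for every sample.

    coeffs[j-1] holds a_j. Samples before index 0 are treated as zero,
    which is an assumption of this sketch.
    """
    residual = []
    for k, s_k in enumerate(samples):
        predicted = 0.0
        for j, a_j in enumerate(coeffs, start=1):
            if k - j >= 0:
                predicted += a_j * samples[k - j]
        residual.append(s_k - predicted)
    return residual


# A first-order series s_k = 0.5 * s_(k-1), started by one unit excitation.
series = [1.0, 0.5, 0.25, 0.125]
print(lpc_residual(series, [0.5]))
```

For this series the first-order predictor is exact, so the residual consists of the single excitation sample followed by zeros. For real speech the predictor is never exact, and the residual carries the excitation (pitch) information.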
Within the general framework of LPC modeling, many particular implementations of voice analysis can be selected. In many of these, it is necessary to determine the pitch of the input speech signal. That is, in addition to the formant frequencies, which in effect correspond to resonances of the vocal tract, the human voice also contains a pitch, modulated by the speaker, which corresponds to the frequency at which the larynx modulates the airstream. In other words, the human voice can be considered as an excitation function applied to an acoustic passive filter, and the excitation function will generally appear in the LPC residual function, while the characteristics of the passive acoustic filter (i.e., the resonance characteristics of mouth, nasal cavity, chest, etc.) will be modeled by the LPC parameters. It should be noted that during unvoiced speech, the excitation function does not have a well-defined pitch, but instead is best modeled as broadband white noise or pink noise.
Estimation of the pitch period is not completely trivial. Among the problems is the fact that the first formant will often occur at a frequency close to that of the pitch. For this reason, pitch estimation is often performed on the LPC residual signal, since the LPC estimation process in effect deconvolves vocal tract resonances from the excitation information, so that the residual signal contains relatively less of the vocal tract resonances (formants) and relatively more of the excitation information (pitch). However, such residual-based pitch estimation techniques have their own difficulties. The LPC model itself will normally introduce high frequency noise into the residual signal, and portions of this high frequency noise may have a higher spectral density than the actual pitch which should be detected. One prior art solution to this difficulty is simply to low pass filter the residual signal at around 1000 Hz. This removes the high frequency noise, but also removes the legitimate high frequency energy which is present in the unvoiced regions of speech, and renders the residual signal virtually useless for voicing decisions.
A cardinal criterion in voice messaging applications is the quality of speech reproduced. Prior art systems have had many difficulties in this respect. In particular, many of these difficulties relate to problems of accurately detecting the pitch and voicing of the input speech signal.
It is typically very easy to incorrectly estimate a pitch period at twice or half its value. For example, if correlation methods are used, a good correlation at a period P guarantees a good correlation at period 2P, and also means that the signal is more likely to show a good correlation at period P/2. However, such doubling and halving errors produce very annoying degradation in voice quality. For example, erroneous halving of the pitch period will tend to produce a squeaky voice, and erroneous doubling of the pitch period will tend to produce a coarse voice. Moreover, pitch period doubling or halving is very likely to occur intermittently, so that the synthesized voice will tend to crack or to grate, intermittently.
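The ambiguity between a period P and a period 2P can be seen in a small numerical experiment. The following Python sketch is illustrative only: the test signal, the period of 40 samples, and the function name normalized_correlation are assumptions of this example.

```python
import math


def normalized_correlation(signal, lag):
    """Normalized correlation between the signal and itself delayed by lag."""
    a = signal[lag:]
    b = signal[:len(signal) - lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


# Hypothetical periodic signal with a true period of P = 40 samples.
P = 40
signal = [math.sin(2 * math.pi * n / P) + 0.5 * math.sin(4 * math.pi * n / P)
          for n in range(400)]

for lag in (P, 2 * P):
    print(lag, round(normalized_correlation(signal, lag), 3))
```

Both lags give a correlation of essentially one, so a correlation score alone cannot distinguish the true period from its double. Some additional constraint, such as continuity with neighboring frames, is needed to choose between them.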
Thus, it is an object of the present invention to provide a voice messaging system wherein errors of pitch period doubling and halving are avoided.
It is a further object of the present invention to provide a voice messaging system wherein voices are not reproduced with erroneous squeaky, cracking, coarse, or grating qualities.
A related difficulty in prior art voice messaging systems is voicing errors. If a section of voiced speech is incorrectly determined to be unvoiced, the reproduced speech will sound as though it was whispered rather than spoken speech. If a section of unvoiced speech is incorrectly estimated to be voiced, the regenerated speech in this section will show a buzzing quality.
Thus, it is an object of the present invention to provide a voice messaging system, wherein voicing errors are avoided.
It is a further object of the present invention to provide a voice messaging system wherein spurious buzz and dropouts do not appear in the reconstituted speech.
The pitch usually varies fairly smoothly across frames. In the prior art, tracking of pitch across frames has been attempted, but the interrelation of the pitch and voicing decisions can pose difficulties. That is, where the voicing decision is made separately, the voicing and pitch decisions must still be reconciled. Thus, this method poses a heavy processor load.
It is a further object of the invention to provide a voice messaging system wherein pitch is tracked consistently with respect to plural frames in the sequence of frames, without imposing a heavy processor load.
It is a further object of the present invention to provide a voice messaging system wherein voicing decisions are made consistently across a sequence of frames.
It is a further object of the present invention to provide a voice messaging system wherein pitch and voicing decisions are made consistently across a sequence of frames, without imposing a heavy processor load.
The present invention uses an adaptive filter to filter the residual signal. By using a time-varying filter which has a single pole at the first reflection coefficient (k.sub.1) of the speech input, the high frequency noise is removed from the voiced periods of speech, but the high frequency information in the unvoiced speech periods is retained. The adaptively filtered residual signal is then used as the input for the pitch decision.
It is necessary to retain the high frequency information in the unvoiced speech periods to permit better voicing/unvoicing decisions. That is, the "unvoiced" voicing decision is normally made when no strong pitch is found, that is when no correlation lag of the residual signal provides a high normalized correlation value. However, if only a low-pass filtered portion of the residual signal during unvoiced speech periods is tested, this partial segment of the residual signal may have spurious correlations. That is, the danger is that the truncated residual signal which is produced by the fixed low-pass filter of the prior art does not contain enough data to reliably show that no correlation exists during unvoiced periods, and the additional band width provided by the high-frequency energy of unvoiced periods is necessary to reliably exclude the spurious correlation lags which might otherwise be found.
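The "unvoiced" decision rule described above (no correlation lag gives a high normalized correlation value) can be sketched as follows. This Python example is illustrative only: the threshold of 0.5, the lag range of 20 to 160 samples, the synthetic test signals, and the function names are all assumptions of this example.

```python
import math
import random


def normalized_correlation(signal, lag):
    """Normalized correlation between the signal and itself delayed by lag."""
    a = signal[lag:]
    b = signal[:len(signal) - lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def voicing_decision(residual, min_lag=20, max_lag=160, threshold=0.5):
    """Return (voiced, best_lag, best_score) for one block of residual."""
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        score = normalized_correlation(residual, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    # Declare "unvoiced" when no lag reaches the (hypothetical) threshold.
    return best_score >= threshold, best_lag, best_score


# Full-bandwidth noise stands in for an unvoiced residual.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
# A pulse train with a period of 50 samples stands in for a voiced residual.
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]

print(voicing_decision(noise)[0], voicing_decision(pulses)[:2])
```

With the full noise bandwidth available, no lag correlates strongly and the block is classified as unvoiced. Smoothing the noise with a fixed low-pass filter would reduce the number of effectively independent samples, which is why spurious correlation peaks become more likely in that case.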
Thus, it is an object of the present invention to provide a method for filtering high-frequency noise out during voiced speech periods, without making erroneous voicing decisions during unvoiced speech periods.
It is a further object of the invention to provide a voice messaging system which does not make erroneous high-frequency pitch assignments during voiced speech periods, and which also does not make erroneous voicing decisions during unvoiced speech periods.
It is a further object of the present invention to provide a system for making pitch and voicing estimates of speech which disregards high-frequency noise during voiced speech segments and which uses high-frequency information during unvoiced speech segments.
Improvement in pitch and voicing decisions is particularly critical for voice messaging systems, but is also desirable for other applications. For example, a word recognizer which incorporated pitch information would naturally require a good pitch estimation procedure. Similarly, pitch information is sometimes used for speaker verification, particularly over a phone line, where the high frequency information is partially lost. Moreover, for long-range future recognition systems, it would be desirable to be able to take account of the syntactic information which is denoted by pitch. Similarly, a good analysis of voicing would be desirable for some advanced speech recognition systems, e.g., speech to text systems.
Thus, it is a further object of the present invention to provide a method for making optimal pitch decisions in a series of frames of input speech.
It is a further object of the present invention to provide a method for making optimal voicing decisions in a sequence of frames of input speech.
It is a further object of the present invention to provide a method for making optimal pitch and voicing decisions in a sequence of frames of input speech.
The first reflection coefficient k.sub.1 is approximately related to the high/low frequency energy ratio of a signal. See R. J. McAulay, "Design of a Robust Maximum Likelihood Pitch Estimator for Speech and Additive Noise," Technical Note 1979-28, Lincoln Labs, June 11, 1979, which is hereby incorporated by reference. For k.sub.1 close to -1, there is more low frequency energy in the signal than high-frequency energy, and vice versa for k.sub.1 close to 1. Thus, by using k.sub.1 to determine the pole of a 1-pole deemphasis filter, the residual signal is low pass filtered in the voiced speech periods and is high pass filtered in the unvoiced speech periods. This means that the formant frequencies are excluded from computation of pitch during the voiced periods, while the necessary high-bandwidth information is retained in the unvoiced periods for accurate detection of the fact that no pitch correlation exists.
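A minimal sketch of such a k.sub.1-controlled one-pole filter is given below in Python. The text does not state the exact recursion or sign convention, so this example assumes k.sub.1 = -r(1)/r(0), where r is the autocorrelation of the frame, and the recursion y[n] = x[n] - k.sub.1 y[n-1]. That choice was made only so that k.sub.1 near -1 yields the low-pass behavior described above; an actual implementation may differ. The function names are hypothetical.

```python
def first_reflection_coefficient(frame):
    """k1 = -r(1) / r(0), under the sign convention assumed in this sketch."""
    r0 = sum(x * x for x in frame)
    r1 = sum(frame[n] * frame[n - 1] for n in range(1, len(frame)))
    return -r1 / r0 if r0 > 0 else 0.0


def adaptive_deemphasis(residual, k1):
    """One-pole recursion y[n] = x[n] - k1 * y[n-1] (assumed form)."""
    out, prev = [], 0.0
    for x in residual:
        prev = x - k1 * prev
        out.append(prev)
    return out


# Two extreme test inputs: slowly varying (constant) and rapidly alternating.
slow = [1.0] * 200
fast = [1.0 if n % 2 == 0 else -1.0 for n in range(200)]

print(first_reflection_coefficient(slow), first_reflection_coefficient(fast))

# With k1 = -0.9 the filter passes the slow input and attenuates the fast one;
# with k1 = +0.9 the roles are reversed.
for k1 in (-0.9, 0.9):
    print(k1,
          abs(adaptive_deemphasis(slow, k1)[-1]),
          abs(adaptive_deemphasis(fast, k1)[-1]))
```

In use, k.sub.1 would be recomputed for each frame, so the same recursion acts as a low-pass filter on frames dominated by low-frequency energy and as a high-pass filter on frames dominated by high-frequency energy.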
Preferably a post-processing dynamic programming technique is used to provide not only an optimal pitch value but also an optimal voicing decision. That is, both pitch and voicing are tracked from frame to frame, and a cumulative penalty for a sequence of frame pitch/voicing decisions is accumulated for various tracks to find the track which gives optimal pitch and voicing decisions. The cumulative penalty is obtained by imposing a frame error in going from one frame to the next. The frame error preferably not only penalizes large deviations in pitch period from frame to frame, but also penalizes pitch hypotheses which have a relatively poor correlation "goodness" value, and also penalizes changes in the voicing decision if the spectrum is relatively unchanged from frame to frame. This last feature of the frame transition error therefore forces voicing transitions towards the points of maximal spectral change.
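The dynamic programming idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the preferred embodiment: the penalty formulas, the unit weights, the use of a period of 0 to mark the unvoiced hypothesis, the normalization of "goodness" and spectral change to the range 0 to 1, and all function names are hypothetical.

```python
import math

UNVOICED = 0  # hypothetical marker: a period of 0 means "unvoiced"


def local_penalty(cand, w_good=1.0):
    """Penalize hypotheses with a poor correlation "goodness" (0..1)."""
    _, goodness = cand
    return w_good * (1.0 - goodness)


def transition_penalty(prev, cand, spectral_change, w_pitch=1.0, w_voicing=1.0):
    """Penalty for moving from hypothesis prev to hypothesis cand."""
    p_prev, p_cur = prev[0], cand[0]
    if p_prev != UNVOICED and p_cur != UNVOICED:
        # Both voiced: penalize large relative changes in pitch period.
        return w_pitch * abs(math.log(p_cur / p_prev))
    if p_prev == UNVOICED and p_cur == UNVOICED:
        return 0.0
    # Voicing changes: costly when the spectrum is nearly unchanged.
    return w_voicing * (1.0 - spectral_change)


def track(frames, spectral_changes):
    """Return the period chosen for each frame along the lowest-penalty track.

    frames[t] is a list of (period, goodness) hypotheses for frame t;
    spectral_changes[t] (0..1) describes the change from frame t-1 to t.
    """
    costs = [local_penalty(c) for c in frames[0]]
    pointers = []
    for t in range(1, len(frames)):
        new_costs, back = [], []
        for cand in frames[t]:
            totals = [costs[i] + transition_penalty(prev, cand, spectral_changes[t])
                      for i, prev in enumerate(frames[t - 1])]
            best = min(range(len(totals)), key=totals.__getitem__)
            new_costs.append(totals[best] + local_penalty(cand))
            back.append(best)
        costs = new_costs
        pointers.append(back)
    # Trace back from the cheapest final hypothesis.
    idx = min(range(len(costs)), key=costs.__getitem__)
    path = [idx]
    for back in reversed(pointers):
        idx = back[idx]
        path.append(idx)
    path.reverse()
    return [frames[t][i][0] for t, i in enumerate(path)]


# In the middle frame the doubled period (100) has the best local score.
frames = [
    [(50, 0.9), (100, 0.6), (UNVOICED, 0.1)],
    [(50, 0.8), (100, 0.9), (UNVOICED, 0.1)],
    [(51, 0.9), (102, 0.6), (UNVOICED, 0.1)],
]
changes = [0.0, 0.1, 0.1]
print(track(frames, changes))
```

Choosing the best candidate in each frame independently would select the doubled period of 100 in the middle frame. The cumulative penalty instead favors the track of periods 50, 50, 51, because the pitch-deviation penalty outweighs the small local advantage of the doubled candidate.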
According to the present invention there is provided:
A voice messaging system for receiving a human speech signal and reconstituting said human speech signal at a receiver which is spatially or temporally remote, comprising:
input means for receiving an analog input speech signal, said input speech signal being organized into a sequence of frames;
LPC analysis means connected to said receiving means for analyzing said input speech signal according to an LPC (Linear Predictive Coding) model to provide LPC parameters;
pitch extraction means for determining a plurality of pitch candidates for each of said frames in said sequence;
optimization means for performing dynamic programming, with respect both to said pitch candidates for each frame and also to a voiced/unvoiced decision for each frame, to determine both an optimal pitch and an optimal voicing decision for each frame in the context of said sequence of frames; and
means for encoding said LPC parameters and said optimal pitch and voicing decision for each frame.