This invention relates to methods for encoding and synthesizing speech.
Relevant publications include: J. L. Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pp. 378-386 (discusses the phase vocoder, a frequency-based speech analysis-synthesis system); Quatieri et al., "Speech Transformations Based on a Sinusoidal Representation", IEEE TASSP, Vol. ASSP-34, No. 6, December 1986, pp. 1449-1464 (discusses an analysis-synthesis technique based on a sinusoidal representation); Griffin et al., "Multi-band Excitation Vocoder", Ph.D. Thesis, M.I.T., 1987 (discusses Multi-Band Excitation analysis-synthesis); Griffin et al., "A New Pitch Detection Algorithm", Int. Conf. on DSP, Florence, Italy, Sept. 5-8, 1984 (discusses pitch estimation); Griffin et al., "A New Model-Based Speech Analysis/Synthesis System", Proc. ICASSP 85, pp. 513-516, Tampa, Fla., Mar. 26-29, 1985 (discusses alternative pitch likelihood functions and voicing measures); Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder", S. M. Thesis, M.I.T., May 1988 (discusses a 4.8 kbps speech coder based on the Multi-Band Excitation speech model); McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech", Proc. ICASSP 85, pp. 945-948, Tampa, Fla., Mar. 26-29, 1985 (discusses speech coding based on a sinusoidal representation); Almeida et al., "Harmonic Coding with Variable Frequency Synthesis", Proc. 1983 Spain Workshop on Sig. Proc. and its Applications, Sitges, Spain, September 1983 (discusses time domain voiced synthesis); Almeida et al., "Variable Frequency Synthesis: An Improved Harmonic Coding Scheme", Proc. ICASSP 84, San Diego, Calif., pp. 289-292, 1984 (discusses time domain voiced synthesis); McAulay et al., "Computationally Efficient Sine-Wave Synthesis and its Application to Sinusoidal Transform Coding", Proc. ICASSP 88, New York, N.Y., pp. 370-373, April 1988 (discusses frequency domain voiced synthesis); Griffin et al., "Signal Estimation From Modified Short-Time Fourier Transform", IEEE TASSP, Vol. 32, No. 2, pp. 236-243, April 1984 (discusses weighted overlap-add synthesis). The contents of these publications are incorporated herein by reference.
The problem of analyzing and synthesizing speech has a large number of applications, and as a result has received considerable attention in the literature. One class of speech analysis/synthesis systems (vocoders) which have been extensively studied and used in practice is based on an underlying model of speech. Examples of vocoders include linear prediction vocoders, homomorphic vocoders, and channel vocoders. In these vocoders, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. For this class of vocoders, speech is analyzed by first segmenting speech using a window such as a Hamming window. Then, for each segment of speech, the excitation parameters and system parameters are determined. The excitation parameters consist of the voiced/unvoiced decision and the pitch period. The system parameters consist of the spectral envelope or the impulse response of the system. In order to synthesize speech, the excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the estimated system parameters.
Even though vocoders based on this underlying speech model have been quite successful in synthesizing intelligible speech, they have not been successful in synthesizing high-quality speech. As a consequence, they have not been widely used in applications such as time-scale modification of speech, speech enhancement, or high-quality speech coding. The poor quality of the synthesized speech is due in part to the inaccurate estimation of the pitch, which is an important speech model parameter.
To improve the performance of pitch detection, a new method was developed by Griffin and Lim in 1984. This method was further refined by Griffin and Lim in 1988. This method is useful for a variety of different vocoders, and is particularly useful for a Multi-Band Excitation (MBE) vocoder.
Let s(n) denote a speech signal obtained by sampling an analog speech signal. The sampling rate typically used for voice coding applications ranges between 6 kHz and 10 kHz. The method works well for any sampling rate, with corresponding changes in the various parameters used in the method.
We multiply s(n) by a window w(n) to obtain a windowed signal $s_w(n)$. The window used is typically a Hamming window or Kaiser window. The windowing operation picks out a small segment of s(n). A speech segment is also referred to as a speech frame.
The objective in pitch detection is to estimate the pitch corresponding to the segment $s_w(n)$. We will refer to $s_w(n)$ as the current speech segment, and the pitch corresponding to the current speech segment will be denoted by $P_0$, where "0" refers to the "current" speech segment. We will also use P to denote $P_0$ for convenience. We then slide the window by some amount (typically around 20 ms), obtain a new speech frame, and estimate the pitch for the new frame. We will denote the pitch of this new speech segment as $P_1$. In a similar fashion, $P_{-1}$ refers to the pitch of the past speech segment. The notations useful in this description are $P_0$, corresponding to the pitch of the current frame; $P_{-2}$ and $P_{-1}$, corresponding to the pitch of the past two consecutive speech frames; and $P_1$ and $P_2$, corresponding to the pitch of the future speech frames.
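As a rough illustration of the framing described above, the following Python sketch cuts s(n) into Hamming-windowed frames with a 20 ms slide. The function name `windowed_frames`, the 256-sample frame length, and the 8 kHz default are assumptions made for the example; they are not values taken from this description.

```python
import numpy as np

def windowed_frames(s, fs=8000, frame_len=256, hop_ms=20.0):
    """Return an array whose rows are Hamming-windowed speech frames.

    frame_len and fs are illustrative assumptions; hop_ms follows the
    "around 20 ms" slide mentioned in the text.
    """
    hop = int(round(fs * hop_ms / 1000.0))   # slide in samples
    w = np.hamming(frame_len)                # analysis window w(n)
    starts = range(0, len(s) - frame_len + 1, hop)
    # Each row is one windowed segment s_w(n) = s(n) * w(n).
    return np.array([s[i:i + frame_len] * w for i in starts])
```

Each row of the result plays the role of one speech frame; consecutive rows correspond to the past, current, and future frames used in the tracking steps.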
The synthesized speech at the synthesizer corresponding to $s_w(n)$ will be denoted by $\hat{s}_w(n)$. The Fourier transforms of $s_w(n)$ and $\hat{s}_w(n)$ will be denoted by $S_w(\omega)$ and $\hat{S}_w(\omega)$, respectively.
The overall pitch detection method is shown in FIG. 1. The pitch P is estimated using a two-step procedure. We first obtain an initial pitch estimate denoted by $P_I$. The initial estimate is restricted to integer values. The initial estimate is then refined to obtain the final estimate P, which can be a non-integer value. The two-step procedure reduces the amount of computation involved.
To obtain the initial pitch estimate, we determine a pitch likelihood function, E(P), as a function of pitch. This likelihood function provides a means for the numerical comparison of candidate pitch values. Pitch tracking is used on this pitch likelihood function as shown in FIG. 2. In all our discussions of the initial pitch estimation, P is restricted to integer values. The function E(P) is obtained by ##EQU1## where r(n) is an autocorrelation function given by ##EQU2## Equations (1) and (2) can be used to determine E(P) only for integer values of P, since s(n) and w(n) are discrete signals.
The pitch likelihood function E(P) can be viewed as an error function, and typically it is desirable to choose the pitch estimate such that E(P) is small. We will see soon why we do not simply choose the P that minimizes E(P). Note also that E(P) is one example of a pitch likelihood function that can be used in estimating the pitch. Other reasonable functions may be used.
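Since other reasonable functions may be used, the sketch below shows one simple stand-in: a normalized autocorrelation error evaluated at integer lags. It is explicitly not Equation (1); the name `pitch_error`, the Hamming window, and the default range of candidate periods are assumptions chosen for illustration.

```python
import numpy as np

def pitch_error(frame, p_min=22, p_max=115):
    """Map each integer pitch period P to a stand-in error value.

    Assumed form (NOT the patent's Equation (1)):
        error(P) = 1 - r(P) / r(0),
    where r is the autocorrelation of the windowed frame.
    The frame must be longer than p_max and must not be all zeros.
    """
    sw = frame * np.hamming(len(frame))      # windowed segment
    r0 = np.dot(sw, sw)                      # r(0): frame energy
    errors = {}
    for p in range(p_min, p_max):
        r_p = np.dot(sw[:-p], sw[p:])        # r(P) at integer lag P
        errors[p] = 1.0 - r_p / r0
    return errors
```

A small value indicates that the frame is strongly self-similar at that lag, which is the sense in which the error is "small" for a good pitch candidate.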
Pitch tracking is used to improve the pitch estimate by attempting to limit the amount the pitch changes between consecutive frames. If the pitch estimate is chosen to strictly minimize E(P), then the pitch estimate may change abruptly between succeeding frames. This abrupt change in the pitch can cause degradation in the synthesized speech. In addition, pitch typically changes slowly; therefore, the pitch estimates from neighboring frames can aid in estimating the pitch of the current frame.
Look-back tracking is used to attempt to preserve some continuity of P from the past frames. Even though an arbitrary number of past frames can be used, we will use two past frames in our discussion.
Let $\hat{P}_{-1}$ and $\hat{P}_{-2}$ denote the initial pitch estimates of $P_{-1}$ and $P_{-2}$. In the current frame processing, $\hat{P}_{-1}$ and $\hat{P}_{-2}$ are already available from previous analysis. Let $E_{-1}(P)$ and $E_{-2}(P)$ denote the functions of Equation (1) obtained from the previous two frames. Then $E_{-1}(\hat{P}_{-1})$ and $E_{-2}(\hat{P}_{-2})$ will have some specific values.
Since we want continuity of P, we consider P in the range near $\hat{P}_{-1}$. The typical range used is
$$(1-\alpha)\cdot\hat{P}_{-1} \le P \le (1+\alpha)\cdot\hat{P}_{-1} \qquad (4)$$
where $\alpha$ is some constant.
We now choose the P that has the minimum E(P) within the range of P given by (4). We denote this P as P*. We now use the following decision rule:
$$\text{If } E_{-2}(\hat{P}_{-2}) + E_{-1}(\hat{P}_{-1}) + E(P^*) \le \text{Threshold}, \text{ then } P_I = P^* \qquad (5)$$
where $P_I$ is the initial pitch estimate of P.
If the condition in Equation (5) is satisfied, we now have the initial pitch estimate $P_I$. If the condition is not satisfied, then we move to the look-ahead tracking.
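The look-back step can be sketched as follows, assuming the error function of the current frame is available as a dictionary from integer pitch to error value. The function name, argument names, and the default of 0.2 for the range constant are hypothetical; the threshold is left as a required argument because no value is given here.

```python
def look_back(E, p_prev, e_prev1, e_prev2, threshold, alpha=0.2):
    """Look-back tracking sketch following rules (4) and (5).

    E        : dict mapping integer pitch P -> E(P), current frame
    p_prev   : initial pitch estimate of the previous frame
    e_prev1  : error of the previous frame at its estimate
    e_prev2  : error of the frame before that at its estimate
    Returns (P*, accepted); P* is None if no candidate is in range.
    """
    lo, hi = (1.0 - alpha) * p_prev, (1.0 + alpha) * p_prev
    # Small tolerance guards against floating-point error at the bounds.
    candidates = [p for p in E if lo - 1e-9 <= p <= hi + 1e-9]
    if not candidates:
        return None, False
    p_star = min(candidates, key=lambda p: E[p])           # range (4)
    accepted = e_prev2 + e_prev1 + E[p_star] <= threshold  # rule (5)
    return p_star, accepted
```

When `accepted` is false, the caller would continue with look-ahead tracking while keeping P* for the final comparison.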
Look-ahead tracking attempts to preserve some continuity of P with the future frames. Even though as many frames as desirable can be used, we will use two future frames for our discussion. From the current frame, we have E(P). We can also compute this function for the next two future frames. We will denote these as $E_1(P)$ and $E_2(P)$. This means that there will be a delay in processing by the amount that corresponds to two future frames.
We consider a reasonable range of P that covers essentially all reasonable values of P corresponding to human voice. For speech sampled at an 8 kHz rate, a good range of P to consider (expressed as the number of speech samples in each pitch period) is $22 \le P < 115$.
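As a worked conversion, using the standard relation $f_0 = f_s / P$ between pitch period in samples and fundamental frequency (a relation assumed here rather than stated above), this range at $f_s = 8$ kHz corresponds roughly to

$$\frac{8000}{114} \approx 70.2\ \text{Hz} \;\le\; f_0 \;\le\; \frac{8000}{22} \approx 363.6\ \text{Hz}$$

where 114 is the largest integer period below 115.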
For each P within this range, we choose a $P_1$ and $P_2$ such that CE(P) as given by (6) is minimized,
$$CE(P) = E(P) + E_1(P_1) + E_2(P_2) \qquad (6)$$
subject to the constraint that $P_1$ is "close" to P and $P_2$ is "close" to $P_1$. Typically these "closeness" constraints are expressed as:
$$(1-\alpha)P \le P_1 \le (1+\alpha)P \qquad (7)$$
and
$$(1-\beta)P_1 \le P_2 \le (1+\beta)P_1 \qquad (8)$$
This procedure is sketched in FIG. 3. Typical values for $\alpha$ and $\beta$ are $\alpha = \beta = 0.2$.
For each P, we can use the above procedure to obtain CE(P). We then have CE(P) as a function of P. We use the notation CE to denote the "cumulative error".
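A direct, unoptimized sketch of this computation is given below. It assumes the three error functions are dictionaries from integer pitch to error value; the names `cumulative_error` and `_near` are hypothetical, and the exhaustive search is chosen for clarity, not efficiency.

```python
def _near(E, center, tol):
    """Pitch values in E within a fractional tolerance of center."""
    lo, hi = (1.0 - tol) * center, (1.0 + tol) * center
    return [p for p in E if lo - 1e-9 <= p <= hi + 1e-9]

def cumulative_error(E0, E1, E2, alpha=0.2, beta=0.2):
    """Return a dict P -> CE(P) following (6), (7) and (8).

    E0, E1, E2 : dicts mapping integer pitch -> error for the current
                 frame and the two future frames.
    """
    CE = {}
    for p in E0:
        best = float("inf")
        for p1 in _near(E1, p, alpha):                 # constraint (7)
            tail = min((E2[p2] for p2 in _near(E2, p1, beta)),
                       default=float("inf"))           # constraint (8)
            best = min(best, E1[p1] + tail)
        CE[p] = E0[p] + best                           # equation (6)
    return CE
```

The minimizing P of the returned dictionary is the starting point for the sub-multiple check described next.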
Very naturally, we wish to choose the P that gives the minimum CE(P). However, there is one problem, called the "pitch doubling problem". The pitch doubling problem arises because CE(2P) is typically small when CE(P) is small. Therefore, a method based strictly on the minimization of the function CE(·) may choose 2P as the pitch even though P is the correct choice. When the pitch doubling problem occurs, there is considerable degradation in the quality of synthesized speech. The pitch doubling problem is avoided by using the method described below. Suppose P' is the value of P that gives rise to the minimum CE(P). Then we consider P = P', P'/2, P'/3, P'/4, . . . in the allowed range of P (typically $22 \le P < 115$). If P'/2, P'/3, P'/4, . . . are not integers, we choose the integers closest to them. Let us suppose P', P'/2, and P'/3 are in the proper range. We begin with the smallest value of P, in this case P'/3, and use the following rule in the order presented.
If ##EQU3## where $P_F$ is the estimate from the forward look-ahead feature.
If ##EQU4##
Some typical values of $\alpha_1, \alpha_2, \beta_1, \beta_2$ are: ##EQU5##
If P'/3 is not chosen by the above rule, then we go to the next lowest, which is P'/2 in the above example. Eventually one will be chosen, or we reach P=P'. If P=P' is reached without any choice, then the estimate $P_F$ is given by P'.
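The candidate search order can be sketched as follows. The acceptance rule itself is not reproduced; it is passed in as a hypothetical callable `accept`, and the function names and the round-half-up choice for "closest integer" are assumptions made for this example.

```python
import math

def submultiple_candidates(p_prime, p_min=22, p_max=115):
    """Integers closest to P', P'/2, P'/3, ... inside [p_min, p_max)."""
    cands = []
    k = 1
    while True:
        p = math.floor(p_prime / k + 0.5)    # nearest integer, ties up
        if p < p_min:
            break
        if p < p_max and p not in cands:
            cands.append(p)
        k += 1
    return sorted(cands)

def forward_estimate(p_prime, accept, p_min=22, p_max=115):
    """Try sub-multiples from smallest to largest; fall back to P'.

    accept : hypothetical callable implementing the acceptance rule,
             returning True if the candidate should be chosen.
    """
    for p in submultiple_candidates(p_prime, p_min, p_max):
        if p != p_prime and accept(p):
            return p
    return p_prime
```

For example, with P' = 100 the candidates tried in order are 25, 33, and 50, and P' itself is returned only if none of them is accepted.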
The final step is to compare $P_F$ with the estimate obtained from look-back tracking, P*. Either $P_F$ or P* is chosen as the initial pitch estimate, $P_I$, depending upon the outcome of this decision. One common set of decision rules which is used to compare the two pitch estimates is:
If
$$CE(P_F) < E_{-2}(\hat{P}_{-2}) + E_{-1}(\hat{P}_{-1}) + E(P^*), \text{ then } P_I = P_F \qquad (11)$$
Else if
$$CE(P_F) \ge E_{-2}(\hat{P}_{-2}) + E_{-1}(\hat{P}_{-1}) + E(P^*), \text{ then } P_I = P^* \qquad (12)$$
Other decision rules could be used to compare the two candidate pitch values.
The initial pitch estimation method discussed above generates an integer value of pitch. A block diagram of this method is shown in FIG. 4. Pitch refinement increases the resolution of the pitch estimate to a higher sub-integer resolution. Typically the refined pitch has a resolution of 1/4 integer or 1/8 integer.
We consider a small number (typically 4 to 8) of high-resolution values of P near $P_I$. We evaluate $E_r(P)$ given by ##EQU6## where $G(\omega)$ is an arbitrary weighting function and where ##EQU7## The parameter $\omega_0 = 2\pi/P$ is the fundamental frequency and $W_r(\omega)$ is the Fourier transform of the pitch refinement window, $w_r(n)$ (see FIG. 1). The complex coefficients, $A_M$, in (16), represent the complex amplitudes at the harmonics of $\omega_0$. These coefficients are given by ##EQU8## The form of $\hat{S}_w(\omega)$ given in (15) corresponds to a voiced or periodic spectrum.
Note that other reasonable error functions can be used in place of (13), for example ##EQU9## Typically the window function $w_r(n)$ is different from the window function used in the initial pitch estimation step.
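The refinement search itself is a small grid search around the integer estimate. The sketch below assumes the refinement error is supplied as a hypothetical callable `refinement_error`; the quarter-sample step and the half-sample search width are illustrative choices that yield five candidates.

```python
def refine_pitch(p_initial, refinement_error, step=0.25, half_width=0.5):
    """Return the fractional pitch near p_initial with the lowest error.

    refinement_error : hypothetical callable P -> error value, standing
                       in for the refinement error function.
    step, half_width : assumed grid spacing and search extent.
    """
    n = int(round(half_width / step))
    candidates = [p_initial + k * step for k in range(-n, n + 1)]
    return min(candidates, key=refinement_error)
```

With the defaults and an initial estimate of 40, the candidates are 39.5, 39.75, 40, 40.25, and 40.5.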
An important speech model parameter is the voiced/unvoiced information. This information determines whether the speech is primarily composed of the harmonics of a single fundamental frequency (voiced), or whether it is composed of wideband "noise-like" energy (unvoiced). In many previous vocoders, such as Linear Predictive Vocoders or Homomorphic Vocoders, each speech frame is classified as either entirely voiced or entirely unvoiced. In the MBE vocoder the speech spectrum, $S_w(\omega)$, is divided into a number of disjoint frequency bands, and a single voiced/unvoiced (V/UV) decision is made for each band.
The voiced/unvoiced decisions in the MBE vocoder are determined by dividing the frequency range $0 \le \omega \le \pi$ into L bands as shown in FIG. 5. The constants $\Omega_0 = 0, \Omega_1, \ldots, \Omega_{L-1}, \Omega_L = \pi$ are the boundaries between the L frequency bands. Within each band a V/UV decision is made by comparing some voicing measure with a known threshold. One common voicing measure is given by ##EQU10## where $\hat{S}_w(\omega)$ is given by Equations (15) through (17). Other voicing measures could be used in place of (19). One example of an alternative voicing measure is given by ##EQU11##
The voicing measure $D_l$ defined by (19) is the difference between $S_w(\omega)$ and $\hat{S}_w(\omega)$ over the l'th frequency band, which corresponds to $\Omega_l < \omega < \Omega_{l+1}$. $D_l$ is compared against a threshold function. If $D_l$ is less than the threshold function, then the l'th frequency band is determined to be voiced. Otherwise the l'th frequency band is determined to be unvoiced. The threshold function typically depends on the pitch and the center frequency of each band.
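The per-band decision can be sketched as below. The voicing measure used here, a squared spectral difference normalized by the band energy, is an assumed form chosen for illustration and is not claimed to match Equation (19); the band edges are given as FFT bin indices, and all names are hypothetical.

```python
import numpy as np

def band_voicing(S, S_hat, edges, thresholds):
    """Return one voiced (True) / unvoiced (False) decision per band.

    S, S_hat   : sampled original and synthetic (voiced-model) spectra
    edges      : bin indices of the band boundaries, length L + 1
    thresholds : one threshold per band
    """
    decisions = []
    for l in range(len(edges) - 1):
        band = slice(edges[l], edges[l + 1])
        num = np.sum(np.abs(S[band] - S_hat[band]) ** 2)
        den = np.sum(np.abs(S[band]) ** 2)
        d_l = num / den if den > 0 else np.inf   # assumed measure
        decisions.append(bool(d_l < thresholds[l]))
    return decisions
```

A band where the periodic model fits the original spectrum well yields a small measure and is marked voiced; a poor fit marks the band unvoiced.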
In a number of vocoders, including the MBE Vocoder, the Sinusoidal Transform Coder, and the Harmonic Coder, the synthesized speech is generated all or in part by the sum of harmonics of a single fundamental frequency. In the MBE vocoder this comprises the voiced portion of the synthesized speech, $\nu(n)$. The unvoiced portion of the synthesized speech is generated separately and then added to the voiced portion to produce the complete synthesized speech signal.
There are two different techniques which have been used in the past to synthesize a voiced speech signal. The first technique synthesizes each harmonic separately in the time domain using a bank of sinusoidal oscillators. The phase of each oscillator is generated from a low-order piecewise phase polynomial which smoothly interpolates between the estimated parameters. The advantage of this technique is that the resulting speech quality is very high. The disadvantage is that a large number of computations are needed to generate each sinusoidal oscillator. The computational cost of this technique may be prohibitive if a large number of harmonics must be synthesized.
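A minimal sketch of the oscillator-bank idea follows. It assumes a quadratic phase polynomial (linearly interpolated fundamental frequency) and linearly interpolated amplitudes across one frame; the cited systems may use a different polynomial order, and all names are hypothetical.

```python
import numpy as np

def synth_voiced(amps_start, amps_end, w0_start, w0_end,
                 phases_start, n_samples):
    """Sum of harmonic oscillators over one frame.

    amps_start/amps_end : harmonic amplitudes at the frame boundaries
    w0_start/w0_end     : fundamental (radians/sample) at the boundaries
    phases_start        : starting phase of each harmonic
    Returns (signal, phases_end) so the next frame can continue
    with phase continuity.
    """
    n = np.arange(n_samples)
    frac = n / float(n_samples)
    out = np.zeros(n_samples)
    phases_end = []
    harmonics = zip(amps_start, amps_end, phases_start)
    for m, (a0, a1, phi) in enumerate(harmonics, start=1):
        amp = (1.0 - frac) * a0 + frac * a1
        # Quadratic phase: running sum of a linearly varying frequency.
        theta = phi + m * (w0_start * n
                           + (w0_end - w0_start) * n ** 2
                           / (2.0 * n_samples))
        out += amp * np.cos(theta)
        phases_end.append(phi + m * 0.5 * (w0_start + w0_end) * n_samples)
    return out, phases_end
```

The cost noted above is visible in the loop: every harmonic requires a full-length cosine evaluation for every frame.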
The second technique which has been used in the past to synthesize a voiced speech signal is to synthesize all of the harmonics in the frequency domain, and then to use a Fast Fourier Transform (FFT) to simultaneously convert all of the synthesized harmonics into the time domain. A weighted overlap-add method is then used to smoothly interpolate the output of the FFT between speech frames. Since this technique does not require the computations involved with the generation of the sinusoidal oscillators, it is computationally much more efficient than the time-domain technique discussed above. The disadvantage of this technique is that for typical frame rates used in speech coding (20-30 ms), the voiced speech quality is reduced in comparison with the time-domain technique.
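The frequency-domain approach can be sketched in two steps: place each harmonic in the nearest FFT bin and invert, then cross-fade successive frames with a window. Both simplifications are assumptions of this example: nearest-bin placement ignores the window spectrum, and the overlap-add here normalizes by the summed window so that the weights add to one, which may differ from the weighting used in the cited work. All names are hypothetical.

```python
import numpy as np

def harmonic_frame(amps, phases, w0, n_fft):
    """Build one frame by filling FFT bins and inverting."""
    spec = np.zeros(n_fft // 2 + 1, dtype=complex)
    for m, (a, phi) in enumerate(zip(amps, phases), start=1):
        k = int(round(m * w0 * n_fft / (2.0 * np.pi)))  # nearest bin
        if 0 < k < n_fft // 2:
            # Scaling chosen so irfft yields a * cos(2*pi*k*n/N + phi).
            spec[k] += 0.5 * n_fft * a * np.exp(1j * phi)
    return np.fft.irfft(spec, n=n_fft)

def overlap_add_interpolate(frames, window, hop):
    """Cross-fade frames placed hop samples apart using the window."""
    n_fft = len(window)
    total = hop * (len(frames) - 1) + n_fft
    num = np.zeros(total)
    den = np.zeros(total)
    for i, frame in enumerate(frames):
        start = i * hop
        num[start:start + n_fft] += window * frame
        den[start:start + n_fft] += window
    return num / np.maximum(den, 1e-12)   # guard against zero weight
```

One inverse FFT per frame replaces the per-harmonic cosine evaluations, which is the source of the computational saving; the cross-fade between frames is where the quality loss relative to the time-domain method arises.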