The invention relates to electronic devices, and, more particularly, to speech coding, transmission, storage, and synthesis circuitry and methods.
Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 KHz; however, the band of about 100 Hz to 5 KHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog voltage signal stream (e.g., using a microphone) for transmission and reconversion to an acoustic signal stream (e.g., using a loudspeaker). The electrical signals would be bandpass filtered to retain only the 300 Hz to 4 KHz band to limit bandwidth and avoid low frequency problems. However, the advantages of digital electrical signal transmission have inspired a conversion to digital telephone transmission beginning in the 1960s. Typically, digital telephone signals derive from sampling analog signals at 8 KHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electric signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 Kbps (kilobits per second) and this exceeds the former analog signal transmission bandwidth.
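The nonlinear quantization mentioned above can be sketched as follows. This is only an illustration of the continuous μ-law curve followed by uniform 8-bit quantization; the constant μ=255 and the function names are assumptions made here, and deployed telephone codecs use a piecewise-linear approximation of this curve rather than the exact logarithm.

```python
import math

MU = 255.0  # assumed companding constant; not stated in the text above


def mu_law_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] with logarithmic companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x):
    """Clip, compress, then quantize to one of 256 uniform levels (0..255)."""
    y = mu_law_compress(max(-1.0, min(1.0, x)))
    return int(round((y + 1.0) / 2.0 * 255))


def decode_8bit(code):
    """Recover an approximate sample from an 8-bit code."""
    y = code / 255.0 * 2.0 - 1.0
    return mu_law_expand(y)
```

Because the compression curve is steep near zero, small-amplitude samples receive finer quantization steps than large ones, which is the point of companding.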
The storage of speech information in analog format (for example, on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 KHz would require about 5 MB (megabytes) of storage.
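The transmission rate and storage figures quoted for 8-bit PCM at 8 KHz follow from simple arithmetic, which the short sketch below reproduces. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
BITS_PER_SAMPLE = 8     # code width stated in the text


def pcm_bit_rate(sample_rate_hz=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE):
    """Bits per second needed for uncompressed PCM."""
    return sample_rate_hz * bits


def pcm_storage_bytes(seconds, sample_rate_hz=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE):
    """Bytes needed to store the given duration of uncompressed PCM."""
    return seconds * pcm_bit_rate(sample_rate_hz, bits) // 8


print(pcm_bit_rate())               # 64000 bits per second
print(pcm_storage_bytes(10 * 60))   # 4800000 bytes, i.e. about 5 MB
```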
The demand for lower transmission rates and storage requirements has led to development of compression for speech signals. One approach to speech compression models the physiological generation of speech and thereby reduces the necessary information to be transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train with pitch period P (for voiced sounds) or white noise (for unvoiced sounds) followed by amplification to adjust the loudness. 1/A(z) traditionally denotes the z transform of the filter's transfer function. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision plus adjusting the filter coefficients and the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976). FIG. 1 illustrates the model, and FIGS. 2a-3b illustrate sounds. In particular, FIG. 2a shows the waveform for the voiced sound /ae/ and FIG. 2b its Fourier transform; and FIG. 3a shows the unvoiced sound /sh/ and FIG. 3b its Fourier transform.
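A minimal sketch of this production model may help fix ideas. It assumes the convention A(z) = 1 + a(1)z^-1 + ... + a(M)z^-M, unit-height pulses for the voiced excitation, and Gaussian samples for the white noise; the function name, the seed parameter, and these choices are illustrative assumptions rather than details taken from the text.

```python
import random


def synthesize(a, gain, n_samples, pitch_period=None, seed=0):
    """Drive the all-pole filter 1/A(z) with a pulse train or white noise.

    a            -- list [a(1), ..., a(M)] of filter coefficients (assumed convention)
    gain         -- loudness factor applied to the filter output
    pitch_period -- pulse spacing in samples for voiced sound, None for unvoiced
    """
    rng = random.Random(seed)
    y = []
    for n in range(n_samples):
        if pitch_period is not None:
            e = 1.0 if n % pitch_period == 0 else 0.0   # voiced: pulse train
        else:
            e = rng.gauss(0.0, 1.0)                     # unvoiced: white noise
        # All-pole recursion: y(n) = e(n) - sum_j a(j) y(n-j)
        feedback = sum(a[j - 1] * y[n - j]
                       for j in range(1, len(a) + 1) if n - j >= 0)
        y.append(e - feedback)
    return [gain * v for v in y]
```

For example, `synthesize([-0.5], 2.0, 5, pitch_period=4)` returns `[2.0, 1.0, 0.5, 0.25, 2.125]`: each pulse rings down through the one-pole filter until the next pulse arrives.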
The filter coefficients may be derived as follows. First, let s′(t) be the analog speech waveform as a function of time, and e′(t) be the analog speech excitation (pulse train or white noise). Take the sampling frequency fs to have period T (so fs=1/T), and set s(n)=s′(nT) (so . . . s(n−1), s(n), s(n+1), . . . is the stream of speech samples), and set e(n)=e′(nT) (so . . . e(n−1), e(n), e(n+1), . . . are the samples of the excitation). Then taking z transforms yields S(z)=E(z)/A(z) or, equivalently, E(z)=A(z)S(z) where 1/A(z) is the z transform of the transfer function of the filter. A(z) is an all-zero filter and 1/A(z) is an all-pole filter. Deriving the excitation, gain, and filter coefficients from speech samples is an analysis or coding of the samples, and reconstructing the speech from the excitation, gain, and filter coefficients is a decoding or synthesis of speech. The peaks in 1/A(z) correspond to resonances of the vocal tract and are termed “formants”. FIG. 4 heuristically shows the relations between voiced speech and voiced excitation with a particular filter A(z).
With A(z) taken as a finite impulse response filter of order M, the equation E(z)=A(z)S(z) in the time domain becomes, with a(0)=1 for normalization:

$$e(n) = \sum_{j=0}^{M} a(j)\,s(n-j) = s(n) + \sum_{j=1}^{M} a(j)\,s(n-j)$$
Thus by deeming e(n) a “linear prediction error” between the actual sample s(n) and the “linear prediction” −Σ a(j)s(n−j) (sum over 1≤j≤M), the filter coefficients a(j) can be determined from a set of samples s(n) by minimizing the prediction “error” Σ e(n)².
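One common way to carry out this minimization is the autocorrelation method solved with the Levinson-Durbin recursion, sketched below. The text does not say which minimization procedure is intended, so this choice, along with the function names, is an assumption made for illustration. The coefficients follow the convention e(n) = s(n) + Σ a(j)s(n−j).

```python
def autocorrelation(s, max_lag):
    """r(k) = sum_n s(n) s(n-k) for k = 0..max_lag."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(max_lag + 1)]


def lpc_coefficients(s, order):
    """Return [a(1), ..., a(M)] via the Levinson-Durbin recursion."""
    r = autocorrelation(s, order)
    a = [1.0] + [0.0] * order      # a(0) = 1 for normalization
    err = r[0]                     # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:             # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err             # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:]


def prediction_error(s, a):
    """e(n) = s(n) + sum_j a(j) s(n-j), treating samples before n=0 as zero."""
    return [s[n] + sum(a[j - 1] * s[n - j]
                       for j in range(1, len(a) + 1) if n - j >= 0)
            for n in range(len(s))]
```

As a sanity check, feeding in the decaying sequence s(n) = 0.9^n, which is the impulse response of 1/(1 − 0.9z^-1), yields a(1) very close to −0.9 and a prediction error that is essentially a single impulse at n=0.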
A stream of speech samples s(n) may be partitioned into “frames” of 180 successive samples (22.5 msec intervals), and the samples in a frame provide the data for computing the filter coefficients for use in coding and synthesis of the sound associated with the frame. Typically, M is taken as 10 or 12. Encoding a frame requires bits for the LPC coefficients, the pitch, the voiced/unvoiced decision, and the gain, and so the transmission rate may be only 2.4 Kbps rather than the 64 Kbps of PCM. In practice, the filter coefficients must be quantized for transmission, and the sensitivity of the filter behavior to the quantization error has led to quantization based on the Line Spectrum Pair representation.
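The frame figures above can be checked with a few lines of arithmetic: 180 samples at 8 KHz last 22.5 msec, so a 2.4 Kbps coder has 54 bits to spend on each frame, against 1440 bits for PCM. The helper names below are illustrative, and the handling of a trailing partial frame (it is simply dropped) is an assumption of this sketch.

```python
FRAME_SAMPLES = 180
SAMPLE_RATE_HZ = 8000


def frame_duration_seconds():
    """Duration of one frame of speech."""
    return FRAME_SAMPLES / SAMPLE_RATE_HZ


def bits_per_frame(bit_rate_bps):
    """Bit budget available for one frame at the given transmission rate."""
    return bit_rate_bps * FRAME_SAMPLES / SAMPLE_RATE_HZ


def split_into_frames(samples, frame_len=FRAME_SAMPLES):
    """Partition samples into consecutive full frames, dropping any remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


print(frame_duration_seconds())   # 0.0225
print(bits_per_frame(2400))       # 54.0
print(bits_per_frame(64000))      # 1440.0
```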
The determination of the pitch period P presents a difficult problem because 2P, 3P, . . . are also periods and the sampling quantization and the formants can distort magnitudes. In fact, W. Hess, Pitch Determination of Speech Signals (Springer, 1983) presents many different methods for pitch determination. For example, the pitch period estimate for a frame may be found by searching for maximum correlations of translates of the speech signal. Indeed, Medan et al., Super Resolution Pitch Determination of Speech Signals, 39 IEEE Tr. Sig. Proc. 40 (1991) describe a pitch period determination which first looks at correlations of two adjacent segments of speech with variable segment lengths and determines an integer pitch as the segment length which yields the maximum correlation. Then linear interpolation of correlations about the maximum correlation gives a pitch period which may be a nonintegral multiple of the sampling period.
The voiced/unvoiced decision for a frame may be made by comparing the maximum correlation c(k) found in the pitch search with a threshold value: if the maximum c(k) is too low, then the frame will be unvoiced, otherwise the frame is voiced and uses the pitch period found.
The overall loudness of a frame may be estimated simply as the root-mean-square of the frame samples, taking into account the gain of the LPC filtering. This provides the gain to apply in the synthesis.
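As a rough sketch, the root-mean-square of a frame and a matching synthesis gain might be computed as below. The text does not spell out how the LPC filter gain is taken into account; the approach shown (scaling so that a unit-gain synthesized frame matches the RMS of the original frame) is one plausible reading and is an assumption, as are the function names. Frames are assumed to be non-empty.

```python
import math


def rms(frame):
    """Root-mean-square level of a non-empty frame of samples."""
    return math.sqrt(sum(v * v for v in frame) / len(frame))


def synthesis_gain(original_frame, unit_gain_synth_frame):
    """Gain that makes the synthesized frame as loud as the original.

    unit_gain_synth_frame is the output of the LPC synthesis filter driven
    by the excitation with gain 1, so the ratio absorbs the filter's own gain.
    """
    reference = rms(unit_gain_synth_frame)
    return rms(original_frame) / reference if reference > 0.0 else 0.0
```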
To reduce the bit rate, the coefficients for successive frames may be interpolated.
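A minimal sketch of such interpolation, assuming simple linear blending between the parameter vectors of two successive frames, is given below. The text does not state which representation of the coefficients is interpolated or at what granularity, so the function name and the linear rule are illustrative assumptions.

```python
def interpolate_parameters(prev_frame, curr_frame, fraction):
    """Linearly blend two equal-length parameter vectors.

    fraction = 0.0 returns the previous frame's values and
    fraction = 1.0 returns the current frame's values.
    """
    return [(1.0 - fraction) * p + fraction * c
            for p, c in zip(prev_frame, curr_frame)]
```

For example, `interpolate_parameters([0.0, 2.0], [1.0, 4.0], 0.5)` returns `[0.5, 3.0]`. Note that blending direct-form filter coefficients does not by itself guarantee a stable synthesis filter, which is one reason an alternative representation may be preferred in practice.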
However, to improve the sound quality, further information may be extracted from the speech, compressed and transmitted or stored. For example, the codebook excitation linear prediction (CELP) method first analyzes a speech frame to find A(z) and filters the speech. Next, a pitch period determination is made and a comb filter removes this periodicity to yield a noise-looking excitation signal. Then the excitation signals are encoded in a codebook. Thus CELP transmits the LPC filter coefficients, the pitch, and the codebook index of the excitation.
Another approach is to mix voiced and unvoiced excitations for the LPC filter. For example, McCree, A New LPC Vocoder Model for Low Bit Rate Speech Coding, PhD thesis, Georgia Institute of Technology, August 1992, divides the excitation frequency range into bands, makes the voiced/unvoiced mixture decision in each band separately, and combines the results for the total excitation. The pitch determination proceeds as follows. First, lowpass filter (cutoff at about 1200 Hz) the speech because the pitch frequency should fall in the range of 100 Hz to 400 Hz. Next, filter with A(z) in order to remove the formant structure and, hopefully, yield e(n). Then compute a normalized correlation for each translate k:
$$c(k) = \frac{\sum_n e(n)\,e(n-k)}{\sqrt{\sum_n e(n)^2 \,\sum_n e(n-k)^2}}$$
where both sums are over a fixed number of samples, which should be as large as the maximum expected pitch period. The k maximizing c(k) yields a pitch period estimation as kT. Then check whether kT is in fact a multiple of a fundamental pitch period. A frame is classified as strongly voiced if a maximum normalized c(k) is greater than 0.7, weakly voiced if the maximum c(k) is between 0.4 and 0.7, and further analyzed if the maximum c(k) is less than 0.4. A maximum c(k) less than 0.4 may be due to unvoiced sounds or the A(z) filtering may be obscuring the pitch as when the pitch frequency lies close to a formant, so again compute correlations but using the unfiltered speech signals s(n). If the maximum correlation is still small, then the frame will be classified as unvoiced.
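The correlation search and the voicing thresholds described above can be sketched as follows. The sketch assumes the correlation is normalized by the square root of the product of the two energies, so that c(k) lies between −1 and 1 and the thresholds 0.4 and 0.7 are meaningful. The function names, the window placement, and the lag range of 20 to 80 samples (400 Hz down to 100 Hz at 8 KHz sampling) are illustrative assumptions, and the check for pitch multiples is left out.

```python
import math


def normalized_correlation(e, start, k, window):
    """c(k) over samples start..start+window-1; requires start - k >= 0."""
    idx = range(start, start + window)
    num = sum(e[n] * e[n - k] for n in idx)
    energy_now = sum(e[n] ** 2 for n in idx)
    energy_lag = sum(e[n - k] ** 2 for n in idx)
    den = math.sqrt(energy_now * energy_lag)
    return num / den if den > 0.0 else 0.0


def pitch_search(e, start, window, k_min=20, k_max=80):
    """Return (best lag in samples, its correlation); ties go to the smaller lag."""
    best_k = max(range(k_min, k_max + 1),
                 key=lambda k: normalized_correlation(e, start, k, window))
    return best_k, normalized_correlation(e, start, best_k, window)


def classify(c_max):
    """Apply the 0.7 and 0.4 thresholds to the maximum correlation."""
    if c_max > 0.7:
        return "strongly voiced"
    if c_max >= 0.4:
        return "weakly voiced"
    return "needs further analysis"
```

On an ideal pulse train with a period of 40 samples, lags 40 and 80 both give a correlation of 1, which illustrates why the multiple-of-pitch check is needed; this sketch happens to return 40 only because ties are resolved toward the smaller lag.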
The present invention recognizes that in the mixed excitation linear prediction method the inaccuracy of an integer period pitch determination for high-pitched female speakers can lead to a locking on to a pitch for artificially long time periods with abrupt discontinuity in the pitch contour at a change to a new pitch. Also, the invention recognizes that telephone-bandwidth speech typically has filtered out the 100-200 Hz pitch fundamental for male speakers and this leads to pitch estimation and excitation mixture errors. The invention provides pitch period determinations which do not have to be multiples of the sampling period and uses the corresponding correlations for mixture control and also for integer pitch determinations.
The invention has technical advantages including natural sounding speech from a low bit rate encoding.