This invention relates generally to speech recognition for the purpose of speech to text conversion and, in particular, to speech reconstruction from speech recognition features.
In the following description reference is made to the following publications:
[1] Kazuhito Koishida, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, "CELP Coding Based on Mel Cepstral Analysis", Speech ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings v 1 1995. IEEE, Piscataway, N.J. [See definition of Mel Cepstrum on page 33].
[2] Stylianou, Yannis; Cappe, Olivier; Moulines, Eric, "Continuous probabilistic transform for voice conversion", IEEE Transactions on Speech and Audio Processing v 6 n 2 March 1998, pp. 131-142 [See page 137 defining the cepstral parameters c(i)].
[3] McAulay, R. J.; Quatieri, T. F., "Speech Analysis-Synthesis Based on a Sinusoidal Representation", IEEE Trans. Acoust. Speech, Signal Processing Vol. ASSP-34, No. 4, August 1986.
[4] L. B. Almeida, F. M. Silva, "Variable-Frequency Synthesis: An improved Harmonic Coding Scheme", Proc ICASSP pp. 237-244 1984.
[5] McAulay, R. J.; Quatieri, T. F., "Sinusoidal Coding in Speech Coding and Synthesis", W. Kleijn and K. Paliwal Eds., Elsevier 1995 ch. 4.
[6] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans ASSP, Vol. 28, No. 4, pp. 357-366, 1980.
All speech recognition schemes for the purpose of speech to text conversion start by converting the digitized speech to a set of features that are then used in all subsequent stages of the recognition process. These features, usually sampled at regular intervals, extract in some sense the speech content of the spectrum of the speech signal. In many systems, the features are obtained by the following three-step procedure:
(a) deriving at successive instances of time an estimate of the spectral envelope of the digitized speech signal,
(b) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, and
(c) assigning the computed integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors.
The centers of mass of successive weight functions increase monotonically. A typical example is the Mel Cepstrum, which is obtained by a specific set of weight functions that are used to obtain the integrals of the products of the spectrum and the weight functions at step (b). These integrals are called 'bin' values and form a binned spectrum. The truncated logarithm of the binned spectrum is then computed and the resulting vector is cosine transformed to obtain the Mel Cepstral values.
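The three-step procedure above can be sketched as follows. This is a minimal illustration, not the method of any particular recognizer: the function names, the floor value used for the truncated logarithm, and the choice of an unnormalized DCT-II as the cosine transform are all assumptions made for the example.

```python
import math

def binned_spectrum(envelope, windows):
    """Step (b): weighted sum of the sampled envelope under each window."""
    return [sum(e * w for e, w in zip(envelope, win)) for win in windows]

def cepstrum_from_bins(bins, num_coeffs, floor=1e-10):
    """Truncated log of the bins followed by a cosine transform.

    Assumptions: `floor` is an arbitrary small constant, and the cosine
    transform is taken to be an unnormalized DCT-II.
    """
    n = len(bins)
    logs = [math.log(max(b, floor)) for b in bins]
    return [
        sum(logs[k] * math.cos(math.pi * m * (k + 0.5) / n) for k in range(n))
        for m in range(num_coeffs)
    ]

# Toy data: a flat envelope and three overlapping triangular windows.
envelope = [1.0] * 8
windows = [
    [0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
]
bins = binned_spectrum(envelope, windows)   # step (b)
features = cepstrum_from_bins(bins, 3)      # step (c)
```

For a flat envelope all bins are equal, so only the zeroth cepstral coefficient is non-zero, which is a quick sanity check on the transform.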
There are a number of applications that require the ability to reproduce the speech from these features. For example, the speech recognition may be carried out on a remote server, and at some other station connected to that server it is desired to listen to the original speech. Because of channel bandwidth limitation, it is not possible to send the original speech signal from the client device used as an input device to the server and from that server to another remote client device. Therefore, the speech signal must be compressed. On the other hand, it is imperative that the compression scheme used to compress the speech will not affect the recognition rate.
An effective way to do that is to simply send a compressed version of the recognition features themselves, as it may be expected that all redundant information has been already removed in generating these features. This means that an optimal compression rate can be attained. Because the transformation from speech signal to features is a many-to-one transformation, i.e. it is not invertible, it is not evident how the reproduction of speech from features can be carried out, if at all.
To a first approximation, the speech signal at any time can be assumed to be voiced, unvoiced or silent. The voiced segments represent instances where the speech signal is nearly periodic. For speech signals, this period is called the pitch. To measure the degree to which the signal can be approximated by a periodic signal, "windows" are defined. These are smooth functions, e.g. Hamming functions, whose width is chosen to be short enough that inside each window the signal may be approximated by a periodic function. The purpose of the window function is to discount the effects of the drift away from periodicity at the edges of the analysis interval. The window centers are placed at regular intervals on the time axis. The analysis units are then defined to be the product of the signal and the window function, representing frames of the signal. On each frame, the windowed square distance between the true spectrum and its periodic approximation may serve as a measure of periodicity. It is well known that any periodic signal can be represented as a sum of sine waves that are periodic with the period of the signal. Each sine wave is characterized by its amplitude and phase. For any given fundamental frequency (pitch) of the speech signal, the sequence of complex numbers representing the amplitudes and phases of the coefficients of the sine waves will be referred to as the "line spectrum". It turns out that it is possible to compute a line spectrum for speech that contains enough information to reproduce the speech signal so that the human ear will judge it almost indistinguishable from the original signal (Almeida [4], McAulay et al. [5]).
A particularly simple way to reproduce the signal from the sequence of line spectra corresponding to a sequence of frames is to sum up the sine waves for each frame, multiply each sum by its window, and add these signal segments over all frames to obtain segments of reconstructed speech of arbitrary length. This procedure will be effective if the windows sum up to a roughly constant time function.
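The overlap-add reconstruction just described can be sketched as below. The sketch makes several assumptions that are not stated in the text: a periodic Hann window (chosen because its half-overlapped copies sum exactly to one), harmonics numbered from 1 at integer multiples of the pitch, and phases referenced to absolute time so that a constant-pitch signal stays aligned across frames. The function names are hypothetical.

```python
import math

def hann(n, length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)

def overlap_add(frames, frame_len, hop, sample_rate):
    """Sum the sine waves of each frame, window the sum, and overlap-add.

    frames: one (pitch_hz, [(amplitude, phase), ...]) pair per frame,
            where list entry h describes harmonic (h + 1) * pitch_hz.
    """
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for j, (pitch_hz, line_spectrum) in enumerate(frames):
        start = j * hop
        for n in range(frame_len):
            t = (start + n) / sample_rate  # absolute time (assumption)
            s = sum(
                amp * math.cos(2.0 * math.pi * (h + 1) * pitch_hz * t + ph)
                for h, (amp, ph) in enumerate(line_spectrum)
            )
            out[start + n] += hann(n, frame_len) * s
    return out

# Four identical frames of a single 100 Hz sine wave at 8 kHz sampling.
frames = [(100.0, [(1.0, 0.0)])] * 4
signal = overlap_add(frames, frame_len=160, hop=80, sample_rate=8000)
```

Where two frames overlap fully, the windows sum to one and the output equals the underlying sine wave; only the first and last half-frames are attenuated.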
The line spectrum can be viewed as a sequence of samples, at multiples of the pitch frequency, of a spectral envelope representing the utterance for the given instant. The spectral envelope represents the Fourier transform of the infinite impulse response of the mouth while pronouncing that utterance. The essential fact about a line spectrum is that if it represents a perfectly periodic signal whose period is the pitch, the individual sine waves corresponding to particular frequency components over successive frames are aligned, i.e. they have precisely the same value at every given point in time, independent of the source frame. For a real speech signal, the pitch varies from one frame to another. For this reason, the sine waves resulting from the same frequency component for successive frames are only approximately aligned. This is in contrast to the sine waves corresponding to components of the discrete Fourier transform, which are not necessarily aligned individually from one frame to the next. For unvoiced intervals, a pitch equal to the Fourier analysis interval is arbitrarily assumed. It is also known that, given only the set of absolute values of the line spectral coefficients, there are a number of ways to generate phases (McAulay [3], [5]), so that the signal reproduced from the line spectrum having the given amplitudes and the computed phases will produce speech of very acceptable resemblance to the original signal.
Given any approximation of the spectral envelope, a common way to compute features is the so-called Mel Cepstrum. The Mel Cepstrum is defined through a discrete cosine transform (DCT) on the log Mel Spectrum. The Mel Spectrum is defined by a collection of windows, where the ith window (i=0,1,2, . . . ) is centered at frequency f(i) where f(i)=MEL(a·i) and f(i+1) > f(i). The function MEL(f) is a convex non-linear function of f whose derivative increases rapidly with f. The numbers (a·i) can be viewed as representing Mel Frequencies. The value of a is chosen so that if N is the total number of Mel frequencies, MEL(a·N) is the Nyquist frequency of the speech signal. The window used to generate the ith component of the Mel Spectrum is defined to have its support on the interval [f(i−1),f(i+1)] and to be a hat function consisting of two segments, which are linear in Mel frequency: the first ascending from f(i−1) to f(i), and the second descending from f(i) to f(i+1). The value of the ith component of the Mel Spectrum is obtained by multiplying the ith window by the absolute value of a discretely sampled estimate of the spectral envelope, and summing the result. The resulting components can be viewed as partitioning the spectrum into frequency bins that group together the spectral components within the window through the weighted summation. To obtain the Mel Cepstrum, the bins are increased if necessary to be always larger than some small number, and the log of the result is taken. The discrete cosine transform of the sequence of logs is computed, and the first L transform coefficients (L≤N) are used to represent the Mel Cepstrum.
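The hat-shaped Mel windows described above might be constructed as in the following sketch. The text does not give a formula for MEL, so the widely used mapping 700·(10^(m/2595) − 1) is assumed here purely for illustration; the function names, the uniform frequency sampling, and the decision to build only windows i = 1 .. N−1 (those whose support lies inside [0, Nyquist]) are likewise assumptions.

```python
import math

def mel_to_hz(m):
    """Assumed form of MEL(.): convex, with a derivative that grows with m."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_mel(f):
    """Inverse of the assumed MEL mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_hat_windows(num_mel, nyquist_hz, num_samples):
    """Hat windows that are piecewise linear in Mel frequency.

    The step a is chosen so that MEL(a * num_mel) equals the Nyquist
    frequency. Window i peaks at f(i) = MEL(a * i) and vanishes outside
    [f(i-1), f(i+1)].
    """
    a = hz_to_mel(nyquist_hz) / num_mel
    freqs = [nyquist_hz * k / (num_samples - 1) for k in range(num_samples)]
    windows = []
    for i in range(1, num_mel):
        # Position of each frequency sample measured in Mel steps.
        row = [max(0.0, 1.0 - abs(hz_to_mel(f) / a - i)) for f in freqs]
        windows.append(row)
    centers = [mel_to_hz(a * i) for i in range(1, num_mel)]
    return freqs, centers, windows

freqs, centers, windows = mel_hat_windows(num_mel=8, nyquist_hz=4000.0, num_samples=257)
```

Because adjacent hats cross at half height in Mel frequency, the windows sum to one everywhere between the first and last centers, and the centers themselves grow further apart in Hz as i increases.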
From what is said above, in order to reproduce the signal from the Mel Cepstrum, it is necessary to estimate the absolute values of the line spectrum, combine those with the synthetically generated phases, sum up the sine components, multiply that sum by the time window and overlap add the results. What is needed therefore is a way to obtain the line spectrum from the Mel-Cepstrum.
Tokuda et al. [1] propose a procedure for reproducing the spectrum from the Mel Cepstrum. However, their definition of the Mel Cepstrum is rather restrictive and is not in line with some of the features used in today's speech recognition systems. Rather than performing a simple integration on the spectrum of the signal, their definition is based on an iterative procedure that is optimal in terms of some error measure. As currently defined, the spectral estimation procedure they propose leaves no latitude for other methods of computing the cepstrum.
Stylianou et al. [2] also present a technique for spectral reconstruction from cepstral like parameters. Again the definition of Cepstrum is quite specific, and is chosen to allow spectral reconstruction a priori rather than use very simply computed integrated Mel Cepstral parameters which are presently in use in many speech recognition systems.
It is therefore an object of the invention to provide an improved method for spectral reconstruction from Cepstral-like parameters that can use a wide class of spectral representations including those commonly used in today's speech recognition systems.
This object is realized in accordance with a broad aspect of the invention by a speech reconstruction method for converting a series of binned spectra or functions thereof, which will be referred to as "feature vectors", and a series of respective pitch values and voicing decisions of an original input speech signal into a speech signal, the feature vectors being obtained as follows:
(i) deriving at successive instances of time an estimate of a spectral envelope SE(i), i being a frequency index, of the digitized original speech signal,
(ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, BW(i,k), i being a frequency index and k being the window function index, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, according to the expression:
$$BI(k) = \sum_{i} SE(i) \cdot BW(i,k),$$
where BI(k) is defined as the kth component of a "binned spectrum", and
(iii) assigning said integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;
said speech reconstruction method comprising:
(a) converting each feature vector into a binned spectrum in some consistent manner,
(b) generating harmonic frequencies and weights according to the corresponding pitch and voicing decision,
(c) generating for each harmonic frequency a respective phase, depending on the corresponding pitch value and voicing decision and possibly on the binned spectrum,
(d) sampling each of the basis functions at all harmonic frequencies which are within its support, the support of the basis functions being bounded, and multiplying by the respective harmonic weight, so as to produce for each sampled basis function a respective line spectrum having multiple components,
(e) combining each component of each respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function,
(f) generating gain coefficients of the basis functions,
(g) multiplying each of the points of the complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all resulting complex line spectra to generate a single complex line spectrum having a respective component for each of the harmonic frequencies, and
(h) generating a time signal from complex line spectra computed at successive instances of time.
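Steps (b), (d), (e) and (g) above can be illustrated with the following sketch. It is not the claimed implementation: the hat-shaped basis functions, the way harmonics are enumerated up to the Nyquist frequency, and all function and parameter names are assumptions, and the gain coefficients of step (f) and the phases of step (c) are simply taken as given inputs.

```python
import cmath

def hat(center_hz, half_width_hz):
    """Hypothetical basis function with bounded support around center_hz."""
    def basis(f):
        return max(0.0, 1.0 - abs(f - center_hz) / half_width_hz)
    return basis

def complex_line_spectrum(pitch_hz, nyquist_hz, basis_functions, gains, phases, weights):
    """Build one frame's complex line spectrum from sampled basis functions.

    gains   : one non-negative gain per basis function (step (f), given).
    phases  : one phase per harmonic (step (c), given).
    weights : one weight per harmonic (step (b), given).
    """
    num_harmonics = int(nyquist_hz // pitch_hz)
    harmonics = [pitch_hz * (h + 1) for h in range(num_harmonics)]  # step (b)
    total = [0j] * num_harmonics
    for basis, gain in zip(basis_functions, gains):
        for h, f in enumerate(harmonics):
            amp = basis(f) * weights[h]                 # step (d): sample and weight
            if amp == 0.0:
                continue                                # outside the basis support
            component = cmath.rect(amp, phases[h])      # step (e): attach the phase
            total[h] += gain * component                # step (g): scale and sum
    return harmonics, total

bases = [hat(200.0, 150.0), hat(350.0, 150.0)]
harmonics, spectrum = complex_line_spectrum(
    pitch_hz=100.0, nyquist_hz=450.0, basis_functions=bases,
    gains=[2.0, 1.0], phases=[0.0] * 4, weights=[1.0] * 4,
)
```

Because every basis function has bounded support, each harmonic receives contributions from only a few basis functions, which keeps the linear combination in step (g) sparse.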
The principal novelty of the invention resides in the representation of the line spectrum of the output signal spectrum in terms of a non-negative linear combination of sampled narrow support basis functions, whilst maintaining the condition that the reproduced spectrum will have bins that are close to those of the original signal. This also embraces the particular case in which the envelope is computed by simply taking the absolute values of the Fourier transform of a windowed segment of the signal, wherein that same process is mimicked in the generation of the equations expressing the condition that the bins of the result are close to those of the original signal.
In the preferred embodiment described below, the complex spectrum of each basis function is converted to a windowed discrete Fourier transform. This is done by a convolution with the analysis window Fourier transform. Consequently, the linear combination at step (g) above is carried out directly on the windowed DFTs, to produce a windowed DFT, corresponding to a single frame of speech.