Automatic Speech Recognition (ASR) systems that convert speech to text typically comprise two main processing stages, often referred to as the “front-end” and the “back-end.” The front-end converts digitized speech into a set of features, usually computed at regular intervals, that represent the spectral content of the speech signal. The features are then converted to text at the back-end.
During feature extraction the speech signal is typically divided into overlapping frames, with each frame having a predefined duration. A feature vector, typically having a predefined number of features, is then calculated for each frame. In most ASR systems a feature vector is obtained by:
a) deriving an estimate of the spectral envelope corresponding to the frame;
b) multiplying the estimate of the spectral envelope by a predetermined set of frequency domain weighting functions, where each weighting function is non-zero over a narrow range of frequencies, known as the frequency channel, and computing the integrals thereof, known as bins, to form a binned spectrum; and
c) assigning the computed integrals or a set of pre-determined functions thereof to respective components of the feature vector.
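The framing and steps b) and c) above can be sketched as follows. This is a minimal illustration rather than any particular system's front-end: the function names `split_into_frames` and `feature_vector` are hypothetical, the spectral envelope of step a) is assumed to be supplied already sampled at K frequencies, and the binned spectrum is used directly as the feature vector.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Divide a signal into overlapping frames of frame_len samples.

    Consecutive frames start hop_len samples apart, so they overlap
    whenever hop_len < frame_len. A trailing partial frame is dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]


def feature_vector(envelope, weighting_functions):
    """Steps b) and c): weight the spectral envelope and sum per channel.

    envelope            -- spectral envelope estimate sampled at K frequencies
    weighting_functions -- one length-K list of weights per frequency channel
    Returns the binned spectrum, used here directly as the feature vector.
    """
    return [sum(w * e for w, e in zip(weights, envelope))
            for weights in weighting_functions]
```

For example, `split_into_frames(list(range(10)), frame_len=4, hop_len=2)` yields four frames, each sharing two samples with its neighbor.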
Many ASR systems employ speech recognition features known as Mel Frequency Cepstral Coefficients (MFCC) that are obtained by employing specific frequency domain weighting functions at step b) and computing a cosine transform of the logarithm of the binned spectrum at step c). Typically, the spectral envelope estimate at step a) is represented by the Amplitude Short Time Spectrum (ASTS) or Power Short Time Spectrum (PSTS) of the frame. The ASTS and PSTS are obtained as the absolute values and squared absolute values, respectively, of the Short Time Fourier Transform (STFT) applied to the frame, where the frame is multiplied by a smooth windowing function, such as a Hamming window, and then transformed using the Discrete Fourier Transform (DFT). The frequency channels used in step b) typically overlap, and a frequency channel with a higher channel number has a greater width than a frequency channel with a lower channel number. A Mel transform function Mel(f) of the frequency axis may be used to define the frequency channels, where Mel(f) is a concave non-linear function of f whose derivative decreases as f increases. A typical example is Mel(f)=2595*log10(1+f/700), where f is a frequency in Hz. A set of equidistant points mfi, i=0, . . . , N+1, are defined on the mel-frequency interval [Mel(fstart), Mel(fNyquist)] as follows:
$$mf_i = \mathrm{Mel}(f_{start}) + i \times \frac{\mathrm{Mel}(f_{Nyquist}) - \mathrm{Mel}(f_{start})}{N+1}$$

where f_start is a starting point of the frequency analysis interval, such as 64 Hz, and f_Nyquist is the Nyquist frequency of the speech signal. The frequency channel used to generate the ith bin value is [f_(i−1), f_(i+1)], where i=1, 2, . . . , N, and the f_i are given by the inverse Mel transform f_i=Mel^(−1)(mf_i). The corresponding frequency weighting function, called a Mel filter, is defined to be a hat function having two segments that are linear in Mel frequency. The first segment ascends from f_(i−1) to f_i, while the second segment descends from f_i to f_(i+1). The weighting functions are sampled at DFT points. The value of the ith bin is obtained by multiplying the ith weighting function by the values of the discretely sampled estimate of the spectral envelope, and summing the result. This process is called Mel filtering. The resulting components partition the spectrum into frequency bins that group together the spectral components within the channel through weighted summation. To obtain the Mel Cepstrum, any bin value below some small number such as b^(−50), where b is the base of the logarithm operation, i.e. 10 or e, is raised to that number, and the log of the result is taken. The DCT of the sequence of logs is then computed, and the first L transform coefficients, where L ≤ N, are assigned to corresponding coordinates of the MFCC vector {C_0, C_1, C_2, . . . , C_(L−1)}, which is used by the ASR back-end.
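The Mel filtering and cepstrum computation described above can be sketched in a few functions. This is an illustrative sketch under stated assumptions, not a reference front-end: all function and parameter names are hypothetical, the spectrum is assumed to be sampled at evenly spaced DFT frequencies from 0 Hz to the Nyquist frequency inclusive, the logarithm is taken to base 10 with a floor of 10^(−50), and the DCT is assumed to be an unnormalized DCT-II, since the text does not fix a DCT convention.

```python
import math


def mel(f):
    """Mel transform of a frequency f in Hz (the typical formula in the text)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_inverse(m):
    """Inverse Mel transform, returning a frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_points(f_start, f_nyquist, n_channels):
    """The N + 2 equidistant points mf_0 .. mf_(N+1) on the Mel axis."""
    lo, hi = mel(f_start), mel(f_nyquist)
    step = (hi - lo) / (n_channels + 1)
    return [lo + i * step for i in range(n_channels + 2)]


def mel_filterbank(f_start, f_nyquist, n_channels, n_dft_points):
    """Hat-shaped weighting functions, linear in Mel frequency, sampled at
    n_dft_points frequencies evenly spaced over [0, f_nyquist] (assumed)."""
    mf = mel_points(f_start, f_nyquist, n_channels)
    dft_freqs = [k * f_nyquist / (n_dft_points - 1) for k in range(n_dft_points)]
    filters = []
    for i in range(1, n_channels + 1):
        weights = []
        for f in dft_freqs:
            m = mel(f)
            if mf[i - 1] < m <= mf[i]:      # ascending segment
                weights.append((m - mf[i - 1]) / (mf[i] - mf[i - 1]))
            elif mf[i] < m < mf[i + 1]:     # descending segment
                weights.append((mf[i + 1] - m) / (mf[i + 1] - mf[i]))
            else:                           # outside the frequency channel
                weights.append(0.0)
        filters.append(weights)
    return filters


def mfcc(spectrum, filters, n_cepstra, log_floor=1e-50):
    """Mel filtering, flooring, base-10 logarithm, then the first L DCT terms."""
    n = len(filters)
    if n_cepstra > n:
        raise ValueError("L must not exceed N")
    bins = [sum(w * s for w, s in zip(weights, spectrum)) for weights in filters]
    logs = [math.log10(max(b, log_floor)) for b in bins]
    # Unnormalized DCT-II (an assumed convention, see the note above).
    return [sum(logs[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n_cepstra)]
```

A call such as `mfcc(spectrum, mel_filterbank(64.0, 4000.0, 23, 129), 13)` would then return 13 coefficients for a 129-point spectrum of a signal sampled at 8 kHz. The choice of 129 points is only an example.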
While various MFCC front-end schemes might employ different spectral envelope estimation techniques, Mel function definitions, numbers N of frequency channels, etc., the maximal dimension N of an MFCC vector is equal to the number of frequency domain weighting functions or the number of bin values. The starting coordinates of the MFCC vector, referred to as low-order cepstra (LOC), generally reflect the global shape of the spectral envelope, while the ending coordinates, referred to as high-order cepstra (HOC), typically have relatively small values, and generally reflect the rapidly-varying-in-frequency nature of the spectrum. It has been observed that in small vocabulary recognition tasks the recognition accuracy is virtually unaffected when L≅N/2, i.e., when the MFCC vector is truncated by 50%.
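The split between low-order and high-order cepstra can be illustrated with two synthetic log binned spectra. The sketch below assumes an unnormalized DCT-II and N = 23 channels. Both input sequences are hypothetical and chosen only to contrast a smooth global tilt with a rapid channel-to-channel ripple, and `dominant_order` is a helper introduced for this illustration.

```python
import math


def dct(values):
    """Unnormalized DCT-II of a sequence (assumed convention)."""
    n = len(values)
    return [sum(values[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n)]


def dominant_order(cepstra, first=0):
    """Index of the largest-magnitude coefficient at or above order `first`."""
    return max(range(first, len(cepstra)), key=lambda k: abs(cepstra[k]))


N = 23
# Hypothetical smooth spectral tilt: varies slowly across the channels.
tilt = [5.0 - 0.1 * j for j in range(N)]
# Hypothetical ripple: alternates sign from one channel to the next.
ripple = [0.05 * (-1) ** j for j in range(N)]

# Skip C0 for the tilt, since C0 only carries the overall level.
tilt_order = dominant_order(dct(tilt), first=1)
ripple_order = dominant_order(dct(ripple))
```

Under these assumptions the tilt concentrates at the low-order end of the cepstrum and the ripple at the high-order end, which is why truncating the vector mainly discards fine spectral detail.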
In some ASR systems, the recording of a speech signal and the subsequent speech recognition are performed by processors at separate locations, for example where a speech signal is recorded at a client device, such as a cell phone, and processed at an ASR server. Audio information that is captured at a client device is often transmitted to a server over a communications channel. Typically, and especially where the client and server communicate via a wireless network, it is not feasible to transmit the entire speech signal due to communications channel bandwidth limitations. Therefore, the speech signal is typically compressed. However, it is imperative that the compression scheme does not significantly reduce the recognition rate at the server. Thus, in some systems a compressed version of the recognition features themselves is transmitted to the server. Since redundant information has been already removed in generating these features, an optimal compression rate can be attained.
In one such implementation of recording and performing speech recognition at different locations, known as Distributed Speech Recognition (DSR), a client device performs front-end speech processing where features are extracted, compressed, and transmitted via a communications channel to a server, which then performs back-end speech processing including speech-to-text conversion. In order to conserve bandwidth, MFCC vectors are often truncated in DSR systems prior to transmission. For example, the ETSI DSR standards ES 201 108 (April 2000) and ES 202 050 (July 2002) define two different front-end feature extraction and compression algorithms employing MFCC vectors where only 13 cepstra (L=13) out of 23 (N=23) are transmitted to the server for ASR back-end processing.
In some DSR systems, speech reconstruction and playback capabilities are required at the server. Where pitch is derived for each frame during speech processing, various techniques may be used to synthesize a speech signal using MFCC vectors and pitch. Unfortunately, while truncated MFCC vectors are suitable for speech recognition, speech reconstruction quality suffers significantly where truncated MFCC vectors are employed. Truncated MFCC vectors reduce the accuracy of spectra estimation, resulting in reconstructed speech having a “mechanical” sound quality. Therefore, a method for restoring high-order Mel frequency cepstral coefficients of truncated MFCC vectors would be advantageous.
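The loss of spectral accuracy caused by truncation can be made concrete with a naive baseline: pad the truncated vector with zeros in place of the missing high-order cepstra and invert the DCT. This sketch is only that baseline, shown to illustrate the problem, and is not a restoration method. It assumes an unnormalized DCT-II, and the function names and the synthetic log binned spectrum are hypothetical.

```python
import math


def dct(values):
    """Unnormalized DCT-II of a sequence (assumed convention)."""
    n = len(values)
    return [sum(values[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n)]


def inverse_dct(cepstra):
    """Inverse of the unnormalized DCT-II defined above."""
    n = len(cepstra)
    return [(cepstra[0]
             + 2.0 * sum(cepstra[k] * math.cos(math.pi * k * (j + 0.5) / n)
                         for k in range(1, n))) / n
            for j in range(n)]


def zero_pad_reconstruct(truncated, n_channels):
    """Replace the missing high-order cepstra with zeros, then invert."""
    padded = list(truncated) + [0.0] * (n_channels - len(truncated))
    return inverse_dct(padded)


def max_abs_error(reference, estimate):
    """Largest absolute difference between two equal-length sequences."""
    return max(abs(a - b) for a, b in zip(reference, estimate))


# Hypothetical log binned spectrum: smooth tilt plus rapid ripple, N = 23.
log_bins = [5.0 - 0.1 * j + 0.2 * (-1) ** j for j in range(23)]
cepstra = dct(log_bins)

# Keep L = 13 of N = 23 cepstra, as in the ETSI example in the text.
smoothed = zero_pad_reconstruct(cepstra[:13], 23)
error = max_abs_error(log_bins, smoothed)
```

With all 23 coefficients the inversion recovers the log bins up to rounding, whereas with 13 the ripple is largely lost and `error` is clearly non-zero.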