Automatic speech recognition (ASR) is the method of automatically recognizing the nature of oral instructions based on the information included in speech waves. ASR has ushered in a new generation of security devices based on oral, rather than physical, keys and has made possible a whole range of “no-hands” or “hands-free” features, such as voice dialing and information retrieval by voice.
At the highest level, all ASR systems process speech in two stages: feature extraction (also known as the signal-processing front end) and feature matching (also known as the signal-processing back end). Feature extraction is the method by which a small amount of data is extracted from a speech input to represent that input. Feature matching is the method by which the nature of the instructions contained in the speech input is identified by comparing the extracted data with a known data set. In a standard ASR system, a single processing unit carries out both of these functions.
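The two stages can be illustrated with a deliberately tiny sketch. It is not the front end or back end of any real recognizer: the single log-energy feature per frame, the nearest-template matcher, and the names `extract_features` and `match_features` are all assumptions chosen for illustration.

```python
import math

def extract_features(samples, frame_len=4):
    """Feature extraction: reduce each frame of samples to one log-energy value."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features

def match_features(features, templates):
    """Feature matching: return the label of the closest known template
    (squared Euclidean distance; a toy stand-in for a real recognizer)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda label: distance(features, templates[label]))
```

In this toy, a handful of numbers per utterance stands in for the speech input, and matching is a simple lookup against the known data set held in `templates`.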
However, the performance of an ASR system whose input is speech transmitted, for example, over a mobile or wireless channel may be significantly degraded compared with that of an ASR system that uses the original, unmodified speech as its input. This degradation may be caused by distortions introduced into the transmitted speech by the coding algorithm as well as by channel transmission errors.
A distributed speech recognition (DSR) system attempts to correct the system performance degradation caused by transmitted speech by separating feature extraction from feature matching and having the two methods executed by two different processing units disposed at two different locations. For example, in a DSR mobile or wireless communications system or network including a first communication device (e.g., a mobile unit) and a second communication device (e.g., a server), the mobile unit performs only feature extraction, i.e., the mobile unit extracts and encodes recognition features from the speech input. The mobile unit then transmits the encoded features over an error-protected data channel to the server. The server receives the encoded recognition features, and performs only feature matching, i.e., the server matches the encoded features to those in a known data set.
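The division of labor between the mobile unit and the server can be sketched as follows. This is a minimal illustration under stated assumptions: uniform scalar quantization and a CRC-32 trailer stand in for whatever compression and error protection a real DSR system uses, and the names `encode_features` and `decode_features` are hypothetical.

```python
import struct
import zlib

def encode_features(features, step=0.1):
    """Mobile side: quantize each feature to a signed 16-bit index
    and append a CRC-32 so the server can detect channel errors."""
    indices = [max(-32768, min(32767, round(f / step))) for f in features]
    payload = struct.pack(f">{len(indices)}h", *indices)
    return payload + struct.pack(">I", zlib.crc32(payload))

def decode_features(packet, step=0.1):
    """Server side: verify the CRC, then recover the quantized features
    that would be handed to feature matching."""
    payload = packet[:-4]
    (crc,) = struct.unpack(">I", packet[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("channel error detected")
    indices = struct.unpack(f">{len(payload) // 2}h", payload)
    return [i * step for i in indices]
```

Only the compact feature packet crosses the channel; the speech waveform itself is never coded or transmitted, which is why speech-codec distortion does not reach the recognizer.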
With this approach, coding distortions are minimized, and transmission channel errors have very little effect on the recognition system performance. Moreover, the mobile unit has to perform only the relatively computationally inexpensive feature extraction, leaving the more complex, expensive feature matching to the server. By reserving the more computationally complex activities to the server processor, greater design flexibility is preserved for the mobile unit processor, where processor size and speed typically are at a premium given the recent emphasis on unit miniaturization.
The European Telecommunications Standards Institute (ETSI) recently published a standard for DSR feature extraction and compression algorithms. European Telecommunications Standards Institute Standard ES 201 108, Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms, Ver. 1.1.2, April 2000 (hereinafter “ETSI Front-End Standard”), hereby incorporated by reference in its entirety. While several methods, such as Linear Prediction (LP), exist for encoding data from a speech input, the ETSI Front-End Standard includes a feature extraction algorithm that extracts and encodes the speech input as a log-energy value and a series of Mel-frequency cepstral coefficients (MFCC) for each frame. These parameters essentially capture the spectral envelope information of the speech input, and are commonly used in most large vocabulary speech recognizers. The ETSI Front-End Standard further includes algorithms for compression (by vector quantization) and error-protection (cyclic redundancy check codes). The ETSI Front-End Standard also describes suitable algorithms for bit stream decoding and channel error mitigation. At an update interval of 10 ms and with the addition of synchronization and header information, the data transmission rate works out to 4800 bits per second.
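The 4800 bits per second figure can be checked with a short calculation. The 10 ms update interval and the presence of synchronization and header information come from the text above; the per-frame bit count, the CRC size, and the multiframe length below are assumptions recalled from the standard's framing description and should be verified against ES 201 108 itself.

```python
FRAME_INTERVAL_MS = 10        # update interval stated above
BITS_PER_FRAME = 44           # assumed: vector-quantized features per frame
CRC_BITS_PER_FRAME_PAIR = 4   # assumed: one CRC protects two frames
FRAMES_PER_MULTIFRAME = 24    # assumed: frames grouped per multiframe
SYNC_AND_HEADER_BITS = 48     # assumed: sync sequence plus header field

def dsr_bit_rate():
    """Return the channel bit rate in bits per second for one multiframe."""
    frame_pairs = FRAMES_PER_MULTIFRAME // 2
    payload_bits = frame_pairs * (2 * BITS_PER_FRAME + CRC_BITS_PER_FRAME_PAIR)
    total_bits = payload_bits + SYNC_AND_HEADER_BITS
    duration_ms = FRAMES_PER_MULTIFRAME * FRAME_INTERVAL_MS
    return total_bits * 1000 / duration_ms
```

Under these assumptions the features alone account for 4400 bits per second, and the CRC, synchronization, and header overhead bring the total to 4800.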
More recently, the European Telecommunications Standards Institute (ETSI) has published another standard for DSR feature extraction and compression algorithms. European Telecommunications Standards Institute Standard ES 202 050, Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Advanced Front-end feature extraction algorithm; Compression algorithms, Ver. 1.1.1, July 2002 (hereinafter “ETSI Advanced Front-End Standard”), hereby incorporated by reference in its entirety. The ETSI Advanced Front-End Standard is quite similar to the ETSI Front-End Standard in terms of the features extracted, the bit rate, and so on, but is more noise-robust. That is, the ETSI Advanced Front-End Standard provides better performance under noisy background conditions.
In summary, a DSR system, such as one designed in accordance with the ETSI Front-End Standard (or the ETSI Advanced Front-End Standard), offers many advantages for mobile communications network implementation. Such a system may provide equivalent recognition performance to an ASR system, but with a low complexity front-end that may be incorporated in a mobile unit and a low bandwidth requirement for the transmission of the coded recognition features.
DSR systems have the drawback that the original speech input is not available at the back end for storage and/or verification purposes. It would be helpful to have the original speech input available for: (i) back end applications that require human assistance, e.g., to permit hand correction of documents generated using remote dictation systems by allowing comparison of the document to the original speech input or to permit smooth transition when a recognition task is handed over from a DSR system to a human operator; (ii) prophylactic storage of legally sensitive information, e.g., to record the exact statements made during financial transactions such as the placement of a securities order; and (iii) validation of utterances during database collection, e.g., for training the recognizer in batch mode (and especially incremental mode) and system tune-up.
On the other hand, the original speech is available at the back end if a standard ASR system is used. However, as noted above, a standard ASR system suffers significant distortion when used in a mobile or wireless application. To address this issue, U.S. Patent Application Publication No. 2002/0147579 (which is incorporated by reference herein) provides a method for speech reconstruction at the back end using a sinusoidal vocoder. In accordance with the '579 application, the 13 transmitted MFCCs (C0–C12) are transformed into harmonic magnitudes that are used in speech reconstruction.
The above technique for transforming MFCCs into harmonic magnitudes works fairly well. The speech reconstructed by a sinusoidal coder using these transformed magnitudes is highly intelligible and of reasonable quality. However, it is apparent that the reconstruction performance (in terms of speech intelligibility and quality) would be better if all 23 MFCC values (C0–C22) were available instead of only the 13 transmitted values, viz., C0–C12. Therefore, a need exists for a method and apparatus for speech reconstruction within a distributed speech recognition system that makes use of the missing MFCC values to improve speech reconstruction.
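The cost of having only 13 of the 23 values can be illustrated numerically. The sketch below assumes that the MFCCs are an unnormalized discrete cosine transform of 23 log filter-bank energies and that missing coefficients are simply treated as zero on inversion; it does not reproduce the harmonic-magnitude transformation of the '579 application, and the function names are hypothetical.

```python
import math

N_BANDS = 23  # assumed number of filter-bank channels, matching C0..C22

def dct(log_energies):
    """Unnormalized DCT-II: log filter-bank energies to cepstral coefficients."""
    n = len(log_energies)
    return [sum(f * math.cos(math.pi * i * (j + 0.5) / n)
                for j, f in enumerate(log_energies))
            for i in range(n)]

def inverse_dct(cepstra, n=N_BANDS):
    """Invert the DCT above, treating any missing coefficients as zero."""
    padded = list(cepstra) + [0.0] * (n - len(cepstra))
    return [(padded[0] + 2 * sum(padded[i] * math.cos(math.pi * i * (j + 0.5) / n)
                                 for i in range(1, n))) / n
            for j in range(n)]

def reconstruction_error(log_energies, kept):
    """Largest absolute error when only the first `kept` coefficients survive."""
    cepstra = dct(log_energies)[:kept]
    rebuilt = inverse_dct(cepstra, len(log_energies))
    return max(abs(a - b) for a, b in zip(log_energies, rebuilt))
```

With all 23 coefficients the log spectrum is recovered up to rounding error, whereas keeping only the first 13 smooths away fine spectral detail. That lost detail is what recovering the missing values C13–C22 would restore.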