The present invention is directed to a method and apparatus for speech reconstruction, and, in particular, a method and apparatus for speech reconstruction in a distributed speech recognition system.
Automatic speech recognition (ASR) is the method of automatically recognizing the nature of oral instructions based on the information included in speech waves. ASR has ushered in a new generation of security devices based on oral, rather than physical, keys and has made possible a whole range of “no-hands” or “hands-free” features, such as voice dialing and information retrieval by voice.
At the highest level, all ASR systems process speech for feature extraction (also known as the signal-processing front end) and feature matching (also known as the signal-processing back end). Feature extraction is the method by which a small amount of data is extracted from a speech input to represent the speech input. Feature matching is the method by which the nature of the instructions contained in the speech input is identified by comparing the extracted data with a known data set. In a standard ASR system, a single processing unit carries out both of these functions.
The performance of an ASR system that uses speech transmitted, for example, over a mobile or wireless channel as an input, however, may be significantly degraded as compared with the performance of an ASR system that uses the original unmodified speech as the input. This degradation in system performance may be caused by distortions introduced in the transmitted speech by the coding algorithm as well as channel transmission errors.
A distributed speech recognition (DSR) system attempts to correct the system performance degradation caused by transmitted speech by separating feature extraction from feature matching and having the two methods executed by two different processing units disposed at two different locations. For example, in a DSR mobile or wireless communications system or network including a first communication device (e.g., a mobile unit) and a second communication device (e.g., a server), the mobile unit performs only feature extraction, i.e., the mobile unit extracts and encodes recognition features from the speech input. The mobile unit then transmits the encoded features over an error-protected data channel to the server. The server receives the encoded recognition features, and performs only feature matching, i.e., the server matches the encoded features to those in a known data set.
With this approach, coding distortions are minimized, and transmission channel errors have very little effect on the recognition system performance. Moreover, the mobile unit has to perform only the relatively computationally inexpensive feature extraction, leaving the more complex, expensive feature matching to the server. By reserving the more computationally complex activities to the server processor, greater design flexibility is preserved for the mobile unit processor, where processor size and speed typically are at a premium given the recent emphasis on unit miniaturization.
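The division of labor described above can be illustrated with a minimal sketch. It is not the ETSI front end or any particular recognizer: the two toy features (frame energy and zero-crossing count), the JSON payload, the CRC-32 checksum, and the nearest-template matcher are all assumptions chosen only to show which steps run on the mobile unit and which run on the server.

```python
import json
import zlib

def extract_features(samples, frame_len=80):
    """Mobile unit: reduce each frame to a tiny feature vector (toy features, assumed)."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append([round(energy, 4), crossings])
    return features

def encode(features):
    """Mobile unit: pack the features and append a 4-byte CRC for error detection."""
    payload = json.dumps(features).encode("utf-8")
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def decode(packet):
    """Server: verify the CRC, then unpack the features."""
    payload, received_crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    if zlib.crc32(payload) != received_crc:
        raise ValueError("packet failed CRC check")
    return json.loads(payload.decode("utf-8"))

def match(features, templates):
    """Server: return the label of the template closest to the received features."""
    def distance(a, b):
        return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    return min(templates, key=lambda label: distance(features, templates[label]))

def make_tone(amplitude, period, n=160):
    """Hypothetical test signal: a square wave standing in for speech."""
    return [amplitude if (i // period) % 2 == 0 else -amplitude for i in range(n)]

# Known data set held by the server (hypothetical labels).
templates = {
    "low": extract_features(make_tone(0.2, 8)),
    "high": extract_features(make_tone(0.8, 2)),
}

packet = encode(extract_features(make_tone(0.75, 2)))  # runs on the mobile unit
label = match(decode(packet), templates)               # runs on the server
```

Only `packet` crosses the channel, so the server never sees the waveform itself, and a corrupted packet is rejected by the CRC check rather than passed on to the matcher.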
The European Telecommunications Standards Institute (ETSI) recently published a standard for DSR feature extraction and compression algorithms. See European Telecommunications Standards Institute Standard ES 201 108, Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms, Ver. 1.1.2, April 2000 (hereinafter “ETSI Standard”), which is hereby incorporated by reference in its entirety. While several methods, such as Linear Prediction (LP), exist for encoding data from a speech input, the ETSI Standard includes a feature extraction algorithm that extracts and encodes the speech input as a log-energy value and a series of Mel-frequency cepstral coefficients (MFCC) for each frame. These parameters essentially capture the spectral envelope information of the speech input, and are used in most large vocabulary speech recognizers. The ETSI Standard further includes algorithms for compression (by vector quantization) and error-protection (cyclic redundancy check codes). The ETSI Standard also describes suitable algorithms for bit stream decoding and channel error mitigation. At an update interval of 10 ms and with the addition of synchronization and header information, the data transmission rate works out to 4800 bits per second.
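As a rough illustration of what a log-energy value and a set of MFCCs for one frame look like, the following sketch computes both with NumPy. It is a generic textbook-style computation, not the algorithm of the ETSI Standard, and it omits steps a standardized front end would include. The sampling rate, frame length, FFT size, number of mel filters, number of coefficients, and all function names are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def extract_features(frame, sample_rate=8000, n_fft=256, n_filters=23, n_ceps=13):
    """Return (log_energy, mfcc) for one frame; all defaults are assumed values."""
    log_energy = np.log(max(np.sum(frame ** 2), 1e-10))
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft))            # magnitude spectrum
    fbank_out = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_fbank = np.log(np.maximum(fbank_out, 1e-10))           # floor avoids log(0)
    # DCT-II of the log filterbank outputs gives the cepstral coefficients.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return log_energy, basis @ log_fbank

# One 25 ms frame of a 440 Hz tone at 8 kHz, standing in for a speech frame.
t = np.arange(200) / 8000.0
frame = np.sin(2 * np.pi * 440.0 * t)
log_energy, mfcc = extract_features(frame)
```

The result is one scalar plus a short coefficient vector per frame, which is why these parameters describe the spectral envelope compactly but do not by themselves retain the full waveform.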
In summary, a DSR system, such as one designed in accordance with the ETSI Standard, offers many advantages for mobile communications network implementation. Such a system may provide equivalent recognition performance to an ASR system, but with a low complexity front-end that may be incorporated in a mobile unit and a low bandwidth requirement for the transmission of the coded recognition features.
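The low bandwidth figure can be sanity-checked with simple arithmetic. The text above gives only the 10 ms update interval and the 4800 bits per second total; the per-field bit counts in the sketch below (bits per frame, CRC bits per frame pair, multiframe size, synchronization and header sizes) are assumptions about the framing and should be verified against the ETSI Standard before being relied upon.

```python
# All field sizes below are assumptions, not values stated in this document.
FRAME_INTERVAL_MS = 10        # update interval (from the text)
BITS_PER_FRAME = 44           # assumed: quantized features for one frame
CRC_BITS_PER_PAIR = 4         # assumed: CRC protecting each pair of frames
PAIRS_PER_MULTIFRAME = 12     # assumed: 24 frames per multiframe
SYNC_BITS = 16                # assumed: synchronization sequence
HEADER_BITS = 32              # assumed: header field

pair_bits = 2 * BITS_PER_FRAME + CRC_BITS_PER_PAIR
multiframe_bits = SYNC_BITS + HEADER_BITS + PAIRS_PER_MULTIFRAME * pair_bits
multiframe_ms = 2 * PAIRS_PER_MULTIFRAME * FRAME_INTERVAL_MS

# Integer arithmetic keeps the result exact.
bit_rate_bps = multiframe_bits * 1000 // multiframe_ms
average_bits_per_frame = bit_rate_bps * FRAME_INTERVAL_MS // 1000
```

Under these assumptions the total comes to 4800 bits per second, or 48 bits per 10 ms frame on average once the synchronization, header, and error-protection overhead is included.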
DSR systems have the drawback that the original speech input is not available at the back end for storage and/or verification purposes. It would be helpful to have the original speech input available for: (i) back end applications that require human assistance, e.g., to permit hand correction of documents generated using remote dictation systems by allowing comparison of the document to the original speech input or to permit smooth transition when a recognition task is handed over from a DSR system to a human operator; (ii) prophylactic storage of legally sensitive information, e.g., to record the exact statements made during financial transactions such as the placement of a securities order; and (iii) validation of utterances during database collection, e.g., for training the recognizer in batch mode (and especially incremental mode) and system tune-up.
On the other hand, original speech is available at the back end if a standard ASR system is used. However, as noted above, ASR has significant distortion difficulties when used in a mobile or wireless application. That is, coded speech at the desired bit rate of around 4800 bps significantly degrades the performance of the recognizer. Alternatively, a separate high quality speech coder could be provided, but this would require a significant increase in bandwidth.