Speech recognition technology allows a user of a telecommunications network to access computer services without using a keyboard to type in words, while a spoken language system provides user-computer interaction, which enables natural conversations between people and machines. In particular, Distributed Speech Recognition (DSR) systems allow a user to give a verbal command, or dictate a memo, to a speech-processing device at one location and have the spoken words converted into written text by a speech recognizer at another location. For example, the user can speak into a wireless device, such as a mobile phone, but the voice is recognized by a network device at a remote location. One of the emerging applications of DSR is a Voice Browser or a Wireless Application Protocol (WAP) Browser, which allows anyone who has a telephone to access Internet-based services without being near a computer. DSR has many benefits. For example, voice interaction eliminates the need for a keypad on a mobile device, where the physical space available for keypads and displays is limited.
A DSR system is roughly divided into a front-end portion and a back-end portion. The front-end algorithm converts the input speech waveform signal into feature parameters, which provide a compact representation of the input speech, while retaining the information essential for speech recognition. The back-end algorithm performs the actual recognition task, taking feature parameters as input and performing a template-matching operation to compare the features with reference templates of the possible words to be recognized.
In traditional Automatic Speech Recognition (ASR), both the front end and back end are located at the speech recognition server, which is accessed through the Public Switched Telephone Network (PSTN) speech connection. If the speech signal comes from a mobile phone user, significant degradation of speech recognition accuracy may result from speech coding inaccuracies and radio transmission errors. Moreover, if the recognition results from ASR are used to drive a service that returns data to the user terminal, separate speech and data connections between the user terminal and the service are required.
DSR solves these problems of ASR by placing the front-end at the user terminal and transmitting feature parameters instead of the encoded speech waveform to the ASR server. Usually, feature parameters require less bandwidth for radio transmission than the encoded speech waveform. The feature parameters can, therefore, be sent to the ASR server using a data channel. This eliminates the need for a high bit-rate speech channel. Moreover, a low-rate data transmission is less affected by noise and distortion than a speech-channel transmission. Furthermore, if the data channel is equipped with error correction coding, the radio interface errors are no longer an issue. The full duplex data connection used to transmit the features to the ASR server can also be used to send the response data (or the encoded speech) from the ASR server to the user terminal.
One of the major disadvantages of the above-mentioned DSR methodology is that the ASR server must be able to receive and use the features coming from the standard front-end. Therefore, to support DSR, ASR vendors will have to modify their ASR engines to accommodate the DSR features. Depending on the technology used, this may be a minor undertaking or a technical challenge. If the feature vectors are sent to the ASR server using the fourteen components for each 10 ms frame of speech, the resulting bit-rate would be 44.8 kbps, assuming floating point coefficients and no framing overhead. This bit-rate is clearly too high for cellular data channels.
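By way of illustration only, the 44.8 kbps figure can be reproduced with the short calculation below. The function name and its parameters are hypothetical, and the 32-bit coefficient width is an assumption: the text states only "floating point", and 32 bits is the width that yields the stated rate.

```python
def raw_feature_bit_rate(components=14, frame_period_ms=10, bits_per_coefficient=32):
    """Bit rate in bps for uncompressed feature vectors with no framing overhead.

    bits_per_coefficient=32 is an assumption (single-precision floating point).
    """
    frames_per_second = 1000 / frame_period_ms
    return components * bits_per_coefficient * frames_per_second


print(raw_feature_bit_rate())  # 44800.0, i.e. 44.8 kbps
```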
The European Telecommunications Standards Institute (ETSI) is currently in the process of establishing the standard for DSR signal processing. ETSI has published in ETSI ES 201 108 V1.1.2 a standard algorithm for front-end feature extraction and for transmission of the extracted features. The standard algorithm calculates feature vectors with fourteen components in 10 ms frames of speech. In particular, this ETSI publication covers the algorithm for front-end feature extraction to create Mel-Frequency Cepstral Coefficients (MFCCs). In order to allow cellular data channels to be used for data transmission, the ETSI standard also includes a feature compression algorithm to provide an efficient way to transmit the coefficients at a lower data transmission rate. This compression algorithm combines 24 feature vectors, each of which is calculated from one 10 ms frame of speech, into a multiframe of 143 bytes. This yields a bit-rate of roughly 4,767 bps. The ETSI publication also includes the formatting of the extracted features with error protection into a bit-stream for transmission and the decoding of the bit-stream to obtain the speech features at a back-end receiver, together with the associated algorithm for channel error mitigation. Nokia ETSI-STQ W1008 also discloses a front-end algorithm for feature-vector extraction.
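The compressed bit-rate quoted above follows directly from the multiframe size and duration. A minimal sketch of that arithmetic, using only the figures given in the text (the function name and its parameters are hypothetical):

```python
def multiframe_bit_rate(multiframe_bytes=143, frames_per_multiframe=24, frame_period_ms=10):
    """Bit rate in bps when one multiframe carries the stated number of frames."""
    duration_s = frames_per_multiframe * frame_period_ms / 1000  # 24 frames span 0.24 s
    return multiframe_bytes * 8 / duration_s


print(round(multiframe_bit_rate()))  # 4767
```

Compared with sending the fourteen coefficients uncompressed, this is a reduction of roughly a factor of nine.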
Cepstrum is a term for the inverse Fourier Transform of the logarithm of the power spectrum of a signal, and mel-frequency warping is a process for non-linearly modifying the scale of the Fourier transform representation of the spectrum. From the mel-frequency-warped Fourier transform representation of the log-magnitude spectrum, a set of cepstral coefficients, or feature parameters, is calculated to represent the speech signals. The extracted cepstral coefficients or parameters are known as feature vectors. They are conveyed to the back-end recognizer to perform the actual probability estimation and classification in order to reconstruct the spoken words.
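The chain described above (spectrum, mel-frequency warping, logarithm, cepstral transform) can be sketched as follows. This is an illustrative sketch only and is not the ETSI algorithm: the mel formula 2595·log10(1 + f/700), the triangular filter shape, the 23 filters, the 13 coefficients, the 8 kHz sample rate, the use of a DCT-II as the cepstral transform, and all function names are assumptions. Pre-emphasis and windowing are omitted, and a naive DFT stands in for the FFT to keep the example self-contained.

```python
import cmath
import math


def hz_to_mel(f):
    # Assumed mel mapping; a commonly used formula, not taken from the text.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank_energies(spectrum, sample_rate, num_filters):
    """Weight the spectrum with triangular filters spaced evenly on the mel scale."""
    num_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # Filter edge frequencies expressed as fractional bin positions.
    edges = [mel_to_hz(mel_max * i / (num_filters + 1)) / nyquist * (num_bins - 1)
             for i in range(num_filters + 2)]
    energies = []
    for j in range(1, num_filters + 1):
        left, centre, right = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for k, value in enumerate(spectrum):
            if left < k <= centre:
                total += value * (k - left) / (centre - left)
            elif centre < k < right:
                total += value * (right - k) / (right - centre)
        energies.append(total)
    return energies


def mfcc(frame, sample_rate=8000, num_filters=23, num_coefficients=13):
    """Cepstral coefficients for one frame: spectrum -> mel warp -> log -> DCT-II."""
    spectrum = magnitude_spectrum(frame)
    energies = mel_filterbank_energies(spectrum, sample_rate, num_filters)
    log_energies = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
    return [sum(le * math.cos(math.pi * i * (j + 0.5) / num_filters)
                for j, le in enumerate(log_energies))
            for i in range(num_coefficients)]


# Usage: one 256-sample frame containing two sinusoids.
frame = [math.sin(2 * math.pi * 440 * t / 8000) + 0.5 * math.sin(2 * math.pi * 1234 * t / 8000)
         for t in range(256)]
coefficients = mfcc(frame)
print(len(coefficients))  # 13
```

One property worth noting in this sketch is that scaling the input changes only the zeroth coefficient, because a gain adds the same constant to every log filter-bank output and the higher-order DCT basis functions sum to zero.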
The DSR front-end 1 and back-end 7, according to Nokia ETSI-STQ W 1008, are shown in FIGS. 1A and 1B, respectively. As shown in FIG. 1A, as the speech signal 100 is conveyed to a time-domain pre-processing block 2, it is converted to a digital signal. The digital signal is segmented into frames, each having N samples. An FFT block 3 is used to compute from the pre-processed signal a magnitude spectrum and generate N spectral magnitude values. In particular, a Fast Fourier Transform is performed to produce a set of coefficients or spectral values. Typically, the entire spectrum of coefficients is conveyed to a full-band processing block 4 to compute a set of mel-frequency cepstral coefficients (MFCCs). At the same time, the same spectrum of coefficients is divided into sub-parts, each corresponding to a different frequency sub-band to be processed by a plurality of sub-band processing blocks 41, . . . , 4B into additional sets of MFCCs. From the sets of MFCCs, a feature-vector assembling block 5 forms a data unit, known as a feature vector, for each frame. Often, additional information concerning the time derivatives of each MFCC is also provided. For example, a feature vector may also contain information about the first and second time derivatives of each cepstral coefficient. A conventional method for incorporating temporal information into speech vectors is to apply linear regression to a series of successive cepstral coefficients to generate first- and second-difference cepstra, referred to as ‘delta’ and ‘delta-delta’ cepstra. Although the feature vector can be transmitted, as such, to a back-end for speech recognition, it is usually preferred to reduce the amount of data to be transmitted. Thus, the feature vector of each frame is subjected to down sampling by a factor of 2 or 3 by a down-sampling device 6 before speech data is transmitted to the back-end. The down-sampled speech data is denoted by reference numeral 160.
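The linear-regression computation of ‘delta’ and ‘delta-delta’ cepstra, and the subsequent down sampling, can be sketched as follows. This is a minimal sketch under stated assumptions: the regression window of two frames on either side, the replication of edge frames, and the function names are hypothetical, and the down sampling simply keeps every n-th frame, since the text does not say whether any filtering precedes it.

```python
def delta(sequence, window=2):
    """Regression slope of each coefficient over +/- `window` frames.

    `sequence` is a list of frames, each a list of cepstral coefficients.
    Frames beyond either end are replaced by the nearest edge frame (assumption).
    """
    num_frames = len(sequence)
    denominator = 2 * sum(n * n for n in range(1, window + 1))

    def frame_at(t):
        return sequence[min(max(t, 0), num_frames - 1)]

    result = []
    for t in range(num_frames):
        result.append([
            sum(n * (frame_at(t + n)[i] - frame_at(t - n)[i])
                for n in range(1, window + 1)) / denominator
            for i in range(len(sequence[t]))
        ])
    return result


def down_sample(sequence, factor=2):
    """Keep every `factor`-th frame (plain decimation, no filtering; an assumption)."""
    return sequence[::factor]


# Usage: ten frames of two coefficients that change linearly with time.
static = [[3.0 * t, -1.0 * t] for t in range(10)]
delta_cepstra = delta(static)               # first-difference ('delta') cepstra
delta_delta_cepstra = delta(delta_cepstra)  # second-difference ('delta-delta') cepstra
print(delta_cepstra[5])                     # [3.0, -1.0]
print(len(down_sample(static, 2)), len(down_sample(static, 3)))  # 5 4
```

Applying the same regression twice yields the delta-delta cepstra; for the linear ramp used here they are zero away from the sequence edges.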
It should be noted that the time domain processing block 2, the FFT block 3, the processing means 4, 41, . . . , 4B, and the cepstral feature vector assembling block 5 are basically the same as the corresponding blocks 20, 30, 40, 401, . . . , 40B, 50 of the distributed speech recognition front-end of the present invention, as shown in FIG. 2. These blocks will be described in more detail in conjunction with FIG. 2 later.
At the DSR back-end 7, as shown in FIG. 1B, the received feature-vector coefficients 160″ are up-sampled by an up-sampling device 8, using the same factor that was used for down sampling, so that the up-sampled features are reproduced at the original frame rate. The static feature-vector coefficients are then augmented with their first- and second-order time derivatives at block 9. The first- and second-order derivatives are appended to the static coefficients to produce the feature vector for one frame. At the final block 10 of the back-end as shown in FIG. 1B, a simple recursive normalization is usually carried out in the cepstral feature-vector domain in order to reduce the mismatch that may occur between training and testing environments. The output 190 from the block 10 is a signal indicative of normalized feature vectors.
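The up-sampling and recursive normalization steps can be illustrated with the sketch below. The text does not specify either method, so both are assumptions: up-sampling is done by linear interpolation between received frames, and the normalization subtracts a recursively updated running mean with a hypothetical forgetting factor `alpha`. The function names are likewise hypothetical.

```python
def up_sample(frames, factor=2):
    """Restore the original frame rate by linear interpolation (assumed method)."""
    if not frames:
        return []
    result = []
    for current, following in zip(frames, frames[1:]):
        for step in range(factor):
            weight = step / factor
            result.append([(1 - weight) * a + weight * b
                           for a, b in zip(current, following)])
    # The last received frame has no successor; repeat it to fill the final slots.
    result.extend(list(frames[-1]) for _ in range(factor))
    return result


def recursive_mean_normalize(frames, alpha=0.99):
    """Subtract a running mean updated as mean = alpha*mean + (1 - alpha)*frame.

    Mean-only normalization and alpha=0.99 are assumptions for illustration.
    """
    if not frames:
        return []
    mean = list(frames[0])
    normalized = []
    for frame in frames:
        mean = [alpha * m + (1 - alpha) * x for m, x in zip(mean, frame)]
        normalized.append([x - m for x, m in zip(frame, mean)])
    return normalized


# Usage: two received frames restored to four frames, then normalized.
received = [[0.0], [2.0]]
restored = up_sample(received, factor=2)
print(restored)  # [[0.0], [1.0], [2.0], [2.0]]
normalized = recursive_mean_normalize(restored)
```

Because the running-mean update is linear, a constant offset added to every frame, such as one introduced by a fixed channel, cancels in the normalized output. This is the kind of training and testing mismatch the normalization is intended to reduce.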
One of the major disadvantages of the DSR methodology, as set forth by ETSI, is that the statistics of speech signals vary greatly, depending on the test environment of the speech recognition system. Thus, the noise component in the feature parameters may not be effectively removed. In a noisy environment, the efficiency of speech recognition, in terms of word accuracy, may not be high enough.
Thus, it is desirable to provide a distributed-speech feature extraction method and system, wherein the noise component can be removed effectively.