This invention relates to speech recognition.
Speech recognition is well known in the field of computers. Nowadays it is being applied to mobile telephones, particularly to enable voice dialling. With voice dialling, a user can, for example, say the name of the person he or she wants to call; the telephone recognises the name and then looks up the corresponding number. Alternatively, the user may say the required telephone number directly. This is convenient, since the user does not have to use keys. It is desirable to increase the ability of mobile telephones to understand spoken words, letters, numerals and other spoken information. Unfortunately, current speech recognition techniques require too much processing capacity to be practical in a small portable mobile telephone.
Speech recognition functionality can be implemented in a telephone network, in such a way that a telephone user's speech is recognised in the network rather than in a handset. By locating speech recognition functionality in the network, greater processing power can be made available. However, the accuracy of speech recognition is degraded by distortions introduced into the speech signal and by the reduction in bandwidth that results from its transmission to the network. In a typical landline connection, the bandwidth of the speech signal transferred to the network is only about 3 kHz, which means that a significant part of the voice spectrum is lost and thus the information it contains is unavailable for use in speech recognition. This problem can be avoided by dividing speech recognition functionality between the telephone handset and the network.
WO 95/17746 describes a system in which an initial stage of speech recognition is carried out in a remote station. The remote station generates parameters characteristic of the voice signal, so-called “speech features”, and transmits them to a central processing station which is provided with the functionality to process the features further. In this way, the features can be extracted, for example, from a speech signal using the entire spectrum captured by a microphone of the remote station. Additionally, the required transmission bandwidth between the remote station and the central processing station is also reduced. Instead of transmitting a speech signal to convey the speech in electrical format, only a limited number (e.g. tens) of parameters (features) are transmitted for each speech frame.
The two main blocks typically present in speech recognition systems are a signal processing front-end, where feature extraction is performed, and a back-end, where pattern matching is performed to recognise spoken information. It is worth mentioning that the division of speech recognition into these two parts, front-end and back-end, is also feasible in cases other than a distributed speech recognition system. The task of the signal processing front-end is to convert a real-time speech signal into some kind of parametric representation in such a way that the most important information is extracted from the speech signal. The back-end is typically based on a Hidden Markov Model (HMM) that adapts to a speaker so that the probable words or phonemes are recognised from a set of parameters corresponding to distinct states of speech. The speech features provide these parameters. The objective is that the extracted feature vectors are robust to distortions caused by background noise, a communications channel, or audio equipment (for example, that used to capture the speech signal).
Prior art systems often derive speech features using a front-end algorithm based on Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs provide good accuracy in situations where there is little or no background noise, but performance drops significantly in the presence of only moderate levels of noise. Thus, there is a need for a method that has a corresponding performance at low levels of background noise and significantly better performance in noisier conditions.
The noise which disturbs the speech recognition process originates from various sources. Many of these noise sources are so-called convolutional noise sources. In other words, the effect they have on the speech signal can be represented as a mathematical convolution between the noise source and the speech signal. The vocal tract of the user and the electrical components used in speech acquisition and processing can both be considered as convolutional noise sources. The user's vocal tract has an acoustic transfer function determined by its physical configuration, and the electrical components of the acquisition and processing system have certain electronic transfer functions. The transfer function of the user's vocal tract affects, among other things, the pitch of the spoken information uttered by the user, as well as its general frequency characteristics. The transfer functions of the electrical components, which usually include a microphone, one or more amplifiers and an Analogue-to-Digital (A/D) converter for converting the signal captured by the microphone into digital form, affect the frequency content of the captured speech information. Thus, both the user-specific transfer function of the vocal tract and the device-specific electronic transfer function(s) effectively cause inter-user and inter-device variability in the properties of the speech information acquired for speech recognition. The provision of a speech recognition system that is substantially immune to these kinds of variations is a demanding technical task.
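As an illustration of the convolutional model described above, the relationship can be written out as follows. The symbols are introduced here for illustration only and do not appear in the cited documents: x[n] denotes the speech signal, h[n] the impulse response of a convolutional noise source (for example, the acquisition electronics), and y[n] the signal actually captured.

```latex
\begin{align}
  y[n] &= (h * x)[n] = \sum_{k} h[k]\, x[n-k] \\
  Y(\omega) &= H(\omega)\, X(\omega) \\
  \log\lvert Y(\omega)\rvert &= \log\lvert H(\omega)\rvert + \log\lvert X(\omega)\rvert
\end{align}
```

Under this model, a fixed transfer function appears as an additive offset once the logarithm of the magnitude spectrum is taken, which is one reason log-spectral and cepstral representations are commonly used as speech features.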
Speech recognition of a captured speech signal typically begins with A/D-conversion, pre-emphasis and segmentation of a time-domain electrical speech signal. At the pre-emphasis stage, the amplitude of the speech signal is enhanced in certain frequency ranges, usually those in which the amplitude is smaller. Segmentation divides the signal into frames, each representing a short time period, usually 20 to 30 milliseconds. The frames are formed in such a way that they are either temporally overlapping or non-overlapping. Speech features are generated using these frames, often in the form of Mel-Frequency Cepstral Coefficients (MFCCs). It should be noted that although much of the description which follows concentrates on the use of Mel-Frequency Cepstral Coefficients in the derivation of speech features, application of the invention is not limited to systems in which MFCCs are used. Other parameters may also be used as speech features. WO 94/22132 describes the generation of MFCCs. The operation of an MFCC generator described in that publication is shown in FIG. 1. A segmented speech signal is received by a time-to-frequency-domain conversion unit. In step 101, a speech frame is transformed into the frequency domain with a Fast Fourier Transform (FFT) algorithm to provide 256 transform coefficients. In step 102, a power spectrum of 128 coefficients is formed from the transform coefficients. In step 103, the power spectrum is integrated over 19 frequency bands to provide 19 band power coefficients. In step 104, a logarithm is computed from each of the 19 band power coefficients to provide 19 log-values. In step 105, a Discrete Cosine Transform (DCT) is performed on the 19 log-values. The frequency domain signal is then processed in a noise reduction block in order to suppress noise in the signal. Finally, the 8 lowest order coefficients are selected.
It should be appreciated that the numbers of samples and various coefficients referred to in WO 94/22132 represent only one example.
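Steps 101 to 105 described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited publication: all function names are hypothetical, the 19 bands are taken to be equal-width and rectangular purely for brevity (a mel-spaced filter bank would normally be used), the noise reduction block is omitted, and the 8 coefficients kept are assumed to include the zeroth one.

```python
import cmath
import math


def fft(x):
    """Step 101: recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(coeffs):
    """Step 102: squared magnitude of the first half of the transform."""
    half = len(coeffs) // 2
    return [abs(c) ** 2 for c in coeffs[:half]]


def band_powers(power, n_bands=19):
    """Step 103: integrate the power spectrum over n_bands bands.

    Assumption: equal-width rectangular bands, chosen only to keep the
    sketch short.
    """
    edges = [round(i * len(power) / n_bands) for i in range(n_bands + 1)]
    return [sum(power[edges[i]:edges[i + 1]]) for i in range(n_bands)]


def log_values(bands, floor=1e-10):
    """Step 104: logarithm of each band power (floored to avoid log(0))."""
    return [math.log(max(b, floor)) for b in bands]


def dct(values):
    """Step 105: DCT-II of the log-values."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_features(frame, n_bands=19, n_keep=8):
    """Run steps 101 to 105 on one frame and keep the lowest-order coefficients."""
    spectrum = fft(frame)                            # step 101
    power = power_spectrum(spectrum)                 # step 102
    logs = log_values(band_powers(power, n_bands))   # steps 103 and 104
    return dct(logs)[:n_keep]                        # step 105, then selection


# Usage: a synthetic 256-sample frame made of two sinusoids (illustrative only).
frame = [
    math.sin(2 * math.pi * 10 * n / 256) + 0.5 * math.sin(2 * math.pi * 40 * n / 256)
    for n in range(256)
]
features = cepstral_features(frame)
print(len(features))  # 8
```

With a 256-sample frame, the sketch reproduces the counts given above: 256 transform coefficients, a 128-point power spectrum, 19 band powers and 8 retained coefficients.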
It is a characteristic of linear transforms, for example DCTs, that disturbance caused by noise in a certain frequency band is spread to surrounding frequency bands. This is an undesirable effect, particularly in speech recognition applications.
In Okawa et al., “Multiband Speech Recognition In Noisy Environments,” IEEE, 1998, pp. 641-644 (IEEE 0-7803-4428-6/98), a multi-band automatic speech recognition method is presented. In this method a speech signal is divided into different sub-parts of the entire frequency band of the signal. Then each sub-part is processed separately. In this case, narrow-band noise occurring in one frequency sub-part does not spread from one sub-part to another frequency sub-part. The method has shown good results in the case where the majority of the frequency band is not affected by noise, for example, in the presence of narrow band noise. However, when the noise is spread widely over the frequency band of the speech signal, word recognition accuracy can drop by up to 25%. The method is thus appropriate only under certain types of noise, for example to compensate for car engine noise that appears only in a relatively narrow frequency band.
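The difference between full-band and multi-band processing can be illustrated with a small numerical sketch. The values and names below are hypothetical and are not taken from the cited paper: eight log band values represent one frame, and a disturbance is added to a single band to imitate narrow-band noise.

```python
import math


def dct(values):
    """DCT-II of a list of values."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def changed(a, b, tol=1e-9):
    """Indices at which two coefficient lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tol]


# Hypothetical log band values for one frame: clean, and with a disturbance
# confined to band 1 (narrow-band noise).
clean = [3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.6, 0.5]
noisy = list(clean)
noisy[1] += 2.0

# Full-band processing: one DCT over all 8 bands.
full_changed = changed(dct(clean), dct(noisy))

# Multi-band processing: bands 0-3 and bands 4-7 are transformed separately.
low_changed = changed(dct(clean[:4]), dct(noisy[:4]))
high_changed = changed(dct(clean[4:]), dct(noisy[4:]))

print(full_changed)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(low_changed)   # [0, 1, 2, 3]
print(high_changed)  # []
```

In the full-band case the single disturbed band alters every DCT coefficient, whereas with separate sub-band transforms the coefficients of the unaffected sub-band remain unchanged.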
It is an object of the present invention to improve speech recognition accuracy for various noise types and under different noise conditions.
According to a first aspect of the disclosed embodiments a speech recognition feature extractor includes a time-to-frequency domain transformer for generating spectral values in the frequency domain from a speech signal, and a partitioning block for generating a first set of spectral values in the frequency domain and an additional set of spectral values in the frequency domain. The first aspect also includes a first feature generator for generating a first group of speech features using the first set of spectral values and an additional feature generator for generating an additional group of speech features using the additional set of spectral values. The feature generators are arranged to operate in parallel.
An assembler is included for assembling an output set of speech features from at least one speech feature from the first group of speech features and at least one speech feature from the additional group of speech features, and an anti-aliasing and rate reduction block, configured to convert the output set of speech features to a data reduced output set. Furthermore, the first and additional sets of spectral values include at least one common spectral value.
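The arrangement of the first aspect can be sketched as follows. This is a hedged illustration only: every name is hypothetical, the first set is assumed to be the whole band and the additional set its lower half (so that the two sets share spectral values), a DCT stands in for each feature generator, a thread pool is used merely to show that the two generators are independent, and block averaging over time stands in for the anti-aliasing and rate reduction block.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dct(values, n_keep):
    """Feature generator stand-in: first n_keep DCT-II coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_keep)
    ]


def partition(spectral_values):
    """Partitioning block: two sets that share spectral values.

    Assumption: the first set is the whole band and the additional set is its
    lower half, so every value in the additional set is common to both.
    """
    first = list(spectral_values)
    additional = list(spectral_values[: len(spectral_values) // 2])
    return first, additional


def extract(spectral_values, n_first=4, n_additional=2):
    """Generate both feature groups in parallel and assemble one output set."""
    first, additional = partition(spectral_values)
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_job = pool.submit(dct, first, n_first)
        additional_job = pool.submit(dct, additional, n_additional)
        # Assembler: concatenate features from both groups.
        return first_job.result() + additional_job.result()


def reduce_rate(feature_vectors, factor=2):
    """Anti-aliasing and rate reduction over time.

    Assumption: average each run of `factor` consecutive feature vectors
    (a crude low-pass filter) and emit one vector per run.
    """
    reduced = []
    for start in range(0, len(feature_vectors) - factor + 1, factor):
        block = feature_vectors[start:start + factor]
        reduced.append([sum(col) / factor for col in zip(*block)])
    return reduced


# Usage: four frames of hypothetical spectral values.
frames = [
    [3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.6, 0.5],
    [3.1, 2.4, 2.1, 1.4, 1.1, 0.7, 0.6, 0.4],
    [2.9, 2.6, 1.9, 1.6, 0.9, 0.9, 0.5, 0.5],
    [3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.6, 0.5],
]
outputs = [extract(f) for f in frames]
reduced = reduce_rate(outputs)
print(len(outputs), len(reduced), len(reduced[0]))  # 4 2 6
```

Each output vector here carries four features from the first group and two from the additional group, and the rate reduction step halves the number of vectors passed on.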
According to a second aspect of the disclosed embodiments a speech recognition system includes a speech recognition feature extractor and a back-end for recognising spoken information from speech features.
The feature extractor includes a time-to-frequency domain transformer for generating spectral values in the frequency domain from a speech signal, a partitioning block for generating a first set of spectral values in the frequency domain and an additional set of spectral values in the frequency domain, a first feature generator for generating a first group of speech features using the first set of spectral values, and an additional feature generator for generating an additional group of speech features using the additional set of spectral values. The feature generators are arranged to operate in parallel, and the first and additional sets of spectral values include at least one common spectral value.
The feature extractor also includes an assembler for assembling an output set of speech features from at least one speech feature from the first group of speech features and at least one speech feature from the additional group of speech features and an anti-aliasing and rate reduction block, configured to convert the output set of speech features to a data reduced output set.
The back-end includes a data bank for maintaining statistical models of spoken information, a block for receiving speech features relating to two different frequency ranges of a speech frame, and a recognition block for selecting from the data bank a model of spoken information that best matches the received speech features.
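The selection performed by the recognition block can be illustrated with a deliberately simplified sketch. As an assumption made for brevity, each statistical model in the data bank is a single diagonal-covariance Gaussian (a mean and a variance per feature) rather than a Hidden Markov Model, and all names and values are hypothetical.

```python
import math


def log_likelihood(features, model):
    """Log-likelihood of a feature vector under a diagonal Gaussian model."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(features, mean, var)
    )


def recognise(features, data_bank):
    """Select the model in the data bank that best matches the features."""
    return max(data_bank, key=lambda name: log_likelihood(features, data_bank[name]))


# Hypothetical data bank: one (mean, variance) model per word.
data_bank = {
    "yes": ([1.0, 0.0], [1.0, 1.0]),
    "no": ([-1.0, 2.0], [1.0, 1.0]),
}
print(recognise([0.9, 0.2], data_bank))  # yes
```

The same selection structure applies whatever the unit of spoken information is; only the contents of the data bank and the scoring function change.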
The various aspects of the disclosed embodiments may be implemented in a number of apparatuses and methods and may also be embodied in a computer program product that includes computer readable program code.
The spoken information may be a word and statistical models of words may be maintained. Alternatively, the spoken information may be a phoneme and statistical models of phonemes may be maintained. In yet another embodiment, the spoken information may be an utterance and statistical models of utterances may be maintained.
The statistical models of spoken information, referred to above, may be maintained in a data bank.
It should be appreciated that the inventive concept enabling the embodiments of the first aspect also applies to the other aspects, but in order to condense this document all these numerous embodiments are not expressly written out.