This invention relates to speech recognition.
Prior-art speech recognition systems consist of two main parts: a feature extraction (or front-end) stage and a pattern matching (or back-end) stage. The front-end extracts speech parameters (typically referred to as features) relevant for recognition of a speech signal. The back-end receives these features and performs the actual recognition. In addition to reducing the redundancy of the speech signal, the front-end should mitigate the effect of environmental factors, such as noise and factors specific to the terminal and acoustic environment.
The task of the feature extraction front-end is to convert a real-time speech signal into a parametric representation in such a way that the most important information is extracted from the speech signal. The back-end is typically based on a Hidden Markov Model (HMM), a statistical model that adapts to speech in such a way that the probable words or phonemes are recognised from a set of parameters corresponding to distinct states of speech. The speech features provide these parameters.
It is possible to distribute the speech recognition operation so that the front-end and the back-end are separate from each other. For example, the front-end may reside in a mobile telephone and the back-end may be elsewhere, connected to a mobile telephone network. Naturally, speech features extracted by a front-end can also be used in a device comprising both the front-end and the back-end. The objective is that the extracted feature vectors are robust to distortions caused by background noise, by non-ideal equipment used to capture the speech signal and, if distributed speech recognition is used, by the communications channel.
Speech recognition of a captured speech signal typically begins with analogue-to-digital conversion, pre-emphasis and segmentation of a time-domain electrical speech signal. Pre-emphasis boosts the amplitude of the speech signal at those frequencies at which the amplitude is usually smaller. Segmentation divides the signal into frames, each representing a short time period, usually 20 to 30 milliseconds. The frames are either temporally overlapping or non-overlapping. The speech features are generated using these frames, often in the form of Mel-Frequency Cepstral Coefficients (MFCCs).
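By way of illustration only, the pre-emphasis and segmentation steps described above might be sketched as follows. The first-order filter form, the coefficient 0.97, the function names and the frame length and shift in the usage note are assumptions made for this sketch and are not taken from the present description.

```python
import numpy as np

def pre_emphasise(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed, commonly used value; the first sample is passed through.
    """
    signal = np.asarray(signal, dtype=float)
    return np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))

def frame_signal(signal, frame_len, frame_shift):
    """Split a 1-D signal into frames of frame_len samples.

    Frames overlap when frame_shift < frame_len and do not overlap when
    frame_shift == frame_len. Trailing samples that do not fill a frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    starts = frame_shift * np.arange(n_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])
```

For example, at an assumed sampling rate of 8 kHz, a 25 millisecond frame corresponds to 200 samples and a 10 millisecond shift to 80 samples, so `frame_signal(pre_emphasise(x), 200, 80)` yields overlapping frames.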
MFCCs may provide good speech recognition accuracy in situations where there is little or no background noise, but performance drops significantly in the presence of only moderate levels of noise. Several techniques exist to improve the noise robustness of speech recognition front-ends that employ the MFCC approach. So-called cepstral domain parameter normalisation (CN) is one of the most effective techniques known to date. Methods falling into this class attempt to normalise the extracted features in such a way that certain desirable statistical properties in the cepstral domain are achieved over the entire input utterance, for example zero mean, or zero mean and unity variance.
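A minimal sketch of one such normalisation, assuming that zero mean and unity variance are targeted per coefficient over a whole utterance, is given below. The function name and the small constant `eps` guarding against division by zero are assumptions introduced for illustration.

```python
import numpy as np

def normalise_utterance(features, eps=1e-10):
    """Normalise each cepstral coefficient over the utterance.

    features: array of shape (n_frames, n_coeffs).
    Returns an array of the same shape whose columns have zero mean and,
    where the column is not constant, unity variance.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps (assumed) avoids division by zero for constant coefficients.
    return (features - mean) / np.maximum(std, eps)
```

Dropping the division gives the zero-mean-only variant mentioned above.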
FIG. 1 is a drawing taken from WO 94/22132 showing the structure of a typical MFCC front-end. A digitised speech signal is pre-emphasised (PE), and then framed into overlapping segments (Fr). In each frame the signal is multiplied by a window function (W). This may be, for example, a Hamming window function. Next, a Fast Fourier Transform (FFT) is applied, resulting in a spectral representation of the speech signal, that is, a set of spectral magnitude values in the frequency domain. The spectral representation is further processed by a filter-bank (Mel) which models the characteristics of the human auditory system (Mel-filtering). This results in a set of sub-band values. Each sub-band value represents the spectral magnitude values of a certain frequency sub-band of the frequency domain speech spectrum. Mel-filtering also removes a considerable amount of redundant information from the signal. Non-linear compression, typically implemented by taking the logarithm of the sub-band values, is used to create a balance between high and low amplitude components. This step models the sensitivity function of the human auditory system. Logarithm computation generates a log-spectral signal comprising logarithms of the sub-band values. The log-spectral signal is then de-correlated by applying a discrete cosine transform (DCT). The DCT applied to the log-spectral sub-band values produces the MFCCs. The number of MFCCs is typically smaller than the number of sub-band value logarithms. The MFCCs are further processed in a normalisation block (CN) in order to improve the noise robustness of the speech recognition that is eventually carried out in the back-end.
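The chain from window function to DCT described above may be sketched as follows, taking already pre-emphasised and framed samples as input. This is an illustrative sketch only: the triangular filter shape, the Hz-to-mel formula, the choice of 23 sub-bands and 13 coefficients, the logarithm floor and all function names are assumptions, and no zero-padding is applied before the transform.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used mel-scale formula (assumed here).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_bands, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale.

    Returns an array of shape (n_bands, n_fft // 2 + 1).
    """
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_bands, len(bin_hz)))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_out, n_in):
    """Unnormalised DCT-II basis: C[k, j] = cos(pi * k * (j + 0.5) / n_in)."""
    k = np.arange(n_out)[:, None]
    j = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (j + 0.5) / n_in)

def mfcc(frames, sample_rate, n_bands=23, n_ceps=13, floor=1e-10):
    """frames: (n_frames, frame_len) array. Returns (n_frames, n_ceps) MFCCs."""
    frames = np.asarray(frames, dtype=float)
    n_fft = frames.shape[1]
    windowed = frames * np.hamming(n_fft)                      # W: Hamming window
    magnitude = np.abs(np.fft.rfft(windowed, axis=1))          # FFT: spectral magnitudes
    bank = mel_filter_bank(n_bands, n_fft, sample_rate)
    sub_bands = magnitude @ bank.T                             # Mel: sub-band values
    log_bands = np.log(np.maximum(sub_bands, floor))           # logarithmic compression
    return log_bands @ dct_matrix(n_ceps, n_bands).T           # DCT: de-correlation
```

With these assumed defaults, 13 coefficients are kept from 23 log sub-band values, consistent with the number of MFCCs being smaller than the number of sub-band value logarithms.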
Mean removal is an optional processing step that may be applied to reduce some slowly changing parts of the speech signal. This is based on the idea that slow variations do not carry linguistic information but are rather attributed to environmental factors. It can be demonstrated that under certain circumstances mean removal may actually lead to a reduction in speech recognition accuracy.
The present invention introduces a concept that speech recognition accuracy may actually be improved by performing a mean emphasis instead of mean removal in combination with cepstral domain parameter normalisation.
According to a first aspect of the invention there is provided a speech recognition feature extractor for extracting speech features from a speech signal, comprising:
a time-to-frequency domain transformer for generating spectral magnitude values in the frequency domain from the speech signal;
a frequency domain filtering block for generating a sub-band value relating to spectral magnitude values of a certain frequency sub-band, for each of a group of frequency sub-bands;
a compression block for compressing said sub-band values;
a transformation block for obtaining a set of de-correlated features from the sub-band values; and
a normalising block for normalising features; characterised by said feature extractor comprising:
a mean emphasising block for emphasising at least one of the sub-band values after frequency domain filtering.
Preferably, all the sub-band values are mean emphasised. Alternatively, some of the sub-band values are mean emphasised.
Preferably, the compression block is a non-linear compression block. Preferably, non-linear compression is performed using a logarithmic function.
Preferably, the transformation block is a linear transformation block. Preferably, the linear transformation is performed using a discrete cosine transform (DCT).
Preferably, said frequency domain filtering block is arranged to generate sub-band values according to a scale based on an auditory model (auditory based scale).
Preferably, said feature extractor comprises a differentiation block for generating first time derivatives and second time derivatives for each of said de-correlated features; and
said normalising block is arranged to generate normalised speech features using said de-correlated features, said first time derivatives, and said second time derivatives.
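As a hedged illustration of such a differentiation block, the sketch below approximates the first and second time derivatives by central differences along the frame axis and appends them to the de-correlated features. The central-difference estimator, the edge handling by repetition and the function names are assumptions; other derivative estimators could equally be used.

```python
import numpy as np

def time_derivative(features):
    """Central-difference slope along the frame axis.

    features: (n_frames, n_coeffs). The first and last frames are repeated
    at the edges (an assumed edge treatment) so the output keeps the same shape.
    """
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def with_derivatives(features):
    """Stack features with their first and second time derivatives."""
    d1 = time_derivative(features)
    d2 = time_derivative(d1)  # second derivative as the derivative of the first
    return np.hstack([features, d1, d2])
```

The stacked array, three times as wide as the input, is what a normalising block would then operate on.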
Preferably, said mean emphasising block is arranged to add a mean estimate term to each sub-band value that is to be mean emphasised. Preferably, the mean estimate term is calculated from compressed sub-band values representing a series of at least two subsequent speech frames.
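One possible reading of the mean emphasising block is sketched below, under stated assumptions: the mean estimate term is taken to be a causal running average of the compressed (logarithmic) sub-band values over up to `window` most recent frames, and it is added to the compressed value of each selected sub-band. The averaging scheme, the window length, the `bands` parameter and the function name are hypothetical choices for illustration and are not prescribed by the text above.

```python
import numpy as np

def mean_emphasise(log_bands, window=50, bands=None):
    """Add a running mean estimate to selected compressed sub-band values.

    log_bands: (n_frames, n_bands) compressed sub-band values.
    window:    assumed number of most recent frames used for the mean estimate
               (fewer frames are used at the start of the signal).
    bands:     indices of sub-bands to emphasise; None means all sub-bands.
    """
    log_bands = np.asarray(log_bands, dtype=float)
    n_frames, n_bands = log_bands.shape
    selected = np.arange(n_bands) if bands is None else np.asarray(bands)

    # Cumulative sums give the sliding-window mean without an explicit loop.
    csum = np.vstack([np.zeros((1, n_bands)), np.cumsum(log_bands, axis=0)])
    ends = np.arange(1, n_frames + 1)
    starts = np.maximum(ends - window, 0)
    mean_estimate = (csum[ends] - csum[starts]) / (ends - starts)[:, None]

    out = log_bands.copy()
    out[:, selected] += mean_estimate[:, selected]  # emphasis: add, not subtract
    return out
```

Replacing the addition with a subtraction would give conventional mean removal, which makes the contrast between the two approaches explicit. Passing `bands=[0, 1]`, for instance, emphasises only some of the sub-band values.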
According to a second aspect of the invention there is provided a mobile station comprising a speech recognition feature extractor for extracting speech features from a speech signal, said extractor comprising:
a time-to-frequency domain transformer for generating from the speech signal spectral magnitude values in the frequency domain;
a frequency domain filtering block for generating a sub-band value relating to spectral magnitude values of a certain frequency sub-band, for each of a group of frequency sub-bands;
a compression block for compressing said sub-band values;
a transformation block for obtaining a set of de-correlated features from the sub-band values; and
a normalising block for normalising features; characterised by said feature extractor comprising:
a mean emphasising block for emphasising at least one of the sub-band values after frequency domain filtering.
Preferably, the compression block is a non-linear compression block. Preferably, non-linear compression is performed using a logarithmic function.
Preferably, the transformation block is a linear transformation block. Preferably, the linear transformation is performed using a discrete cosine transform (DCT).
According to a third aspect of the invention there is provided a method for extracting speech features from a speech signal, comprising the steps of:
generating spectral magnitude values in the frequency domain from the speech signal;
generating a sub-band value relating to spectral magnitude values of a certain frequency sub-band, for each of a group of frequency sub-bands;
compressing said sub-band values by applying compression to each sub-band value;
obtaining a set of de-correlated features from the sub-band values; and
normalising features; characterised by said method comprising the step of:
emphasising at least one of the sub-band values after frequency domain filtering.
According to a fourth aspect of the invention there is provided a computer program for extracting speech features from a speech signal, comprising:
a computer readable program means for causing a computer to generate spectral magnitude values in the frequency domain from the speech signal;
a computer readable program means for causing a computer to generate a sub-band value relating to spectral magnitude values of a certain frequency sub-band, for each of a group of frequency sub-bands;
a computer readable program means for causing a computer to compress said sub-band values;
a computer readable program means for causing a computer to obtain a set of de-correlated features from the sub-band values;
a computer readable program means for causing a computer to normalise features; characterised by said computer program comprising:
a computer readable program means for causing a computer to emphasise at least one of the sub-band values after frequency domain filtering.