1. Field of the Invention
This invention relates to speech recognition and, in particular, to the generation of features for use in speech recognition.
2. Related Art
Automated speech recognition systems are generally designed for a particular use. For example, a service that is to be accessed by the general public requires a generic speech recognition system designed to recognise speech from any user. Automated speech recognisers associated with data specific to a user are used either to recognise a user or to verify a user's claimed identity (so-called speaker recognition).
Automated speech recognition systems receive an input signal from a microphone, either directly or indirectly (e.g. via a telecommunications link). The input signal is then processed by speech processing means which typically divide the input signal into successive time segments, or frames, and produce for each frame an appropriate (spectral) representation of the characteristics of the time-varying input signal. Common techniques of spectral analysis are linear predictive coding (LPC) and the Fourier transform. Next, the spectral measurements are converted into a set or vector of features that describe the broad acoustic properties of the input signal. The most common features used in speech recognition are mel-frequency cepstral coefficients (MFCCs).
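By way of illustration only, the framing and spectral analysis steps described above can be sketched as follows. The function names, the Hamming window, the frame length and the hop size are assumptions chosen for this example and are not taken from any particular recogniser; the sketch stops at a log power spectrum and does not compute MFCCs.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames and apply a Hamming window.

    frame_len and hop are in samples; both are illustrative parameters.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i]
                       for i in range(frame_len)])
    return frames

def log_power_spectrum(frame):
    """Log power spectrum of one frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n)
                 for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        # Small constant avoids log(0) for silent frames.
        spectrum.append(math.log(re * re + im * im + 1e-12))
    return spectrum

# Hypothetical input: a 1 kHz tone sampled at 8 kHz.
samples = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(256)]
frames = frame_signal(samples, frame_len=64, hop=32)
spectra = [log_power_spectrum(f) for f in frames]
```

In a practical system the naive DFT would be replaced by a fast Fourier transform, and the spectrum would be passed through a mel-scaled filter bank before the cepstral coefficients are derived.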
The feature vectors are then compared with a plurality of patterns representing or relating in some way to words (or parts thereof) or phrases to be recognised. The results of the comparison indicate the word/phrase deemed to have been recognised.
The pattern matching approach to speech recognition generally involves one of two techniques: template matching or statistical modelling. In the former case, a template is formed representing the spectral properties of a typical speech signal representing a word. Each template is the concatenation of spectral frames over the duration of the speech. A typical sequence of speech frames for a pattern is thus produced via an averaging procedure, and an input signal is compared to these templates. In the latter case, one well-known and widely used statistical method of characterizing the spectral properties of the frames of a pattern is the hidden Markov model (HMM) approach. The underlying assumption of the HMM (or any other type of statistical model) is that the speech signal can be characterized as a parametric random process and that the parameters of the stochastic process can be determined in a precise, well-defined manner.
A well-known deficiency of current pattern-matching techniques, especially HMMs, is the lack of an effective mechanism for exploiting the correlation between successive feature vectors. A left-right HMM provides a temporal structure for modelling the time evolution of speech spectral characteristics from one state into the next, but within each state the observation vectors are assumed to be independent and identically distributed (IID). The IID assumption states that there is no correlation between successive speech vectors. This implies that within each state the speech vectors are associated with identical probability density functions (PDFs) which have the same mean and covariance. This further implies that the spectral-time trajectory within each state is a randomly fluctuating curve with a stationary mean. However, in reality the spectral-time trajectory clearly has a definite direction as it moves from one speech event to the next.
This violation of the IID assumption by the spectral vectors contributes to a limitation in the performance of HMMs. Including some temporal information in the speech features can lessen the effect of the assumption that speech is a stationary independent process, and can be used to improve recognition performance.
A conventional method which allows the inclusion of temporal information in the feature vector is to augment the feature vector with first and second order time derivatives of the cepstrum, and with first and second order time derivatives of a log energy parameter. Such techniques are described by J. G. Wilpon, C. H. Lee and L. R. Rabiner in "Improvements in Connected Digit Recognition Using Higher Order Spectral and Energy Features", Speech Processing 1, Toronto, May 14-17, 1991, Institute of Electrical and Electronics Engineers, pages 349-352.
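As a rough illustration of such augmentation, the sketch below appends first and second order time derivatives to each feature vector. It assumes a commonly used linear-regression estimate of the derivative over a window of neighbouring frames, with the end frames repeated at the boundaries; the cited paper may compute the derivatives differently, and the function names and the window width are hypothetical.

```python
def deltas(frames, width=2):
    """Regression-based time derivative of a sequence of feature vectors.

    frames is a list of equal-length lists; width is the number of
    frames on each side used in the regression (an assumed default).
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        vec = []
        for j in range(len(frames[t])):
            acc = 0.0
            for k in range(1, width + 1):
                # Repeat the first/last frame beyond the sequence ends.
                ahead = frames[min(t + k, n - 1)][j]
                behind = frames[max(t - k, 0)][j]
                acc += k * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out

def augment(frames, width=2):
    """Append first and second order derivatives to each vector."""
    d1 = deltas(frames, width)
    d2 = deltas(d1, width)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]

# Hypothetical two-dimensional features: a linear ramp and a constant.
features = [[float(t), 2.0] for t in range(10)]
augmented = augment(features)
```

Each augmented vector is three times the length of the original: the static features, followed by their first derivatives, followed by their second derivatives. A log energy parameter would be handled in the same way by including it as one more element of each input vector.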
A mathematically more implicit representation of speech dynamics is the cepstral-time matrix, which uses a cosine transform to encode the temporal information, as described in B. P. Milner and S. V. Vaseghi, "An analysis of cepstral-time feature matrices for noise and channel robust speech recognition", Proc. Eurospeech, pp 519-522, 1995. The cepstral-time matrix is also described by M. Pawlewski et al. in "Advances in telephony based speech recognition", BT Technology Journal Vol 14, No 1.
A cepstral-time matrix, c_t(m,n), is obtained either by applying a 2-D Discrete Cosine Transform (DCT) to a spectral-time matrix or by applying a 1-D DCT to a stacking of mel-frequency cepstral coefficient (MFCC) speech vectors. M N-dimensional log filter bank vectors are stacked together to form a spectral-time matrix, X_t(f,k), where t indicates the time frame, f the filter bank channel and k the time vector in the matrix. The spectral-time matrix is then transformed into a cepstral-time matrix using a two-dimensional DCT. Since a two-dimensional DCT can be divided into two one-dimensional DCTs, an alternative implementation of the cepstral-time matrix is to apply a 1-D DCT along the time axis of a matrix consisting of M conventional MFCC vectors.
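The equivalence of the two implementations can be checked numerically. The following sketch assumes an unnormalised DCT-II, stores the spectral-time matrix as a list of rows indexed by the time vector k with columns indexed by the filter bank channel f, and treats the per-frame DCT of a log filter bank vector as the corresponding cepstral vector. The function names and the test data are hypothetical.

```python
import math

def dct_1d(x):
    """Unnormalised DCT-II of a sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                for i in range(n))
            for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d(x):
    """Direct 2-D DCT-II of x[k][f] as a double sum over k and f."""
    rows, cols = len(x), len(x[0])
    out = []
    for n in range(rows):            # temporal index of the result
        out_row = []
        for m in range(cols):        # cepstral index of the result
            s = 0.0
            for k in range(rows):
                for f in range(cols):
                    s += (x[k][f]
                          * math.cos(math.pi * (k + 0.5) * n / rows)
                          * math.cos(math.pi * (f + 0.5) * m / cols))
            out_row.append(s)
        out.append(out_row)
    return out

def cepstral_time_from_stacked_cepstra(x):
    """Two-step route: per-frame DCT, then a 1-D DCT along time."""
    cepstra = [dct_1d(row) for row in x]        # one vector per frame
    along_time = [dct_1d(col) for col in transpose(cepstra)]
    return transpose(along_time)                # back to [n][m] layout

# Hypothetical spectral-time matrix: 4 frames of 3 log filter bank values.
x = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0],
     [2.5, 1.5, -0.5],
     [4.0, 0.0, 1.0]]
direct = dct_2d(x)
two_step = cepstral_time_from_stacked_cepstra(x)
max_diff = max(abs(direct[n][m] - two_step[n][m])
               for n in range(4) for m in range(3))
```

Under these assumptions the two results agree to within floating-point rounding, which reflects the separability of the two-dimensional DCT noted above.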