1. Field of the Invention
The present invention generally relates to speech recognition systems and, more particularly, to reconstructing high dimensional density feature sets from lower dimensional projections and to the use of maximum entropy and maximum likelihood criteria to optimize the directions of projection so as to improve the performance of such systems.
2. Background Description
The basic problem of speech recognition is the identification of sequences of words from a prescribed vocabulary corresponding to spoken utterances. Each word in the vocabulary can be thought of as a sequence of basic sounds, consisting of an alphabet of about fifty sounds (called "phonemes"). Therefore, the goal of speech recognition is to model the speech signal for each basic sound in such a way that it is possible to identify them by observing the sampled speech signal. Due to inherent redundancies in the acoustic signal, a parsimonious representation is achieved by extracting feature vectors periodically (every 10 ms). The feature vectors should be related to the local spectral content of the speech signal, and should preserve the information necessary to enable discrimination between different phonemes. Furthermore, the acoustic features associated with a phoneme depend on a multitude of circumstances surrounding its formation (realization).
An important step in the speech recognition process is to isolate important features of the waveform over small time intervals (typically 25 ms). These features are represented by vectors x ∈ R^d (where d is usually 39), which are then identified with context dependent sounds. Strings of such basic sounds are then converted into words using a dictionary of acoustic representations of words. In an ideal situation, the feature vectors generated by the speech waveform would be converted into a string of phonemes corresponding to the spoken utterance.
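By way of illustration only, the framing implied by the figures above (an analysis interval of about 25 ms, with a new feature vector every 10 ms) can be sketched as follows. The function name `frame_signal`, its parameters, and the 16 kHz sampling rate used in the usage note are assumptions made for this sketch; the computation of the 39-dimensional features from each frame is not shown.

```python
def frame_signal(samples, sample_rate, window_ms=25, hop_ms=10):
    """Split a sampled signal into overlapping analysis frames.

    Hypothetical helper: the defaults mirror the 25 ms interval and
    10 ms extraction period mentioned in the text.
    """
    window = sample_rate * window_ms // 1000   # samples per frame
    hop = sample_rate * hop_ms // 1000         # samples between frame starts
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must span at least one sample")
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames
```

Under the assumed 16 kHz sampling rate, each frame holds 400 samples and successive frames start 160 samples apart, so adjacent frames overlap by 15 ms.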
A problem associated with this process is to identify a phoneme label for an individual acoustic vector x. Training data is provided for the purpose of classifying a given acoustic vector. A standard approach for classification in speech recognition is to generate initial "prototypes" by K-means clustering and then to refine them by using the EM algorithm based on mixture models of Gaussian densities. See, for example, Frederick Jelinek, Statistical Methods for Speech Recognition, MIT Press (1997). Moreover, in the decoding stage of speech recognition, the output probability density functions are most commonly assumed to be mixtures of Gaussian density functions.
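The standard approach just described can be sketched, under simplifying assumptions, as K-means initialization followed by EM refinement of a Gaussian mixture. The sketch below is restricted to scalar data for brevity (acoustic vectors are typically 39-dimensional), and the function names, iteration counts, and variance floor are all assumptions of this illustration rather than details taken from any particular recognizer.

```python
import math
import random


def kmeans_1d(data, k, iters=20, seed=0):
    """Return k cluster centers ("prototypes") for scalar data."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k, iters=50, var_floor=1e-6):
    """Refine K-means prototypes into a k-component Gaussian mixture."""
    n = len(data)
    means = kmeans_1d(data, k)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            p = [w * gauss_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # numerical underflow fallback
            else:
                resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave a vanishing component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances
```

Calling `em_gmm_1d(data, 2)` on a list of floats returns the mixture weights, means, and variances. The variance floor is a common safeguard against a component collapsing onto a single data point.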
Density estimation of high dimensional data arises in speech recognition via classification of the training data. Specifically, acoustic vectors for a given sound are viewed as a random variable whose density is estimated from the data. Consequently, the training stage requires that densities be found for all basic sounds. From this information we can assign to any acoustic vector the phoneme label corresponding to the highest likelihood obtained from these probability densities. This information is the basis of the translation of acoustic vectors into text.
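The labeling rule described above (assign to an acoustic vector the phoneme whose density gives the highest likelihood) can be sketched as follows. The sketch assumes each phoneme is modeled by a Gaussian mixture with diagonal covariances, which is an assumption of this illustration; the function names and the labels in the usage note are likewise hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_mixture(x, components):
    """Log density of a mixture given as (weight, mean, var) triples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in components]
    top = max(logs)
    # Log-sum-exp keeps the sum stable when individual terms are tiny.
    return top + math.log(sum(math.exp(l - top) for l in logs))


def classify(x, models):
    """Return the label whose mixture assigns x the highest likelihood."""
    return max(models, key=lambda label: log_mixture(x, models[label]))
```

For example, with `models = {"aa": [(1.0, [0.0, 0.0], [1.0, 1.0])], "iy": [(1.0, [5.0, 5.0], [1.0, 1.0])]}`, the call `classify([0.2, -0.1], models)` selects the label whose mean lies nearest the input. Working in log likelihoods avoids the underflow that products of many small densities would otherwise cause in high dimensions.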
Speech data is characteristically represented by high dimensional vectors, and each basic sound has several thousand data vectors to model it (typically, 3000 to 5000 for each of approximately 3500 basic sounds). Purely Gaussian densities have been known to be inadequate for this purpose due to the heavy tailed distributions exhibited by speech feature vectors. As an intended remedy to this problem, practically all speech recognition systems attempt modeling by using a mixture model with Gaussian densities for mixture components. Variants of the standard K-means clustering algorithm are used for this purpose. The classical version of the K-means algorithm can also be viewed as a special case of the EM algorithm (cf. David W. Scott, Multivariate Density Estimation, Wiley Interscience (1992)) for mixtures of Gaussians with variances tending to zero. Attempts to model the phonetic units in speech with non-Gaussian mixture densities are described by S. Basu and C. A. Micchelli in "Parametric Density Estimation for the Classification of Acoustic Feature Vectors in Speech Recognition," Nonlinear Modeling: Advanced Black-Box Techniques, Eds. J. A. K. Suykens and J. Vandewalle, pp. 87-118, Kluwer Academic Publishers, Boston (1998).
There exists a large literature on estimating probability densities associated with univariate data. However, corresponding methods for estimating multivariate probability densities prove to be problematic for various reasons. See again David W. Scott, Multivariate Density Estimation, Wiley Interscience (1992). This is especially true in the context of very high dimensions. Crucial among these difficulties is the fact that the data appears increasingly sparse as the number of dimensions increases.
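The sparsity noted above can be made concrete with a simple calculation. Assuming, purely for illustration, that n data vectors are spread uniformly over the unit cube in d dimensions, the expected number of vectors falling in a sub-cube of a given side length shrinks geometrically with d. The function name and the uniform assumption are hypothetical and are not a model of actual speech data.

```python
def expected_points_in_subcube(n, d, side):
    """Expected count of n uniform points of the unit d-cube that land
    in an axis-aligned sub-cube with the given side length (0 < side <= 1)."""
    if not 0.0 < side <= 1.0:
        raise ValueError("side must lie in (0, 1]")
    return n * side ** d
```

With n = 5000 (the upper end of the per-sound sample counts mentioned earlier) and side 0.5, half of the points are expected in the sub-interval when d = 1, whereas for d = 39 the expected count is far below one point, so any local estimate of the density has essentially no data to work with.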
It is therefore an object of the present invention to provide improvements in speech recognition systems.
Recognizing that acoustic vectors are important instances of high dimensional statistical data for which feature recognition can enhance the performance of density estimation in classification and training, we approach the problem in three steps: first, by considering projections of the high dimensional data onto lower dimensional subspaces, say one-dimensional subspaces; second, by estimating the univariate probability densities of the projected data via known univariate techniques; and third, by reconstructing the density in the original higher dimensional space from the collection of univariate densities so obtained. In some sense, the approach is reminiscent of function reconstruction from projections (e.g., in computerized tomography). The reconstructed density is by no means unique unless further restrictions on the estimated density are imposed. The variety of choices of candidate univariate densities, as well as the choices of subspaces on which to project the data, including their number, further add to this non-uniqueness. One can then consider, as a solution to this problem, probability density functions that maximize a certain optimality criterion. For the purpose of the present invention, we consider those probability density functions that either maximize the entropy functional or, alternatively, maximize the likelihood associated with the data.
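The three steps of projection, univariate estimation, and reconstruction can be sketched as follows under strong simplifying assumptions. The sketch assumes that the projection directions form an orthonormal basis of the space, in which case the product of the univariate densities along those directions is the maximum entropy density having those marginals. The Gaussian kernel estimator, the fixed bandwidth, and all function names are assumptions of this illustration; it does not optimize the directions and is not presented as the method of the invention.

```python
import math


def project(data, direction):
    """Project each vector onto a single direction (inner product)."""
    return [sum(xi * wi for xi, wi in zip(x, direction)) for x in data]


def kde_1d(samples, bandwidth):
    """Univariate Gaussian kernel density estimate (assumed estimator)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
                          for s in samples)

    return density


def reconstruct_density(data, directions, bandwidth=0.5):
    """Product of univariate densities along orthonormal directions."""
    marginals = [kde_1d(project(data, w), bandwidth) for w in directions]

    def density(x):
        p = 1.0
        for w, f in zip(directions, marginals):
            p *= f(sum(xi * wi for xi, wi in zip(x, w)))
        return p

    return density
```

For two-dimensional data, `reconstruct_density(data, [[s, s], [s, -s]])` with `s = 1 / math.sqrt(2)` builds a density from the marginals along the two diagonals. Because the directions are orthonormal, the change of variables has unit Jacobian and the product integrates to one. Different choices of directions generally give different reconstructed densities, which is the non-uniqueness the text refers to.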