1. Field of the Invention
The present invention generally relates to methods of speech recognition and, more particularly, to nonparametric density estimation of high dimensional data for use in training models for speech recognition.
2. Background Description
In the present invention, we are concerned with nonparametric density estimation of high dimensional data. The invention is driven by its potential application to training speech data where traditionally only parametric methods have been used. Parametric models typically lead to large scale optimization problems associated with a desire to maximize the likelihood of the data. In particular, mixture models of Gaussians are used for training acoustic vectors for speech recognition, and the parameters of the model are obtained by using K-means clustering and the EM algorithm; see F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Mass., 1998. Here we consider the possibility of maximizing the penalized likelihood of the data as a means to identify nonparametric density estimators; see I. J. Good and R. A. Gaskin, “Nonparametric roughness penalties for probability densities,” Biometrika 58, pp. 255-77, 1971. We develop various mathematical properties of this point of view, propose several algorithms for the numerical solution of the optimization problems we encounter, and report on some of our computational experience with these methods. In this regard, we integrate within our framework a technique that is central in many aspects of the statistical analysis of acoustic data, namely the Baum Welch algorithm, which is especially important for the training of Hidden Markov Models; see again the book by F. Jelinek, cited above.
Let us recall the mechanism by which density estimation of high dimensional data arises in speech recognition. In this context, a principal task is to convert acoustic waveforms into text. The first step in the process is to isolate important features of the waveform over small time intervals (typically 25 ms). These features, represented by a vector $x \in R^d$ (where d usually is 39), are then identified with context dependent sounds, for example, phonemes such as “AA”, “AE”, “K”, “H”. Strings of such basic sounds are then converted into words using a dictionary of acoustic representations of words. For example, the phonetic spelling of the word “cat” is “K AE T”. In an ideal situation the feature vectors generated by the speech waveform would be converted into a string of phonemes “K . . . K AE . . . AE T . . . T” from which we can recognize the word “cat” (unfortunately, a phoneme string seldom matches the acoustic spelling exactly).
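The idealized conversion from a per-frame phoneme string to a word can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `PRONUNCIATIONS` and the helper names `collapse_repeats` and `lookup_word` are hypothetical, and a practical decoder must cope with phoneme strings that do not match the acoustic spelling exactly.

```python
from itertools import groupby

# Hypothetical pronunciation dictionary: phonetic spelling -> word.
PRONUNCIATIONS = {("K", "AE", "T"): "cat"}

def collapse_repeats(frame_labels):
    """Merge runs of identical per-frame labels, e.g. K K AE AE T -> K AE T."""
    return tuple(label for label, _ in groupby(frame_labels))

def lookup_word(frame_labels, dictionary=PRONUNCIATIONS):
    """Return the word whose phonetic spelling matches exactly, else None."""
    return dictionary.get(collapse_repeats(frame_labels))

# One label per short time interval of the waveform (idealized case).
frames = ["K", "K", "K", "AE", "AE", "T", "T", "T", "T"]
word = lookup_word(frames)
```

In this idealized case `word` is the entry stored for the spelling K AE T; any frame sequence whose collapsed form is not in the dictionary yields `None`.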
One of the important problems associated with this process is to identify a phoneme label for an individual acoustic vector x. Training data is provided for the purpose of classifying a given acoustic vector. A standard approach for classification in speech recognition is to generate initial “prototypes” by K-means clustering and then refine them by using the EM algorithm based on mixture models of Gaussian densities, cf. F. Jelinek, cited above. Moreover, in the decoding stage of speech recognition (formation of Hidden Markov Models) the output probability density functions are most commonly assumed to be a mixture of Gaussian density functions, cf. L. E. Baum and J. A. Eagon, “An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model of ecology,” Bull. Amer. Math. Soc. 73, pp. 360-63, 1967; L. A. Liporace, “Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree,” IEEE Trans. on Information Theory 5, pp. 729-34, 1982; R. A. Gopinath, “Constrained maximum likelihood modeling with gaussian distributions,” Broadcast News Transcription and Understanding Workshop, 1998.
According to this invention, we adopt the commonly used approach to classification and think of the acoustic vectors for a given sound as a random variable whose density is estimated from the data. When the densities are found for all the basic sounds (this is the training stage) an acoustic vector is assigned the phoneme label corresponding to the highest scoring likelihood (probability). This information is the basis of the decoding of acoustic vectors into text.
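The labeling rule just described can be sketched as follows. This is a minimal illustration, not the estimator of the invention: the diagonal Gaussian densities, the two-dimensional vectors, and the names `diag_gaussian_log_density` and `classify` are assumptions made only to keep the example self-contained. Any density estimate for a sound could be supplied in place of the Gaussian.

```python
import math

def diag_gaussian_log_density(mean, var):
    """Build a log density for a diagonal Gaussian (illustrative stand-in only)."""
    def log_density(x):
        return sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
    return log_density

def classify(x, log_densities):
    """Return the label whose estimated density scores x highest."""
    return max(log_densities, key=lambda label: log_densities[label](x))

# Hypothetical trained densities, one per basic sound.
log_densities = {
    "AA": diag_gaussian_log_density([0.0, 0.0], [1.0, 1.0]),
    "AE": diag_gaussian_log_density([3.0, 3.0], [1.0, 1.0]),
}
label_near_origin = classify([0.2, -0.1], log_densities)
label_near_three = classify([2.8, 3.3], log_densities)
```

Working with log densities rather than densities avoids underflow when d is large, and it does not change which label attains the maximum.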
Since in speech recognition x is typically a high dimensional vector and each basic sound has only several thousand data vectors to model it, the training data is relatively sparse. Recent work on the classification of acoustic vectors, see S. Basu and C. A. Micchelli, “Maximum likelihood estimation for acoustic vectors in speech recognition,” Advanced Black-Box Techniques For Nonlinear Modeling: Theory and Applications, Kluwer Publishers (1998), demonstrates that mixture models with non-Gaussian mixture components are useful for parametric density estimation of speech data. We explore the use of nonparametric techniques. Specifically, we use the penalized maximum likelihood approach introduced by Good and Gaskin, cited above. We combine the penalized maximum likelihood approach with the use of the Baum Welch algorithm, which is often used in speech recognition for training Hidden Markov Models (HMMs), see L. E. Baum, T. Petrie, G. Soules and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics 41, No. 1, pp. 164-71, 1970; Baum and Eagon, cited above. (This algorithm is a special case of the celebrated EM algorithm as described, e.g., in A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of Royal Statistical Soc. 39(B), pp. 1-38, 1977.)
We begin by recalling that one of the most widely used nonparametric density estimators has the form
$$f_n(x) = \frac{1}{n h^d} \sum_{i \in Z_n} k\left(\frac{x - x_i}{h}\right), \quad x \in R^d, \qquad (1)$$
where $Z_n = \{1, \dots, n\}$, k is some specified function, and $\{x_i : i \in Z_n\}$ is a set of observations in $R^d$ of some unknown random variable, cf. T. Cacoullos, “Estimates of a multivariate density,” Annals of the Institute of Statistical Mathematics 18, pp. 178-89, 1966; E. Parzen, “On the estimation of a probability density function and the mode,” Annals of the Institute of Statistical Mathematics 33, pp. 1065-76, 1962; M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” Annals of Mathematical Statistics 27, pp. 832-37, 1956. It is well known that this estimator converges almost surely to the underlying probability density function (PDF) provided that the kernel k is strictly positive on $R^d$, $\int_{R^d} k(x)\,dx = 1$, $h \to 0$, $nh \to \infty$, and $n \to \infty$. The problem of how best to choose n and h for a fixed kernel k for the estimator (1) has been thoroughly discussed in the literature, cf. L. Devroye and L. Györfi, Nonparametric Density Estimation, The L1 View, John Wiley and Sons, New York, 1985.
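The estimator (1) can be evaluated directly from its definition. The sketch below assumes a standard Gaussian kernel, which is strictly positive and integrates to one; the sample values, the bandwidth h = 0.5, and the function names are illustrative choices and do not come from the text.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel on R^d: strictly positive, integrates to 1."""
    d = len(u)
    return math.exp(-0.5 * sum(t * t for t in u)) / (2.0 * math.pi) ** (d / 2.0)

def kernel_density(x, samples, h, kernel=gaussian_kernel):
    """Evaluate f_n(x) = (1 / (n h^d)) * sum_i k((x - x_i) / h)."""
    n, d = len(samples), len(x)
    total = sum(
        kernel([(a - b) / h for a, b in zip(x, xi)]) for xi in samples
    )
    return total / (n * h ** d)

# Illustrative scalar observations (d = 1) and bandwidth.
samples = [[-1.0], [0.0], [0.5], [2.0]]
h = 0.5

# Riemann sum over [-10, 10] as a sanity check that f_n integrates to about 1.
step = 0.01
mass = sum(
    kernel_density([-10.0 + j * step], samples, h) * step for j in range(2001)
)
```

Each evaluation costs one kernel call per observation, so evaluating the estimate at many points grows linearly with the size of the training set.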
In this invention, we are led, by the notion of penalized maximum likelihood estimation (PMLE), to density estimators of the form
$$f(x) = \sum_{i \in Z_n} c_i\, k(x, x_i), \quad x \in R^d, \qquad (2)$$
where $k(x, y)$, $x, y \in R^d$, is the reproducing kernel of some Hilbert space H, cf. S. Saitoh, Theory of Reproducing Kernels and its Applications, Pitman Research Notes in Mathematical Analysis, Longman Scientific and Technical, Essex, UK, 189, 1988.
Among the methods we consider, the coefficients in this sum are chosen to maximize the homogeneous polynomial
$$\Pi(Kc) := \prod_{i \in Z_n} \Big( \sum_{j \in Z_n} K_{ij} c_j \Big), \quad c = (c_1, \dots, c_n)^T, \qquad (3)$$
over the simplex
$$S^n = \{c : c \in R_+^n,\ e^T c = 1\}, \qquad (4)$$
where $e = (1, \dots, 1)^T \in R^n$,
$$R_+^n = \{c : c = (c_1, \dots, c_n)^T,\ c_i \ge 0,\ i \in Z_n\} \qquad (5)$$
is the positive orthant, and K is the matrix
$$K = \{K_{ij}\}_{i,j \in Z_n} = \{k(x_i, x_j)\}_{i,j \in Z_n}, \qquad (6)$$
and we accomplish this numerically by the use of the Baum Welch algorithm, cf. L. E. Baum, T. Petrie, G. Soules and N. Weiss, cited above; L. E. Baum and J. A. Eagon, cited above. A polynomial in the factored form (3) appears in the method of deleted interpolation which occurs in language modeling, see L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, and D. Nahamoo, “A fast algorithm for deleted interpolation,” Proceedings Eurospeech 3, pp. 1209-12, 1991. In the context of geometric modeling such polynomials have been called lineal polynomials, see A. S. Cavaretta and C. A. Micchelli, “Design of curves and surfaces by subdivision algorithms,” in Mathematical Methods in Computer Aided Geometric Design, T. Lyche and L. Schumaker (eds.), Academic Press, Boston, 1989, pp. 115-53, and C. A. Micchelli, Mathematical Aspects of Geometric Modeling, CBMS-NSF Regional Conference Series in Applied Mathematics 65, SIAM, Philadelphia, 1995. We will compare the Baum Welch algorithm to the degree raising algorithm, which can also be used to find the maximum of a homogeneous polynomial over a simplex, see C. A. Micchelli and A. Pinkus, “Some remarks on nonnegative polynomials on polyhedra,” in Probability, Statistics and Mathematics: Papers in Honor of Samuel Karlin, T. W. Anderson, K. B. Athreya and D. L. Iglehart (eds.), Academic Press, San Diego, pp. 163-86, 1989. We also elaborate upon the connection of these ideas to the problem of the diagonal similarity of a symmetric nonsingular matrix with nonnegative elements to a doubly stochastic matrix, see M. Marcus and M. Newman, “The permanent of a symmetric matrix,” Abstract 587-85, Amer. Math. Soc. Notices 8, 595; R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” Ann. Math. Statist. 38, pp. 439-55, 1964. This problem has attracted active interest in the literature, see M. Bacharach, “Biproportional Matrices and Input-Output Change,” Monograph 16, Cambridge University Press, 1970; L. M. Bergman, “Proof of the convergence of Sheleikhovskii's method for a problem with transportation constraints,” USSR Computational Mathematics and Mathematical Physics 1(1), pp. 191-204, 1967; S. Brualdi, S. Parter and H. Schneider, “The diagonal equivalence of a non-negative matrix to a stochastic matrix,” J. Math. Anal. and Appl. 16, pp. 31-50, 1966; J. Csima and B. N. Datta, “The DAD theorem for symmetric non-negative matrices,” Journal of Combinatorial Theory 12(A), pp. 147-52, 1972; G. M. Engel and H. Schneider, “Algorithms for testing the diagonal similarity of matrices and related problems,” SIAM Journal of Algorithms in Discrete Mathematics 3(4), pp. 429-38, 1982; T. E. S. Raghavan, “On pairs of multidimensional matrices,” Linear Algebra and Applications 62, pp. 263-68, 1984; G. M. Engel and H. Schneider, “Matrices diagonally similar to a symmetric matrix,” Linear Algebra and Applications 29, pp. 131-38, 1980; J. Franklin and J. Lorenz, “On the scaling of multidimensional matrices,” Linear Algebra and Applications 114/115, pp. 717-35, 1989; D. Hershkowitz, U. G. Rothblum and H. Schneider, “Classification of nonnegative matrices using diagonal equivalence,” SIAM Journal on Matrix Analysis and Applications 9(4), pp. 455-60, 1988; S. Karlin and L. Nirenberg, “On a theorem of P. Nowosad,” Mathematical Analysis and Applications 17, pp. 61-67, 1967; A. W. Marshall and I. Olkin, “Scaling of matrices to achieve specified row and column sums,” Numerische Mathematik 12, pp. 83-90, 1968; M. V. Menon and H. Schneider, “The spectrum of a nonlinear operator associated with a matrix,” Linear Algebra and its Applications 2, pp. 321-34, 1969; P. Nowosad, “On the integral equation Kf=1/f arising in a problem in communication,” Journal of Mathematical Analysis and Applications 14, pp. 484-92, 1966; U. G. Rothblum, “Generalized scaling satisfying linear equations,” Linear Algebra and Applications 114/115, pp. 765-83, 1989; U. G. Rothblum and H. Schneider, “Scalings of matrices which have prescribed row sums and column sums via optimization,” Linear Algebra and Applications 114/115, pp. 737-64, 1989; U. G. Rothblum and H. Schneider, “Characterization of optimal scaling of matrices,” Mathematical Programming 19, pp. 121-36, 1980; U. G. Rothblum, H. Schneider and M. H. Schneider, “Scaling matrices to prescribed row and column maxima,” SIAM J. Matrix Anal. Appl. 15, pp. 1-14, 1994; B. D. Saunders and H. Schneider, “Flows on graphs applied to diagonal similarity and diagonal equivalence for matrices,” Discrete Mathematics 24, pp. 205-20, 1978; B. D. Saunders and H. Schneider, “Cones, graphs and optimal scaling of matrices,” Linear and Multilinear Algebra 8, pp. 121-35, 1979; M. H. Schneider, “Matrix scaling, entropy minimization and conjugate duality. I Existence conditions,” Linear Algebra and Applications 114/115, pp. 785-813, 1989; R. Sinkhorn, cited above; R. Sinkhorn and P. Knopp, “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific J. Math. 212, pp. 343-48, 1967, and has diverse applications in economics, operations research, and statistics.
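As an illustration of how the polynomial (3) might be maximized over the simplex (4), the following sketch applies the Baum-Eagon growth transformation, in which each coefficient is multiplied by its partial derivative and the result is renormalized. For the product in (3), which is homogeneous of degree n, this reduces to the update c_j <- (c_j / n) * sum_i K_ij / (Kc)_i. The Gaussian choice of k, the scalar data, the bandwidth, and all function names are assumptions made for the example and are not taken from the text.

```python
import math

def gaussian_gram(points, h):
    """Matrix K_ij = k(x_i, x_j), here with a Gaussian k (illustrative choice)."""
    return [[math.exp(-0.5 * ((a - b) / h) ** 2) for b in points] for a in points]

def log_objective(K, c):
    """Logarithm of the product over i of (Kc)_i, i.e. of the polynomial (3)."""
    return sum(math.log(sum(kij * cj for kij, cj in zip(row, c))) for row in K)

def baum_welch_step(K, c):
    """One growth-transformation step; the result stays on the simplex (4)."""
    n = len(c)
    Kc = [sum(kij * cj for kij, cj in zip(row, c)) for row in K]
    return [
        c[j] / n * sum(K[i][j] / Kc[i] for i in range(n))
        for j in range(n)
    ]

# Illustrative scalar observations; start from the barycenter of the simplex.
points = [-1.0, -0.8, 0.1, 0.3, 2.0]
K = gaussian_gram(points, h=0.5)
c = [1.0 / len(points)] * len(points)

history = [log_objective(K, c)]
for _ in range(50):
    c = baum_welch_step(K, c)
    history.append(log_objective(K, c))
```

Because K has nonnegative entries, each step does not decrease the objective, so `history` is a nondecreasing sequence. The update also preserves nonnegativity and the constraint that the coefficients sum to one, so no projection onto the simplex is needed.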
Several of the algorithms described here were tested numerically. We describe their performance both on actual speech data and on data generated from various standard probability density functions. However, we restrict our numerical experiments to scalar data and will describe elsewhere statistics on word error rate on the Wall Street Journal speech database, as used in S. Basu and C. A. Micchelli, cited above.