In the past few decades, discriminative training (DT) has been a very active research topic in the field of automatic speech recognition (ASR). Many DT methods have been proposed to estimate Gaussian mixture continuous density hidden Markov models (CDHMMs) in a variety of speech recognition tasks, ranging from small vocabulary isolated word recognition to large vocabulary continuous speech recognition. Generally speaking, DT of CDHMMs is a typical optimization problem that starts with the formulation of an objective function according to a certain estimation criterion. Some popular DT criteria widely used in speech recognition include maximum mutual information (MMI), minimum classification error (MCE), minimum word or phone error (MWE or MPE), minimum divergence (MD), and so on. Once the objective function is formulated, an effective optimization method must be used to minimize or maximize the objective function with respect to its CDHMM parameters.
With respect to optimization, several different methods have been used in speech recognition to optimize the derived objective function, including the generalized probabilistic descent (GPD) algorithm based on the first-order gradient descent method, the approximate second-order Quickprop method, and the extended Baum-Welch (EBW) algorithm based on growth transformation.
The GPD and Quickprop methods are mainly used for optimizing the MCE-derived objective function, even though they are general optimization methods that can be used for any type of differentiable objective function. On the other hand, the EBW method was initially proposed to maximize a rational objective function and was later extended to Gaussian mixture CDHMMs for the MMI and MPE (or MWE) objective functions. Recently, the EBW method has also been generalized to optimize the MCE objective function as well as the MD objective function.
The EBW method has been widely accepted for DT because it is relatively easy to implement on word graphs for large scale ASR tasks and because it has been demonstrated to perform well in many tasks. Essentially, all of these optimization methods attempt to search for a nearby locally optimal point of the objective function from an initial point according to both a search direction and a step size. Normally, the search direction is computed locally based on first-order derivatives (such as the gradient), and the step size must be determined empirically in practice. As a result, the performance of these optimization methods depends heavily on the location of the initial point and on the properties of the objective function. If the derived objective function is highly nonlinear, jagged and non-convex in nature, it is extremely difficult to optimize it effectively with any simple optimization algorithm, which is one of the major difficulties of DT of HMMs for speech recognition.
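The dependence on the initial point can be illustrated with a minimal sketch. The one-dimensional function below is a hypothetical stand-in chosen only because it is non-convex with two local minima; it is not a DT objective, and the function names, step size and iteration count are assumptions made for illustration.

```python
def objective(x):
    # Hypothetical non-convex objective with two local minima
    # (an illustrative stand-in, not an MMI/MCE/MPE criterion).
    return x**4 - 3 * x**2 + x


def gradient(x):
    # First-order derivative of the objective above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, step_size=0.01, num_iters=2000):
    # Plain first-order descent: the search direction is the negative
    # gradient and the step size is a fixed, hand-picked constant.
    x = x0
    for _ in range(num_iters):
        x = x - step_size * gradient(x)
    return x


# Two different initial points settle in two different local minima.
left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print("start -2.0 -> x = %.4f, f(x) = %.4f" % (left, objective(left)))
print("start +2.0 -> x = %.4f, f(x) = %.4f" % (right, objective(right)))
```

In this sketch, the run started from the right-hand side stops at a stationary point whose objective value is worse than the one reached from the left-hand side, even though both runs use the same search rule and step size.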
As described herein, various exemplary techniques can be used to optimize CDHMM parameters for applications such as speech recognition or, more generally, pattern recognition.