1. Technical Field
The present invention relates generally to speech recognition; and, more particularly, it relates to the generation of speaker dependent hidden Markov models for speech recognition training within systems that employ speech recognition.
2. Related Art
Conventional speech recognition systems commonly employ complex training methods. While these complex training methods do in fact provide good model parameter estimation, in the instance where little training data is available, the quality of the model parameter estimation is significantly compromised. In addition, these conventional speech recognition systems that employ these complex training methods inherently require a significant amount of training data. However, it is not practical to collect a significant amount of speech for speaker dependent training systems. Conventional speech recognition systems simply do not perform well with limited training and limited training data. Moreover, these conventional speech recognition systems require a significant amount of memory and processing resources to perform the complex training methods. Such a memory and computationally intensive solution for training a speech recognition system is not amenable to embedded systems. While these conventional speech recognition systems that employ complex training methods are quite amenable where there is a large availability of such memory and processing resources, they are not transferable to systems where memory and processing resources are constrained and limited, i.e., embedded systems.
Certain conventional systems also employ hidden Markov modeling (HMM) for training of the speaker dependent speech recognition system. For example, conventional systems that seek to represent a large number of states inherently require a significant amount of processing during training of the speaker dependent speech recognition system, and therefore a significant amount of memory and processing resources. The conventional methods of training speaker dependent speech recognition systems simply do not provide an adequate method of performing simplified training when the memory and processing resources of the speech recognition system are constrained.
Further limitations and disadvantages of conventional and traditional systems will become apparent to one of skill in the art through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.
Various aspects of the present invention can be found in a speech processing system that determines an end-point and a beginning point of a word within a speech utterance. The speech processing system contains, among other things, an element selection circuitry, a speech segmentation circuitry, and a state determination circuitry. The element selection circuitry selects a subset of elements from a feature vector using various criteria. These criteria include, among other things, clustering and correlation between multiple elements of the speech utterance. The speech segmentation circuitry uses the selected elements from the feature vector to determine the boundaries of the segments of the speech utterance. Subsequently, after the segments of the speech utterance are chosen, the feature vectors for each frame in each speech segment are used to estimate the model parameters of the speech utterance. For embodiments of the invention that employ uniform segmentation, no subset of elements from the feature vector needs to be selected, as the speech utterance is segmented into a number of segments having equal width irrespective of the feature vector.
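By way of illustration only, the uniform segmentation and per-segment parameter estimation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `uniform_segments` and `estimate_state_params`, the use of a diagonal Gaussian (mean and variance per element) as the model parameters, and the variance floor `var_floor` are all hypothetical choices that the text above does not specify.

```python
from statistics import fmean


def uniform_segments(num_frames, num_states):
    """Split frame indices 0..num_frames-1 into num_states contiguous
    segments of near-equal width; returns (start, end) pairs, end exclusive."""
    if not 1 <= num_states <= num_frames:
        raise ValueError("need 1 <= num_states <= num_frames")
    # Integer arithmetic keeps the boundaries deterministic and strictly increasing.
    bounds = [i * num_frames // num_states for i in range(num_states + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def estimate_state_params(frames, segments, var_floor=1e-3):
    """Estimate one (means, variances) pair per segment.

    Assumption: each state is modeled by a diagonal Gaussian; var_floor is a
    hypothetical safeguard against zero variance in very short segments.
    """
    params = []
    for start, end in segments:
        # Transpose the segment so each entry holds one vector element over time.
        dims = list(zip(*frames[start:end]))
        means = [fmean(d) for d in dims]
        variances = [
            max(fmean((x - m) ** 2 for x in d), var_floor)
            for d, m in zip(dims, means)
        ]
        params.append((means, variances))
    return params


# Usage: four one-dimensional frames modeled with two states.
frames = [[0.0], [2.0], [4.0], [6.0]]
segments = uniform_segments(len(frames), 2)
params = estimate_state_params(frames, segments)
```

Because the segment boundaries depend only on the frame count, no subset of feature vector elements has to be selected in this uniform case.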
A training token is a speech utterance used to perform training of the speech recognition system. For single token training, the training method is very simple: the model parameters are estimated from the single training token. For training that employs multiple training tokens, the same segmentation process is performed to segment each training token, and the model parameters are estimated using the segments of all of the training tokens. In another implementation, the first training token is segmented either uniformly or non-uniformly, as described above, and the model parameters are estimated based on that segmentation. The other training tokens, in embodiments employing multiple training tokens, are then segmented using the previously estimated model parameters and a single iteration of Viterbi alignment. The new segments are used to update the previously estimated model parameters.
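The multiple token procedure just described (segment the first token, estimate parameters, then align each further token with a single Viterbi pass and update) can be sketched as shown below. The sketch makes several assumptions that are not drawn from the text: states are diagonal Gaussians, the model is strictly left-to-right with no skipped states, transition probabilities are ignored, the first token is segmented uniformly, and the update step re-estimates the parameters from the pooled frames of all tokens aligned so far. All function names are hypothetical.

```python
import math


def uniform_segments(num_frames, num_states):
    bounds = [i * num_frames // num_states for i in range(num_states + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def pooled_params(tokens, segmentations, var_floor=1e-3):
    """Re-estimate (means, variances) per state from all aligned tokens."""
    model = []
    for s in range(len(segmentations[0])):
        pooled = [f for tok, segs in zip(tokens, segmentations)
                  for f in tok[segs[s][0]:segs[s][1]]]
        dims = list(zip(*pooled))
        means = [sum(d) / len(d) for d in dims]
        variances = [max(sum((x - m) ** 2 for x in d) / len(d), var_floor)
                     for d, m in zip(dims, means)]
        model.append((means, variances))
    return model


def log_gauss(x, mean, var):
    """Log-likelihood of frame x under a diagonal Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def viterbi_segments(frames, model):
    """Single Viterbi alignment pass; returns one (start, end) pair per state."""
    T, S = len(frames), len(model)
    if T < S:
        raise ValueError("token has fewer frames than states")
    neg = float("-inf")
    score = [[neg] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_gauss(frames[0], *model[0])
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            if best > neg:  # skip states not yet reachable at frame t
                score[t][s] = best + log_gauss(frames[t], *model[s])
                back[t][s] = prev
    # Backtrack from the final state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    bounds = [path.index(s) for s in range(S)] + [T]
    return list(zip(bounds[:-1], bounds[1:]))


def train(tokens, num_states):
    """Segment the first token uniformly, then align and update per token."""
    segs = [uniform_segments(len(tokens[0]), num_states)]
    model = pooled_params(tokens[:1], segs)
    for i, tok in enumerate(tokens[1:], start=1):
        segs.append(viterbi_segments(tok, model))
        model = pooled_params(tokens[:i + 1], segs)
    return model, segs


# Usage: two one-dimensional tokens of different lengths, two states.
tokens = [[[0.0], [0.1], [5.0], [5.1]],
          [[0.0], [0.1], [0.0], [5.0], [5.1]]]
model, segs = train(tokens, 2)
```

Each additional token costs one dynamic programming pass over a table of frames by states, which is the kind of bounded workload the text contrasts with iterative training methods.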
The state determination circuitry determines a number of states to be modeled from the speech utterance. The number of states of the speech utterance corresponds exactly to the number of segments into which the speech utterance is segmented. The number of states of the speech utterance is determined by the number of frames in the end-pointed speech utterance. The segmentation of the speech utterance is uniform in certain embodiments of the invention, and non-uniform in other embodiments of the invention. If desired, the speech processing system is operable within either a speech recognition training system or a speech recognition system.
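As a purely hypothetical illustration of deriving the number of states from the number of frames, one possible rule is sketched below. The text above states only that the state count is determined by the frame count of the end-pointed utterance; the proportional rule, the function name `num_states_for`, and the values of `frames_per_state`, `min_states`, and `max_states` are assumptions chosen for this example.

```python
def num_states_for(num_frames, frames_per_state=5, min_states=3, max_states=15):
    """Map the frame count of an end-pointed utterance to a state count.

    Assumed rule: roughly one state per frames_per_state frames, clamped to
    [min_states, max_states] and never more states than frames.
    """
    if num_frames < 1:
        raise ValueError("utterance must contain at least one frame")
    n = num_frames // frames_per_state
    n = max(min_states, min(max_states, n))
    return min(n, num_frames)


# Usage: a 50-frame utterance is modeled with 10 states, hence 10 segments.
states = num_states_for(50)
```

Under any such rule, the returned value is also the number of segments into which the utterance is divided, since states and segments correspond one to one.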
Other aspects of the invention can be found in a speech recognition training system that generates a model used to perform speech recognition on a speech signal. The speech recognition training system contains, among other things, a model generation circuitry, a speech segmentation circuitry, and a state determination circuitry. As described above, the feature vector is generated from the speech signal. The speech segmentation circuitry uses a number of elements from the feature vector to perform segmentation of the speech signal. The state determination circuitry determines a number of states of the speech signal. The number of states of the speech utterance corresponds exactly to the number of segments into which the speech utterance is segmented. The number of states of the speech utterance is determined by the number of frames in the end-pointed speech utterance. Moreover, in certain embodiments of the invention, a single iteration of Viterbi alignment is performed to determine the segmentation of the speech signal. In various embodiments, the segmentation of the speech signal is uniform; in others, the segmentation of the speech signal is non-uniform. Moreover, the speech recognition training system performs end-point detection of a speech utterance of the speech signal.
Still other aspects of the invention can be found in a method that generates a model used to perform speech recognition on a speech signal. The method involves, among other steps, the selection of a number of elements from a feature vector, the segmentation of the speech signal using the number of elements from the feature vector, and the determination of a number of states of the speech signal. As described above, in various embodiments of the invention, the segmentation of the speech signal is uniform; in others, it is non-uniform.
Other aspects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.