Speech recognition systems conventionally use phonemes to model speech. Because the duration of phonemes varies across input speech utterances, a conventional speech recognizer performs a segmentation process on the spoken utterance to divide the utterance into segments of speech, where each segment corresponds to a phonetic or sub-phonetic unit. A conventional speech recognizer further maps the segmented utterance into certain phonemes or Hidden Markov Model (HMM) states to complete the speech recognition process. The accuracy of the speech recognition process is thus dependent on the segmentation performed by the speech recognizer.
Hidden Markov Models (HMMs) are conventionally used to model phonetic units. During conventional HMM expectation maximization (EM) training, the HMMs are updated to increase the likelihood of the training data. Usually, the segmentation of the speech utterances also improves with each iteration of training. For a number of reasons, such as a poor initial model and the independence assumption of the HMM, the segmentation that the HMM performs implicitly during training and subsequent recognition can be poor. Based on the segmentation, the conventional HMM decoder computes phoneme recognition scores that are used to recognize the input speech utterances. The poor segmentation achieved with conventional HMM decoders therefore has a significant negative impact on the accuracy of the speech recognizer.
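To illustrate how a segmentation can arise implicitly from HMM decoding, the following is a minimal sketch of Viterbi alignment over a left-to-right sequence of states. It is not taken from any particular recognizer. The names forced_align and segments, the log_lik input layout, and the simplification that self-loop and advance transitions carry equal weight (so transition scores are omitted) are all assumptions made for illustration.

```python
import math


def forced_align(log_lik):
    """Align frames to a left-to-right chain of states (hypothetical helper).

    log_lik[t][s] is the log-likelihood of frame t under state s.
    Each frame either stays in the current state or advances by one.
    Transition scores are omitted (assumed equal), which is a simplification.
    Returns one state index per frame, starting at 0 and ending at S - 1.
    """
    T, S = len(log_lik), len(log_lik[0])
    if T < S:
        raise ValueError("need at least one frame per state")
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_lik[0][0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + log_lik[t][s]
                back[t][s] = s - 1
            else:
                score[t][s] = stay + log_lik[t][s]
                back[t][s] = s
    # Trace back from the final state at the final frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def segments(path):
    """Collapse a per-frame state path into (state, start, end) spans.

    The end index is exclusive.
    """
    spans, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            spans.append((path[start], start, t))
            start = t
    return spans
```

In this sketch the segment boundaries are simply the frames at which the best-scoring path changes state, so any error in the per-frame scores, for example from a poorly initialized model, carries directly into the segmentation.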
As a result, there exists a need for a system and method that improves the segmentation of speech utterances in a speech recognition system.