I. Field of the Invention
The present invention relates generally to speech recognition. More particularly, the present invention relates to a system and method for segmentation of speech signals for purposes of speech recognition.
II. Description of the Related Art
Pattern recognition techniques have been widely used in speech recognition. The basic idea of these techniques is to compare the input speech pattern with a set of templates, each of which represents a pre-recorded speech pattern in a vocabulary. The recognition result is the vocabulary word associated with the template whose speech pattern is most similar to the input speech pattern.
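By way of illustration only, the template comparison described above may be sketched as follows. The sketch assumes that each speech pattern has already been reduced to a fixed-length list of numeric features and that similarity is measured by Euclidean distance; the function names `distance` and `recognize` and the sample vocabulary are hypothetical and do not appear in any cited reference.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature lists (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(input_pattern, templates):
    """Return the vocabulary word whose template is closest to the input pattern.

    `templates` maps each vocabulary word to its pre-recorded feature list.
    """
    return min(templates, key=lambda word: distance(input_pattern, templates[word]))


# Hypothetical two-word vocabulary with made-up feature values.
templates = {"yes": [1.0, 2.0, 3.0], "no": [3.0, 1.0, 0.0]}
result = recognize([1.1, 2.2, 2.9], templates)
```

A practical recognizer would typically also have to align patterns of differing durations before comparing them, which this sketch does not attempt.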
For human beings, it is usually not necessary to hear every detail of an utterance (e.g., a word) in order to recognize the utterance. This fact shows that there are some natural redundancies inherent in speech. Many techniques have been developed to recognize speech by taking advantage of such redundancies. For example, U.S. Pat. No. 5,056,150 to Yu et al. discloses a real-time speech recognition system in which a nonlinear time-normalization method is used to normalize a speech pattern to a predetermined length by keeping only spectra with significant time-dynamic attributes. Using this method, the speech pattern is compressed significantly, although it may occasionally keep the same spectrum repeatedly.
Another technique for speech recognition employs a sequence of acoustic segments, which represent a sequence of spectral frames. The segments are the basic speech units upon which speech recognition is based. One procedure for generating the acoustic segments, or performing segmentation, is to search for the most probable discontinuity points in the spectral sequence using a dynamic programming method. These selected points are used as the segment boundaries. See J. Cohen, "Segmenting Speech Using Dynamic Programming," J. Acoustic Soc. of America, May 1981, vol. 69(5), pp. 1430-1437. This technique, like the technique of U.S. Pat. No. 5,056,150 described above, is based on searching for significant time-dynamic attributes in the speech pattern.
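By way of illustration only, a dynamic programming search for segment boundaries may be sketched as follows. This is not the probabilistic formulation of the Cohen reference; it substitutes a simpler assumed criterion, namely choosing the boundaries that minimize the total squared deviation of the frames from their segment means. The function names `segment_cost` and `segment`, the fixed number of segments, and the sample frames are hypothetical.

```python
def segment_cost(frames, i, j):
    """Sum of squared deviations of frames[i:j] from their mean (assumed cost)."""
    seg = frames[i:j]
    dim = len(seg[0])
    mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
    return sum((f[d] - mean[d]) ** 2 for f in seg for d in range(dim))


def segment(frames, num_segments):
    """Return the interior boundary indices of the minimum-cost segmentation."""
    n = len(frames)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(frames)")
    inf = float("inf")
    # best[k][j]: minimum cost of splitting frames[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    # back[k][j]: start index of the last segment in that best split.
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(frames, i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Trace back from the end to recover the start index of each segment.
    starts = []
    j = n
    for k in range(num_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    starts.reverse()
    return starts[1:]  # drop the leading 0; the rest are interior boundaries


# Hypothetical one-dimensional "spectral" frames forming three clusters.
frames = [[0.0], [0.1], [0.0], [5.0], [5.1], [9.0], [9.2]]
boundaries = segment(frames, 3)
```

In this sketch the selected boundaries fall where consecutive frames differ most, which corresponds to the discontinuity points described above. Recomputing the segment cost inside the innermost loop keeps the example short at the expense of efficiency.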
Another technique used to segment speech is based on the segmental K-means training procedure. See L. R. Rabiner et al., "A Segmental K-means Training Procedure for Connected Word Recognition," AT&T Technical Journal, May/June 1986, vol. 65(3), pp. 21-31. Using an iterative training procedure, an utterance is segmented into words or subword units. Each of the units is then used as a speech template in a speech recognition system. The iterative training procedure requires many computational steps and therefore cannot be implemented in real time.
These problems and deficiencies are recognized and solved by the present invention in the manner described below.