Speech recognition systems have become a common form of input for computer systems. A typical speech recognition system captures an audible signal and analyzes it for recognizable components of human speech. Segmentation of speech into units, such as phonemes, syllables or vowels, provides information about both phonological and rhythmic aspects of speech. Phonemes (sometimes called phones) are generally regarded as the minimal meaningful phonological segments of speech. Phonemes include vowels and consonants. The term syllable is used to describe a segment of speech consisting of vowels alone or of vowels with consonants preceding or following. Usually vowels constitute the syllable nucleus. Detection of phone, vowel, and syllable boundaries therefore plays an important role in speech recognition and natural language understanding.
In many spoken language processing applications, it is useful to determine where a syllable begins and ends within a sample speech signal. Since a spoken syllable typically includes a vowel portion as the syllable nucleus and may or may not include a consonant portion, an important key to syllable boundary detection is detection of the vowel and/or vowel boundary within a syllable. A phoneme boundary can be detected after the vowel or syllable boundary is detected by using more traditional features, such as energy, voice probability, zero crossing, spectral change rate at different FFT frequency bins, cepstrum, delta cepstrum, delta-delta cepstrum, frame-based phoneme probability, and lip movement obtained by analyzing video images of the lips, with or without auditory attention cues.
Researchers have found supporting arguments indicating that syllables are one of the most important elements in human speech perception. Segmentation of speech into syllabic units provides insights regarding speech rate, rhythm, prosody, speech recognition, and speech synthesis.
A syllable contains a central peak of sonority (the syllable nucleus), which is usually a vowel, and the consonants that cluster around this central peak. Most of the work in the literature focuses on syllable nucleus detection, since syllable nuclei are easier to locate, and can be located more reliably, than precise syllable boundaries. For syllable nucleus detection, most of the existing methods rely on estimating a one-dimensional continuous curve from extracted short-time acoustic features and performing a peak search on the curve to locate syllable nuclei. Some of the acoustic features that are used to locate syllable nuclei include energy in selected critical bands, linear predictive coding spectra, subband-based correlation, pitch, voicing, etc. Some examples of state-of-the-art work in this field include:
“Robust Speech Rate Estimation for Spontaneous Speech”, Dagen Wang and Shrikanth S. Narayanan, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 8, November 2007, pp. 2190-2201.
“Segmentation of Speech into Syllable-like units”, T. Nagarajan et al., EUROSPEECH 2003, Geneva, pp. 2893-2896.
“Speech rhythm guided syllable nuclei detection”, Y. Zhang and J. Glass, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, April 2009, pp. 3797-3800.
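By way of illustration only, the general curve-and-peak-search approach described above may be sketched as follows. This is a minimal sketch and not the method of any of the cited works: it assumes short-time energy as the single acoustic feature, a moving average as the smoothing step, and a simple thresholded local-maximum search. All function names, parameter names, and default values (`frame_ms`, `hop_ms`, `smooth_frames`, `rel_threshold`, `min_gap_ms`) are hypothetical and chosen only for the example.

```python
def short_time_energy(signal, frame_len, hop):
    """Mean squared amplitude of each frame (the one-dimensional curve)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def smooth(curve, width):
    """Moving average over `width` frames, with the window clipped at the ends."""
    half = width // 2
    out = []
    for i in range(len(curve)):
        lo = max(0, i - half)
        hi = min(len(curve), i + half + 1)
        out.append(sum(curve[lo:hi]) / (hi - lo))
    return out


def pick_peaks(curve, rel_threshold, min_distance):
    """Local maxima at or above rel_threshold * max(curve).

    Peaks closer than `min_distance` frames are merged, keeping the higher one.
    """
    if not curve:
        return []
    floor = rel_threshold * max(curve)
    peaks = []
    for i in range(1, len(curve) - 1):
        is_peak = curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
        if not is_peak or curve[i] < floor:
            continue
        if peaks and i - peaks[-1] < min_distance:
            if curve[i] > curve[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return peaks


def detect_syllable_nuclei(signal, sample_rate, frame_ms=25, hop_ms=10,
                           smooth_frames=5, rel_threshold=0.3, min_gap_ms=100):
    """Return estimated nucleus times in seconds (hypothetical defaults)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    curve = smooth(short_time_energy(signal, frame_len, hop), smooth_frames)
    peaks = pick_peaks(curve, rel_threshold, max(1, min_gap_ms // hop_ms))
    # Report the centre of each peak frame as the nucleus time.
    return [(p * hop + frame_len / 2) / sample_rate for p in peaks]
```

Calling `detect_syllable_nuclei(samples, 8000)` on a list of audio samples returns a list of candidate nucleus times. Even this small sketch exposes five settings (frame length, hop size, smoothing width, relative threshold, and minimum peak spacing), each of which affects which peaks are reported.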
Usually these traditional methods require tuning many parameters, which is undesirable because it makes the methods hard to use in different settings or conditions, e.g., with new data or under new conditions such as a different speaking style or different noise conditions. In addition, the traditional methods usually focus on vague syllable nuclei detection.
It is within this context that embodiments of the present invention arise.