In language processing, it is common for a human phonetician to segment words into sequences of phonetic elements. The phonetic elements are selected from the International Phonetic Alphabet. Phones are relatively small segments of words which trained linguists can recognize as different-sounding segments of a word (for example, i, e, ae, and s all represent phones). Typically, the phonetician listens to a word and, based upon his or her expertise, matches successive portions with respective phonetic elements to determine the proper phonetic spelling of the pronounced word.
Such phonetic sequences have been applied in standard dictionaries. Also, phonetic sequences have been applied to speech recognition in general, and to speech recognition utilizing Hidden Markov models (hereinafter referred to as "HMM") in particular. In the case of HMM speech recognition, the various phonetic elements are represented by respective HMMs. Each word then corresponds to a sequence of phonetic HMMs.
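By way of illustration only, the correspondence between a word and a sequence of phonetic HMMs can be sketched as a simple lookup. The phone inventory, the lexicon entries, and the model names below are hypothetical placeholders, not part of any particular recognizer; a real system would hold trained HMM parameters in place of the name strings.

```python
# Hypothetical phone inventory: each phone maps to a placeholder for its HMM.
PHONE_MODELS = {"b": "HMM_b", "e": "HMM_e", "t": "HMM_t", "s": "HMM_s"}

# Hypothetical lexicon giving the phonetic spelling of each word.
LEXICON = {"beat": ["b", "e", "t"], "set": ["s", "e", "t"]}


def word_model(word):
    """Return the sequence of phone models that spells the given word."""
    return [PHONE_MODELS[phone] for phone in LEXICON[word]]


print(word_model("beat"))
```

In this sketch the two words share the models for "e" and "t", so a model trained once for a phone is reused by every word whose phonetic spelling contains that phone.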
A sub-element of a phone is the fenone. Fenones often change so rapidly that a trained listener cannot always recognize their occurrence. For example, when the word "beat" is spoken, the phones are recognized as "b", "e", and "t". The fenones within each phone change rapidly, and a single phone can be considered to be a sequence of several fenones. The phone "t" in the word "beat" may contain several fenones, for example, five.
An important consequence of using sub-word building blocks such as phones and fenones is that an automatic speech recognition system can be trained using a relatively small amount of data. The training data need only contain samples of each phone or fenone, instead of several samples of each word. However, if each phone is modelled independently, without regard to the effects of context dependence or co-articulation, the resulting acoustic models may be inaccurate, because the pronunciation of a phone depends on the neighboring phones.
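The dependence of a phone on its neighbors can be made concrete with a minimal sketch that pairs each phone with its left and right neighbors. The function name, the tuple layout, and the "#" word-boundary marker are assumptions made for illustration and do not describe any specific recognizer.

```python
def phone_contexts(phones, boundary="#"):
    """Pair each phone with its left and right neighbors.

    The boundary symbol stands in for the missing neighbor at the
    start and end of the word (an assumption of this sketch).
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# The phone "e" has different neighbors in "beat" and in a hypothetical "set".
print(phone_contexts(["b", "e", "t"]))
print(phone_contexts(["s", "e", "t"]))
```

A model that ignores context treats the "e" in both words identically, whereas the contexts listed here differ in the left neighbor ("b" versus "s").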
From the above, it can be appreciated that a technique which provides a speech recognition program that dynamically changes the projection, and thus the feature extraction, based upon the position of the present phone or fenone with respect to the neighboring phones or fenones, would be very useful in providing more accurate speech recognition.