1. Field of the Invention
This disclosure relates to speech recognition and, more particularly, to gender-dependent models for continuous speech recognition.
2. Description of the Related Art
Gender-dependent speech recognition systems are usually created by splitting, or fragmenting, the training data by gender and building two separate acoustic models, one for each gender. Fragmenting assumes that every state of a sub-phonetic model is uniformly dependent on gender. These gender-dependent systems have not yielded any significant improvements in speech recognition. Two important disadvantages of conventional gender-dependent speech recognition systems are 1) the fragmenting of training data even when unnecessary and 2) the need to store a complete acoustic model for each gender.
Acoustic training data is typically divided into 10 msec segments called frames. For example, 1 sec of speech would contain 100 frames. Each frame is represented by an acoustic feature vector. The sequence of acoustic vectors is aligned to the phonetic transcription of the utterance. Each phone has three states. After alignment, each state has an associated subset of acoustic vectors, which can be modeled by Gaussian prototypes.
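By way of illustration only, the framing arithmetic and the per-state modeling described above may be sketched as follows. The sketch assumes that an alignment has already assigned a state label to every frame, and it simplifies the Gaussian prototypes to a single diagonal Gaussian per state. The function names `num_frames` and `fit_state_gaussians` and the state labels such as `ae_1` are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict

FRAME_MS = 10  # frame duration stated in the text


def num_frames(duration_ms):
    """Number of whole 10 msec frames in a duration given in milliseconds."""
    return duration_ms // FRAME_MS


def fit_state_gaussians(vectors, state_labels):
    """Fit one diagonal Gaussian (mean and variance per dimension) per state.

    vectors: list of equal-length feature vectors, one per frame.
    state_labels: state identifier for each frame, taken from the alignment.
    Assumption: a single Gaussian per state stands in for the prototypes.
    """
    # Collect the subset of acoustic vectors associated with each state.
    by_state = defaultdict(list)
    for vec, state in zip(vectors, state_labels):
        by_state[state].append(vec)

    models = {}
    for state, vecs in by_state.items():
        n = len(vecs)
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / n for d in range(dim)]
        models[state] = (mean, var)
    return models
```

For instance, `num_frames(1000)` gives the 100 frames mentioned above for 1 sec of speech, and a phone with three states would contribute three entries to the dictionary returned by `fit_state_gaussians`.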
Current acoustic models are built by querying the surrounding context to model context-dependent variations. For example, the phone ae in the context of cat (k ae t) may differ from ae in married, since the surrounding context in cat (k and t) is different from that in married (m and r). This difference in context does not necessarily mean that the realization differs with respect to gender. However, the realization of ae in a given context, such as cat, may still vary across gender.
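As a minimal sketch of what querying the surrounding context may look like, the following lists the left and right neighbors of each phone in a transcription. The function names `phone_contexts` and `contexts_of`, the boundary symbol `#`, and the phone sequence used for married beyond its first three phones are assumptions made for illustration only.

```python
def phone_contexts(phones, boundary="#"):
    """Return (left, phone, right) triples for every phone in a sequence.

    The boundary symbol marks the start and end of the word (an assumption).
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def contexts_of(target, phones):
    """Return the (left, right) contexts in which a target phone occurs."""
    return [
        (left, right)
        for left, phone, right in phone_contexts(phones)
        if phone == target
    ]


cat = ["k", "ae", "t"]
married = ["m", "ae", "r", "iy", "d"]  # illustrative transcription

print(contexts_of("ae", cat))      # context of ae in cat
print(contexts_of("ae", married))  # context of ae in married
```

The two calls yield different contexts for the same phone ae, which is the kind of variation a context-dependent model captures. Nothing in such a query distinguishes speakers by gender.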
Therefore, a need exists to model gender differences that are not sufficiently modeled by context-dependent variations. A need also exists for a gender-dependent speech recognition method that takes advantage of phonetic differences in the speech patterns of different genders without fragmenting training data when unnecessary, and that reduces the amount of storage space required by the speech recognition system.