Speech recognition is the process by which computers analyze sounds and attempt to characterize them as particular letters, words, or phrases. Generally, a speech recognition system is “trained” with many phoneme examples. A phoneme is a basic unit of sound in a given lexicon. For example, it is generally agreed that the English language possesses approximately 50 unique phonemes. Each phoneme may include several variations in its pronunciation, referred to as allophones. There are approximately 700 allophones in the IBM speech recognition system used hereafter for demonstration purposes. The terms allophones and phonemes are used interchangeably herein.
A speech recognition system examines various features of each phoneme example and mathematically models its sounds in a multidimensional feature space using multiple Gaussian distributions.
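As an illustration only, the modeling of a phoneme with multiple Gaussian distributions can be sketched as a mixture of diagonal-covariance Gaussians. The function names, the two-dimensional feature space, and all numeric values below are assumptions chosen for the example and are not taken from any particular recognition system.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log likelihood of x under a weighted mixture of diagonal Gaussians."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component model of one phoneme in a 2-D feature space.
phoneme_model = {
    "weights": [0.6, 0.4],
    "means": [[0.0, 0.0], [3.0, 3.0]],
    "variances": [[1.0, 1.0], [0.5, 0.5]],
}

near = gmm_log_likelihood([0.1, -0.2], **phoneme_model)
far = gmm_log_likelihood([10.0, 10.0], **phoneme_model)
```

In this sketch a feature vector close to one of the component means (near) receives a higher log likelihood than one far from every component (far).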
Once acoustic models of phonemes are created, input speech to be recognized is sliced into small samples of sound. Each sample is converted into a multidimensional feature vector by analyzing the same features as previously used to examine the phonemes. Speech recognition is then performed by statistically matching the feature vector with the closest phoneme model. Thus, the accuracy of a speech recognition system, commonly measured by its word error rate (WER), is dependent on how well the acoustic models of phonemes represent the sound samples input to the system.
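The slicing and matching steps described above may be sketched as follows. This is a simplified illustration: each phoneme is represented by a single diagonal Gaussian rather than multiple Gaussians, the feature extraction step is omitted, and the phoneme labels, frame sizes, and model parameters are hypothetical.

```python
import math


def frames(signal, size, step):
    """Slice a sequence of samples into overlapping fixed-size frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def closest_phoneme(x, models):
    """Return the label of the model that assigns x the highest likelihood."""
    return max(models, key=lambda name: log_gaussian_diag(x, *models[name]))


# Hypothetical models: label -> (mean vector, variance vector).
models = {
    "AA": ([1.0, 0.0], [1.0, 1.0]),
    "IY": ([-1.0, 2.0], [1.0, 1.0]),
}

sliced = frames(list(range(10)), size=4, step=2)
label = closest_phoneme([0.9, 0.1], models)
```

Here the feature vector [0.9, 0.1] lies nearest the hypothetical "AA" model, so that label is selected. A real system would derive the feature vector from each frame before matching.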
Gender-specific models, i.e., separate female and male acoustic models of phonemes, are known to yield improved recognition accuracy over gender-independent models. The conventional use of such models is to build one system with just female models and one system with just male models. At test time, samples are decoded using both systems in a two-pass approach. While such gender-specific systems provide better speech recognition results, decoding every sample twice generally requires too much computing power and resources to be practical in many real-world applications.
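The cost of the conventional two-pass approach can be seen in a minimal sketch. It assumes, for illustration only, that the result of the pass with the higher total likelihood is kept; the text above does not specify how the two results are combined. All names and model values are hypothetical, and each phoneme is again simplified to a single diagonal Gaussian.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def decode(feature_vectors, models):
    """Label each vector with its best phoneme; return labels and total score."""
    labels, total = [], 0.0
    for x in feature_vectors:
        scores = {p: log_gaussian_diag(x, m, v) for p, (m, v) in models.items()}
        best = max(scores, key=scores.get)
        labels.append(best)
        total += scores[best]
    return labels, total


def two_pass_decode(feature_vectors, female_models, male_models):
    """Decode with both model sets (two full passes) and keep the better one."""
    passes = {
        "female": decode(feature_vectors, female_models),
        "male": decode(feature_vectors, male_models),
    }
    # Assumed selection rule: keep the pass with the higher total likelihood.
    gender = max(passes, key=lambda g: passes[g][1])
    return gender, passes[gender][0]


# Hypothetical gender-specific models: label -> (mean vector, variance vector).
female_models = {"AA": ([2.0, 2.0], [1.0, 1.0]), "IY": ([4.0, 0.0], [1.0, 1.0])}
male_models = {"AA": ([0.0, 0.0], [1.0, 1.0]), "IY": ([1.0, -2.0], [1.0, 1.0])}

result = two_pass_decode([[2.1, 1.9], [3.9, 0.2]], female_models, male_models)
```

Every feature vector is scored against both model sets, so the decoding work is roughly doubled relative to a single gender-independent system.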