The present invention relates to a method of speaker adaptation for a Hidden Markov Model based voice recognition system.
Voice recognition systems can be employed in many fields. Possible areas of use include dialog systems that permit purely voice-based communication between a speaker and an information or booking machine over the telephone. Other areas of application are the operation of an in-car infotainment system by the driver and the control of an operation assistance system by a surgeon, in situations where the use of a keyboard would be disadvantageous. Another important area of use is dictation systems, which enable texts to be written faster and more easily.
It is considerably harder to recognize the voice of an arbitrary speaker than that of a known speaker, because different people speak in widely varying ways. For this reason there are speaker-dependent and speaker-independent voice recognition systems. Speaker-dependent voice recognition systems can only be used by a speaker known to the system, but achieve a particularly high recognition rate for this speaker. Speaker-independent voice recognition systems can be used by any speaker, but their recognition rate is well below that of a speaker-dependent voice recognition system. In many applications, for example telephone information and booking systems, speaker-dependent voice recognition systems cannot be used. However, if only a restricted group of people uses a voice recognition system, as in the case of a dictation system, a speaker-dependent voice recognition system is frequently used.
Commercially available voice recognition systems are generally speaker-dependent, with the voice recognition system first being trained to the voice of the speaker before it can be used.
In practice, two methods are frequently used for speaker adaptation. In the first method, vocal tract length normalization (VTLN), the frequency axis of the voice spectrum is stretched or compressed linearly in order to align the spectrum with that of a reference speaker.
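The linear stretching or compression of the frequency axis can be illustrated with a minimal sketch. The function name `warp_spectrum`, the warping factor `alpha`, and the use of linear interpolation between frequency bins are assumptions made for illustration only; they are not taken from any particular voice recognition system.

```python
import numpy as np

def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum.

    The output value at bin k is read from the input at position alpha * k.
    alpha > 1 compresses the spectrum toward lower frequencies,
    alpha < 1 stretches it toward higher frequencies,
    alpha == 1 leaves it unchanged.
    (Hypothetical helper; name and convention are illustrative.)
    """
    spectrum = np.asarray(spectrum, dtype=float)
    bins = np.arange(len(spectrum))
    # Positions in the original spectrum to read from, clamped to the valid range.
    source_bins = np.clip(bins * alpha, 0, len(spectrum) - 1)
    # Linear interpolation between neighbouring frequency bins.
    return np.interp(source_bins, bins, spectrum)

# Usage: a spectrum with a single peak at bin 40 of 100.
spectrum = np.zeros(100)
spectrum[40] = 1.0
compressed = warp_spectrum(spectrum, 2.0)   # peak moves to a lower bin
stretched = warp_spectrum(spectrum, 0.5)    # peak moves to a higher bin
```

In this sketch a single scalar `alpha` is the only free parameter per speaker, which is what makes this kind of normalization compact but comparatively inflexible.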
In a second method the voice signal remains unaltered. Instead, acoustic models of the voice recognition system, mostly in the form of reference data of a reference speaker, are adapted to the new speaker using a linear transformation. This method has more free parameters and hence is more flexible than vocal tract length normalization.
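As a rough illustration of this second method, the sketch below applies an affine transformation to the mean vectors of a set of reference models and estimates that transformation by least squares. The representation of the reference data as mean vectors, the function names `estimate_transform` and `adapt_means`, and the least-squares estimation are assumptions chosen for illustration; the text above does not specify how the linear transformation is obtained.

```python
import numpy as np

def estimate_transform(ref_means, speaker_means):
    """Estimate A and b so that speaker_means ~= ref_means @ A.T + b.

    ref_means, speaker_means: arrays of shape (N, D) holding paired mean
    vectors for the reference speaker and the new speaker.
    (Hypothetical helper; ordinary least squares is an assumption.)
    """
    ref_means = np.asarray(ref_means, dtype=float)
    speaker_means = np.asarray(speaker_means, dtype=float)
    # Append a column of ones so the offset b is estimated with A.
    extended = np.hstack([ref_means, np.ones((ref_means.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(extended, speaker_means, rcond=None)
    A = weights[:-1].T   # (D, D) linear part
    b = weights[-1]      # (D,) offset
    return A, b

def adapt_means(ref_means, A, b):
    """Apply the affine transformation to every reference mean vector."""
    return np.asarray(ref_means, dtype=float) @ A.T + b

# Usage with synthetic data: 10 reference means in 3 dimensions.
rng = np.random.default_rng(0)
ref = rng.normal(size=(10, 3))
true_A = np.array([[1.1, 0.0, 0.1], [0.0, 0.9, 0.0], [0.2, 0.0, 1.0]])
true_b = np.array([0.5, -0.3, 0.1])
observed = ref @ true_A.T + true_b
A, b = estimate_transform(ref, observed)
adapted = adapt_means(ref, A, b)
```

With a D-dimensional feature space this transformation has D × D + D free parameters, compared with the single warping factor of vocal tract length normalization, which is the sense in which it is more flexible.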
A disadvantage of the second method is that, while the speaker adaptation algorithm is executed, the modified reference data has to be buffered over several steps and then saved permanently. This requires a large amount of memory, which is particularly detrimental to applications on devices with restricted processor power and limited memory, such as mobile radio terminals.