1. Technical Field
The present disclosure relates to automatic speech recognition and more specifically to adaptation-specific acoustic models.
2. Introduction
Automatic speech recognition (ASR) systems adapt features to existing models. Some ASR systems adapt features with respect to just one speech segment at a time, where each segment is modeled as a mixture of Gaussian distributions, by finding the distribution in the mixture nearest to the new speech. However, if the speaker is an outlier whose speech lies close to the boundary of the overall speech statistics for that sound, and a local mixture component is already close to that speech, the change in the feature will be minimal. The minimal change leads to a relatively small overall performance increase. This approach often leads to misrecognitions, slow performance, and upset users.
The main problem in the approaches known in the art is that the model used is very rich in structure; it has many distributions in a mixture representing a single speech sound. When the ASR system performs adaptation, it matches a speech frame to the nearest distribution in the mixture and transforms the frame to match that distribution as closely as possible. This can result in the ASR system moving features away from the overall centroid for the given sound, which leads to poor ASR system and/or spoken dialog system performance.
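The behavior described above can be illustrated with a minimal one-dimensional sketch. It is not taken from any particular ASR system: the function names, the three equally weighted component means, and the per-frame shift standing in for an adaptation transform are all assumptions chosen for illustration. It shows that matching a frame to its nearest mixture component changes an outlier frame only slightly, and can move a frame farther from the overall centroid of the sound.

```python
# Illustrative sketch only: a 1-D "speech sound" modeled as a mixture of
# Gaussians, represented here by its component means and weights.
# All names and values are hypothetical.

def nearest_component(frame, means):
    """Return the mixture component mean closest to the frame."""
    return min(means, key=lambda m: abs(frame - m))


def overall_centroid(means, weights):
    """Return the weighted mean of all component means for the sound."""
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)


def adapt_to_nearest(frame, means):
    """Shift the frame fully onto its nearest component mean.

    Assumption: a simple per-frame shift stands in for the adaptation
    transform, which a real system would estimate over many frames.
    """
    return frame + (nearest_component(frame, means) - frame)


means = [-2.0, 0.0, 2.0]      # hypothetical component means
weights = [1.0, 1.0, 1.0]     # hypothetical equal mixture weights
centroid = overall_centroid(means, weights)

# An outlier frame near the boundary of the sound's statistics.
outlier = 2.5
adapted_outlier = adapt_to_nearest(outlier, means)
change = abs(adapted_outlier - outlier)          # size of the adaptation
distance_to_centroid = abs(outlier - centroid)   # how far off the frame was

# A frame that adaptation pushes away from the overall centroid.
frame = 1.5
adapted_frame = adapt_to_nearest(frame, means)
moved_away = abs(adapted_frame - centroid) > abs(frame - centroid)

print(change, distance_to_centroid, moved_away)
```

In this sketch the outlier frame moves by only 0.5 although it lies 2.5 from the centroid, and the second frame ends up farther from the centroid after adaptation than before it.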