ASR technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. A typical ASR system includes several basic elements. A microphone and an acoustic interface receive a user's speech and digitize it into acoustic data. An acoustic pre-processor parses the acoustic data into information-bearing acoustic features. A decoder then uses acoustic models to decode the acoustic features and generate several hypotheses, and can include decision logic to select a best hypothesis of subwords and words corresponding to the user's speech.
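The flow from digitized acoustic data to a best hypothesis can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `extract_features` and `decode`, the toy features (log energy and zero-crossing count per frame), and the template-distance scoring, which stands in for real acoustic models.

```python
import math

def extract_features(samples, frame_len=4):
    """Pre-processor stand-in: split digitized samples into frames and
    compute one (log-energy, zero-crossing count) vector per frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((math.log(energy + 1e-9), float(crossings)))
    return features

def decode(features, models):
    """Decoder stand-in: score every hypothesis against the features and
    return the hypotheses ordered best first (smallest distance first)."""
    scored = []
    for word, template in models.items():
        n = min(len(features), len(template))
        distance = sum(math.dist(features[i], template[i]) for i in range(n))
        scored.append((distance, word))
    scored.sort()
    return [word for _, word in scored]

# Hypothetical "acoustic models": one feature template per vocabulary word.
loud = [0.9, -0.9, 0.9, -0.9] * 2
quiet = [0.1, 0.1, 0.1, 0.1] * 2
models = {"call": extract_features(loud), "dial": extract_features(quiet)}

hypotheses = decode(extract_features(loud), models)
best_hypothesis = hypotheses[0]  # decision logic: keep the top-ranked hypothesis
```

A production decoder would score subword units with statistical acoustic models rather than whole-word templates; the sketch only mirrors the ordering of the stages.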
In one implementation, vehicle telecommunications devices are equipped with voice dialing features to initiate a telecommunication session. Such voice dialing features rely on ASR technology to detect discrete speech such as a spoken command or spoken control words. For example, a user can initiate a phone call using an ASR-equipped telephone by speaking a command such as “Call” and then speaking the digits of a telephone number to be dialed. Ideally, the ASR system performs well regardless of the particular user, the user's dialect, the user's gender, and any ambient noise in the environment in which the ASR system is used.
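As a rough illustration of the voice dialing example, the sketch below turns a sequence of already-recognized words into a digit string. The function name `parse_voice_dial`, the `DIGIT_WORDS` table, and the rule that anything other than “Call” followed by digit words is rejected are all assumptions for illustration, not a description of any particular product.

```python
# Hypothetical mapping from recognized digit words to dialable characters.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}

def parse_voice_dial(tokens):
    """Return the digit string to dial, or None when the recognized words
    are not a 'call' command followed by at least one digit word."""
    if not tokens or tokens[0].lower() != "call":
        return None
    digits = []
    for token in tokens[1:]:
        digit = DIGIT_WORDS.get(token.lower())
        if digit is None:
            return None  # unrecognized word: reject rather than guess
        digits.append(digit)
    return "".join(digits) or None

number = parse_voice_dial(["Call", "five", "five", "five", "one", "two", "one", "two"])
```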
ASR systems typically include ASR adaptation routines that attempt to train the ASR system for better performance despite differences in user, user gender, user dialect, or environmental conditions. Using model adaptation techniques, acoustic models are transformed with an adaptation parameter to better match incoming acoustic feature vectors. Conversely, using run time adaptation (RTA) techniques, acoustic feature vectors are transformed with an adaptation parameter to better match acoustic models. Conventional ASR adaptation routines are initialized with default identity matrix parameters, which are independent of user or environmental characteristics. As a result, conventional ASR adaptation often requires users to repeat training utterances many times before the adaptation parameters are trained to the particular user and to ambient environmental characteristics. Such repetition can frustrate users.
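The two adaptation directions can be contrasted with a small sketch. It assumes, purely for illustration, that the adaptation parameter is an affine transform (a matrix `A` plus a bias `b`) and that only the bias is re-estimated from training utterances; the helper names `identity`, `apply_transform`, and `estimate_bias` are hypothetical.

```python
def identity(n):
    """Default identity-matrix parameter: carries no user or environment information."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def apply_transform(A, b, v):
    """Apply the affine adaptation parameter: returns A @ v + b."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) + b[i] for i in range(len(A))]

def estimate_bias(feature_vectors, model_mean):
    """Toy estimate: the offset that moves the mean of the feature vectors onto the model mean."""
    n = len(feature_vectors)
    return [model_mean[d] - sum(v[d] for v in feature_vectors) / n
            for d in range(len(model_mean))]

features = [[1.0, 2.0], [3.0, 4.0]]   # acoustic feature vectors from training utterances
model_mean = [2.5, 2.0]               # mean of one acoustic model

# Conventional initialization: identity matrix and zero bias change nothing.
A, b0 = identity(2), [0.0, 0.0]
unchanged = apply_transform(A, b0, features[0])

# Run time adaptation: transform the feature vectors toward the model.
b_rta = estimate_bias(features, model_mean)
adapted_feature = apply_transform(A, b_rta, features[0])

# Model adaptation: transform the model toward the feature vectors instead.
b_model = [-x for x in b_rta]
adapted_model_mean = apply_transform(A, b_model, model_mean)
```

Because the starting point is the identity transform, all user-specific and environment-specific information has to come from the training utterances, which is consistent with the repetition burden described above.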