A speech recognition engine is a speech-controlled interface that provides a useful, and sometimes improved, means of controlling existing computer applications and of inputting data to them. An essential part of a speech recognition engine is a speech model comprising a language model (words) and an acoustic model (basic sound units). Words in the language model are formed from combinations of the basic sound units defined in the acoustic model. Each basic sound unit represents the electronic characteristics of speech input over a sample period. A speech recognition engine receives speech samples and matches them against the basic sound units in the acoustic model. The speech recognition engine then calculates the most likely words from the language model based on the matched basic sound units.
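By way of illustration only, the two-stage matching described above might be sketched as follows. The feature vectors, the sound unit labels, the word list, and the function names match_unit and recognize are all hypothetical and chosen for brevity; a practical engine would use statistical models rather than the nearest-neighbour comparison assumed here.

```python
import math

# Hypothetical acoustic model: each basic sound unit is represented by a
# reference feature vector (a stand-in for its electronic characteristics).
ACOUSTIC_MODEL = {
    "k": (1.0, 0.0),
    "ae": (0.0, 1.0),
    "t": (1.0, 1.0),
    "b": (-1.0, 0.0),
}

# Hypothetical language model: each word is a sequence of basic sound units.
LANGUAGE_MODEL = {
    "cat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "at": ("ae", "t"),
}


def match_unit(sample):
    """Return the basic sound unit whose reference vector is closest to the sample."""
    return min(ACOUSTIC_MODEL, key=lambda unit: math.dist(sample, ACOUSTIC_MODEL[unit]))


def recognize(samples):
    """Match each sample to a sound unit, then pick the best-scoring word."""
    units = tuple(match_unit(sample) for sample in samples)

    def score(word):
        pronunciation = LANGUAGE_MODEL[word]
        # Count position-wise agreements and penalise any length mismatch.
        agreements = sum(a == b for a, b in zip(pronunciation, units))
        return agreements - abs(len(pronunciation) - len(units))

    return max(LANGUAGE_MODEL, key=score)


if __name__ == "__main__":
    print(recognize([(0.9, 0.1), (0.1, 0.9), (0.9, 1.1)]))
```

The first stage corresponds to matching speech samples with basic sound units, and the second to calculating the most likely word from the matched units.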
Two types of speech recognition are distinguished in this specification: personal speech recognition and generic speech recognition. A personal speech recognition system, for instance a mobile phone speech recognition system, is characterized in that the personal speech model used is specific to the user and has been adapted by the user through training. Initial training is performed during a first use of the personal system, and training continues during normal use of the personal system. A personal speech model comprises unique sets of electronic characteristics for the acoustic model and a unique language model for the words formed from combinations of the unique basic sound units. A shared speech recognition system, for example one for a car or a telephony system, uses a generic speech model comprising averages of language and acoustic models collected from a large sample of users. Generic speech recognition can use generic speech models because it typically has more powerful memory and processing resources at its disposal and can store and process much greater volumes of data than a personal system.
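As a minimal sketch of how a generic acoustic model might be formed as an average over users, the following assumes that each user's acoustic model is a mapping from a sound unit label to a fixed-length feature vector. That representation and the function name build_generic_acoustic_model are assumptions made for illustration; averaging of language models is not shown.

```python
from statistics import fmean


def build_generic_acoustic_model(personal_models):
    """Average each sound unit's feature vector across all users that define it.

    personal_models is a list of dicts mapping a sound unit label to a tuple
    of floats (a hypothetical representation of its electronic characteristics).
    """
    generic = {}
    # Collect every sound unit label that appears in at least one user's model.
    all_units = set().union(*personal_models)
    for unit in sorted(all_units):
        vectors = [model[unit] for model in personal_models if unit in model]
        # Average dimension by dimension over the contributing users.
        generic[unit] = tuple(fmean(dimension) for dimension in zip(*vectors))
    return generic


if __name__ == "__main__":
    users = [
        {"k": (1.0, 0.0), "t": (1.0, 1.0)},
        {"k": (3.0, 2.0)},
    ]
    print(build_generic_acoustic_model(users))
```

A sound unit defined by only one user is carried over unchanged, which is one possible design choice among several.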
A known generic speech recognition system in a telephony interactive voice response (IVR) system stores personal speech models, where each personal speech model is selected based on the telephone number of the user and is trained by that user. However, training is not a desired feature of IVRs, because telephony users are less technically tolerant than personal speech recognition users and demand seamless speech recognition or no speech recognition at all. Moreover, telephony speech recognition users would not like to train both a personal speech model on a desktop and a generic speech model on a shared IVR.
It would be advantageous to have a generic speech model in an IVR that benefits from personal speech models.