The present invention relates to maintaining and supplying a plurality of speech models. More specifically, the present invention relates to building up a pervasive voice interface.
Speech recognition converts spoken words to text. The term commonly refers to technology that can recognize speech without being targeted at a single speaker, such as a call system that can recognize arbitrary voices. Speech recognition applications include voice user interfaces such as voice dialing, call routing, appliance control, search, data entry and preparation of structured documents. To recognize speech, a speech recognition engine typically requires a speech model, which includes two types of files. The first is an acoustic model, which can be created by taking audio recordings of speech and their transcriptions and compiling them into a statistical representation of the sounds that make up each word. The second is a language model or a grammar file. A language model is a file containing probabilities of sequences of words. A grammar file is typically a much smaller file containing sets of predefined combinations of words.
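By way of illustration only, the difference between a language model and a grammar file can be sketched as follows. The sketch assumes a toy four-sentence corpus, a simple bigram estimate by relative frequency, and a grammar held as a fixed set of phrases. The names `train_bigram_model`, `sequence_probability`, `GRAMMAR` and `grammar_accepts` are hypothetical and are not part of any particular speech recognition engine or file format.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | word) by relative frequency over a toy corpus."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers let the model score starts and ends.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prev, nexts in pair_counts.items()
    }

def sequence_probability(model, sentence):
    """Multiply bigram probabilities; unseen word pairs score 0.0."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob

# A grammar, by contrast, is a small fixed set of allowed word combinations
# (hypothetical phrases chosen for this example).
GRAMMAR = {"call home", "call office", "play music"}

def grammar_accepts(phrase):
    """Return True only if the phrase is one of the predefined combinations."""
    return phrase.lower().strip() in GRAMMAR

corpus = ["call home", "call the office", "play music", "call home now"]
model = train_bigram_model(corpus)
```

In this sketch the language model assigns a graded probability to any word sequence, whereas the grammar gives only a yes or no answer for a closed list of phrases, which is consistent with a grammar file typically being much smaller.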
Since the early 1970s, speech recognition technology has gradually matured in some applications, ranging from server-based to mobile usage. However, a major hurdle to pervasive speech recognition is the lack of a systematic and economical methodology for generating, storing, querying, and delivering speech recognition models on demand and according to specific conditions. Some standards and applications attempt to cover broad use situations, such as the distributed speech recognition (DSR) standard of the European Telecommunications Standards Institute (ETSI). Unfortunately, these standards are based on specific infrastructures, do not consider universal usage, and constrain how speech recognition models are used, so that existing approaches can only thrive in specific domains. For example, ETSI DSR can only be used in telephony, with end points functioning only as speech input/output devices.