The key to making effective speech recognition systems is the creation of acoustic models, grammars and language models that enable the underlying speech recognition technology to reliably recognise what is being said and to make sense of, or understand, the speech given the context of the speech sample within the application. Creating these acoustic models, grammars and language models involves collecting a database of speech samples (also commonly referred to as voice samples) that represent the way speakers interact with a speech recognition system. Each speech sample in the database must be segmented and labelled into its constituent words or phonemes. All instances of a common constituent part across all speakers (for example, every speaker saying the word “two”) are then compiled and processed to create the word (or phoneme) acoustic model for that constituent part. In large-vocabulary, phoneme-based systems, the process must also be repeated to create the language- and accent-specific models and grammar for the target linguistic market. Typically, around 1,000 to 2,000 examples of each word or phoneme (from each gender) are required to produce an acoustic model that can accurately recognise speech.
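The compilation step described above can be illustrated with a minimal sketch. It assumes the speech samples have already been segmented and labelled, and it only groups the labelled segments by constituent part and gender and reports which groups fall short of the lower figure of 1,000 examples. The names `Segment`, `group_by_unit`, `units_needing_data` and `MIN_EXAMPLES` are hypothetical and chosen for illustration; the sketch does not train an acoustic model.

```python
from collections import defaultdict
from dataclasses import dataclass

# Lower bound quoted in the text: roughly 1,000 to 2,000 examples
# of each word or phoneme, per gender.
MIN_EXAMPLES = 1000


@dataclass(frozen=True)
class Segment:
    """One labelled constituent part cut from a speech sample (hypothetical record)."""
    speaker_id: str
    gender: str     # e.g. "f" or "m"
    label: str      # the word or phoneme this segment contains
    start_s: float  # segment start within the sample, in seconds
    end_s: float    # segment end within the sample, in seconds


def group_by_unit(segments):
    """Compile all segments sharing the same (label, gender) across speakers."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg.label, seg.gender)].append(seg)
    return dict(groups)


def units_needing_data(groups, minimum=MIN_EXAMPLES):
    """Return {(label, gender): shortfall} for groups below the minimum count."""
    return {
        key: minimum - len(segs)
        for key, segs in groups.items()
        if len(segs) < minimum
    }


if __name__ == "__main__":
    demo = [
        Segment("spk1", "f", "two", 0.00, 0.40),
        Segment("spk2", "f", "two", 0.10, 0.50),
        Segment("spk3", "m", "two", 0.00, 0.30),
    ]
    for key, shortfall in units_needing_data(group_by_unit(demo)).items():
        print(key, "needs", shortfall, "more examples")
```

Grouping by both label and gender mirrors the per-gender requirement stated above; a real pipeline would pass each sufficiently populated group to whatever acoustic model training tool is in use.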
Developing speech recognition systems for any linguistic market is a data-driven process. Without speech data representative of the language and accent specific to that market, the appropriate acoustic, grammar and language models cannot be produced. It follows that obtaining the necessary speech data (assuming it is available) and creating the appropriate language- and accent-specific models for a new linguistic market can be particularly time-consuming and very costly.
It would be advantageous to provide a speech recognition system that could be automatically configured for any linguistic market in a cost-effective manner.