Speech recognition performance is often suboptimal when a large grammar search space is involved, as in a voice search task that covers a large number of business names, web search queries, voice dialing requests, and the like. Such tasks commonly exhibit three main shortcomings: long recognition latency, poor recognition accuracy, and insufficient grammar coverage.
One existing mobile voice search application uses a nationwide business listing grammar plus a locality grammar in a first stage. In a second stage, it re-recognizes the same utterance using a business listing grammar specific to the locality determined in the first stage. This approach does not address the latency issue, but it can improve coverage and accuracy in very specific situations. Another approach attempts to reduce word error rate by voting among the outputs of distinct recognizers at the sub-utterance level. This approach and its extensions generally assume that each recognizer attempts recognition with a complete grammar for the entire task.
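To make the idea of sub-utterance voting concrete, the following is a minimal sketch of a word-level majority vote, similar in spirit to ROVER-style combination. It is not the specific prior-art method referred to above. It assumes the outputs of the recognizers have already been aligned into equal-length slots, with a hypothetical placeholder `NULL` marking slots where a recognizer produced no word. The alignment step itself, which real systems perform with dynamic programming, is omitted. The function name `vote_aligned` is illustrative only.

```python
from collections import Counter
from typing import List

NULL = "@"  # hypothetical placeholder: this recognizer emitted no word in the slot


def vote_aligned(hypotheses: List[List[str]]) -> List[str]:
    """Combine pre-aligned recognizer outputs by majority vote per slot.

    Assumes every hypothesis has the same length (already aligned).
    Ties are broken in favor of the earliest recognizer in the list.
    """
    if not hypotheses:
        return []
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("hypotheses must be aligned to the same length")

    result: List[str] = []
    for slot in zip(*hypotheses):  # one tuple of candidate words per aligned position
        counts = Counter(slot)
        # max() returns the first maximal item, so ties go to the earliest recognizer
        best = max(slot, key=lambda word: counts[word])
        if best != NULL:  # a winning NULL means the slot is dropped from the output
            result.append(best)
    return result


if __name__ == "__main__":
    hyps = [
        ["call", "joe's", "pizza", NULL],
        ["call", "joe's", "plaza", NULL],
        ["call", "jo", "pizza", "now"],
    ]
    print(vote_aligned(hyps))  # ['call', "joe's", 'pizza']
```

Here each recognizer contributes one vote per slot. A recognizer that misrecognizes a single word can be outvoted locally without discarding the rest of its output, which is the benefit of voting below the whole-utterance level.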