Conventional large vocabulary automatic speech recognition (ASR) systems may be well suited for recognizing natural language speech. For example, ASR systems that utilize statistical language models trained on a large corpus of natural language speech may be well suited to accurately recognize a wide range of speech of a general nature and may therefore be suitable for use with general-purpose recognizers. However, such general-purpose ASR systems may not be well suited to recognize speech containing domain-specific content. Specifically, domain-specific content that includes words from a domain-specific vocabulary, such as jargon, technical language, addresses, points of interest, proper names (e.g., a person's contact list), or media titles (e.g., a database of song titles and artists, movies, television shows), presents difficulties for general-purpose ASR systems that are trained to recognize natural language.
Domain-specific vocabularies frequently include words that do not appear in the vocabulary of general-purpose ASR systems, include words or phrases that are underrepresented in, or absent from, the training data on which such general-purpose ASR systems were trained, and/or are large relative to a natural language vocabulary. As a result, general-purpose ASR systems trained to recognize natural language may perform unsatisfactorily when recognizing domain-specific content that includes words from one or more domain-specific vocabularies.
Special-purpose ASR systems are often developed to recognize domain-specific content. As one example, a speech-enabled navigation device may utilize an automatic speech recognizer specifically adapted to recognize geographic addresses and/or points of interest. As another example, a special-purpose ASR system may utilize grammars created to recognize speech from a domain-specific vocabulary and/or may be otherwise adapted to recognize speech from the domain-specific vocabulary. However, special-purpose ASR systems may be unsuitable for recognizing natural language speech and, therefore, their applicability may be relatively limited in scope.
Automatically recognizing mixed-content speech that includes both natural language and domain-specific content, therefore, presents significant challenges to conventional ASR systems. As an illustration, the spoken utterance “Please provide me with directions to 16 Quinobequin Road in Newton, Mass.” includes natural language portions (“Please provide me with directions to” and “in”) and domain-specific portions (“16 Quinobequin Road” and “Newton, Mass.”), and may not be accurately recognized using conventional ASR techniques. In particular, a general-purpose ASR system (e.g., a large vocabulary recognizer trained to recognize natural language) may have difficulty recognizing words in the domain-specific portions. Similarly, a special-purpose ASR system (e.g., a recognizer adapted to recognize particular domain-specific content) may be unable to accurately recognize the natural language portions.
Attempts at using multiple recognizers, each independently recognizing the input speech, and then combining the recognition results in a post-processing step generally produce unsatisfactory results. One reason is that a recognizer generally performs poorly on speech content for which it is not well suited. Another is that the presence of such speech content will typically degrade the recognizer's performance on the speech content for which it is adapted (e.g., the presence of domain-specific content will generally degrade the performance of a general-purpose ASR system in recognizing natural language, and the presence of natural language will generally degrade the performance of a special-purpose ASR system in recognizing associated domain-specific content). Moreover, it may be difficult to correctly determine which portions of recognition results from the multiple recognizers should be selected to produce the final recognition result.