Speech recognition may be defined as the process of converting a spoken waveform into a textual string of words, for example, a sentence expressed in the English language.
The process of speech recognition may be classified into three major phases: a front-end phase, an acoustic modeling phase, and a language modeling phase. In the front-end phase, “raw” speech signals are spectrally analyzed for salient features and converted into a sequence of digitally encoded feature vectors. In the acoustic modeling phase, the sequence of feature vectors is examined to extract phone sequences (e.g., simple vowel or consonant sounds) using knowledge about acoustic environments, gender and dialect differences, and phonetics. In the language modeling phase, the phone sequences are converted into corresponding word sequences using knowledge of what constitutes a possible word, what words are likely to occur, and in what sequence.
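As an illustration of the front-end phase only, the following minimal sketch splits a waveform into overlapping frames and converts each frame into a small feature vector. The choice of features (log energy and zero-crossing rate), the function names, and the frame sizes are assumptions made for this example; practical front ends typically use richer spectral features, and nothing here is taken from a particular system.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Return a 2-dimensional feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def front_end(samples, frame_len=400, hop=160):
    """Convert a waveform into a sequence of feature vectors, one per frame.

    The defaults correspond to 25 ms frames every 10 ms at an assumed
    sampling rate of 16 kHz.
    """
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]

# Usage: one second of a synthetic 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = front_end(tone)
```

The resulting sequence of vectors is the kind of input that the acoustic modeling phase would then examine; the acoustic and language modeling phases themselves are not sketched here.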
Despite recent advances, it is believed that speech recognition systems have not reached the level of sophistication possessed by humans. In particular, the complexity and intricacies of language, combined with varied acoustic environments, pose significant challenges to realizing a truly human-like speech recognition system. For example, a speech recognition system must contend with the lexical and grammatical complexity and variations of spoken language as well as the acoustic uncertainties of different accents and speaking styles. Therefore, to reduce the complexity and limit the uncertainties, speech recognition systems may be built on a small scale for specific domain applications, such as an airline flight/travel information system (ATIS) or a telephone directory information system.
To construct a high-quality speech recognition system, a large amount of domain data covering a variety of linguistic phenomena may be required to guide the system's interpretation of speech and allow it to determine the appropriate action. For example, it is believed that a speech recognition system supporting a medium-sized application-specific domain of approximately 2,000 words may require 20,000 “in-domain” sentences to be collected to construct a proper language training model. The data collection for such a system may be tedious, time-consuming, and expensive, and may neglect important aspects of speech, such as speaking style or idiomatic usage. Furthermore, if fewer in-domain sentences are collected than required, a “data sparseness” issue may arise, wherein the system lacks enough data to sufficiently cover all the varieties of possible expressions used in that particular domain. Hence, training a speech recognition system to support a new application domain may require a significant amount of time and effort due to the amount of data that may need to be collected.
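The data sparseness issue may be illustrated with a minimal sketch that measures what fraction of the word pairs (bigrams) in new sentences were already observed in the collected sentences. The sentences, the function names, and the use of bigram coverage as the measure are all assumptions chosen for this example; they are not data or metrics from an actual system.

```python
def bigrams(sentence):
    """Return the word bigrams of a sentence, padded with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

def bigram_coverage(train_sentences, test_sentences):
    """Fraction of test bigram tokens that also occur in the training sentences."""
    seen = {bg for s in train_sentences for bg in bigrams(s)}
    test_bgs = [bg for s in test_sentences for bg in bigrams(s)]
    if not test_bgs:
        return 0.0
    return sum(bg in seen for bg in test_bgs) / len(test_bgs)

# Hypothetical in-domain sentences, invented for illustration.
small_corpus = ["show flights to boston"]
larger_corpus = ["show flights to boston", "show flights to denver"]
new_requests = ["show flights to boston", "list fares to denver"]

coverage_small = bigram_coverage(small_corpus, new_requests)
coverage_larger = bigram_coverage(larger_corpus, new_requests)
```

In this toy setting, coverage rises as more in-domain sentences are collected, yet expressions such as "list fares" remain unseen, which is the kind of gap that a sparse collection may leave.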
Various techniques may exist to synthesize data for speech dialog systems. As referred to in Hunt, A., and Black, A. “Unit Selection in a concatenative speech synthesis system using a large speech database” Proc. of ICASSP-96 (1996), Atlanta, Ga., speech may be synthesized by first setting up a target specification, in which the string of phonemes required to synthesize the speech is defined together with prosodic features, and then selecting suitable phonetic units from a database for concatenation. As referred to in Weng, F. L., Stolcke, A., and Cohen, M. “Language Modeling for Multilingual Speech Translation” printed in M. Rayner et al. (eds.) Spoken Language Translator, Cambridge University Press (2000) 281, a pre-existing grammar may be used to generate phrase chunks (i.e., complete or partial speech utterances), which may then be interpolated with a small amount of in-domain data, e.g., a few thousand sentences. As referred to in Brown, P. F. et al. “Class-Based n-gram Models of Natural Language” Association for Computational Linguistics 18(4) (1992) pp. 467–479, the problem of predicting a word from previous words in a sample of text may be addressed via n-gram models based on classes of words. The n-gram models may utilize statistical algorithms to assign words to classes based on the frequency of their co-occurrence with other words. The word classes may be used in language modeling to support a wide range of applications, such as speech recognition or grammar correction. It is believed, however, that data for new domains may not be readily generated via this approach, because a low-order n-gram model may not contain enough long-distance information and a high-order n-gram model may require a large amount of training data that may not be available or feasible to collect.
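The class-based approach may be illustrated with a minimal sketch of a class bigram model, in which the probability of a word given the previous word is approximated as P(class | previous class) multiplied by P(word | class). In this sketch the word classes are assigned by hand and the probabilities are plain relative frequencies without smoothing; both are simplifying assumptions, since Brown et al. induce the classes statistically. The sentences, class labels, and function names are hypothetical.

```python
from collections import Counter

def train_class_bigram(sentences, word_to_class):
    """Estimate P(w | w_prev) ~= P(c | c_prev) * P(w | c) by relative frequency."""
    class_bigrams = Counter()  # count of (previous class, current class) pairs
    class_history = Counter()  # count of each class appearing as the previous class
    class_totals = Counter()   # count of each class overall
    word_counts = Counter()    # count of each word overall

    for sentence in sentences:
        words = sentence.split()
        classes = [word_to_class[w] for w in words]
        word_counts.update(words)
        class_totals.update(classes)
        for prev_c, cur_c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, cur_c)] += 1
            class_history[prev_c] += 1

    def prob(prev_word, word):
        prev_c, cur_c = word_to_class[prev_word], word_to_class[word]
        if class_history[prev_c] == 0 or class_totals[cur_c] == 0:
            return 0.0  # no evidence for this class context
        p_class = class_bigrams[(prev_c, cur_c)] / class_history[prev_c]
        p_word = word_counts[word] / class_totals[cur_c]
        return p_class * p_word

    return prob

# Hand-assigned classes and sentences, invented for illustration.
word_to_class = {"fly": "VERB", "to": "PREP", "from": "PREP",
                 "boston": "CITY", "denver": "CITY"}
sentences = ["fly to boston", "fly from denver", "fly to denver"]

prob = train_class_bigram(sentences, word_to_class)
p_unseen = prob("from", "boston")  # "from boston" never occurs in the sentences
```

Because "from" and "to" share a class, the pair "from boston" receives a nonzero probability even though it never occurs in the sentences, which is how word classes may reduce the amount of data needed. The model still conditions only on the immediately preceding class, so it carries no long-distance information.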