The broad goal of speech recognition technology is to create devices that can receive spoken information and act appropriately upon that information. In order to maximize benefit and universal applicability, speech recognition systems (SRSs) should be capable of recognizing continuous speech, and should be able to recognize multiple speakers with possibly diverse accents, speaking styles, and different vocabularies and grammatical tendencies. Effective SRSs should also be able to recognize poorly articulated speech, and should have the ability to recognize speech in noisy environments.
Models of sub-word sized speech units form the backbone of virtually all SRSs. Many systems use phonemes to define the dictionary, but some SRSs use allophones. A phoneme is the basic theoretical unit for describing how speech conveys linguistic meaning. As such, the phonemes of a language comprise a minimal theoretical set of units that are sufficient to convey all meaning in the language; this is to be compared with the actual sounds that are produced in speaking, which speech scientists call allophones. Each phoneme can be considered to be a code that consists of a unique set of articulatory gestures. Once a speaker has formed a thought to be communicated to a listener, the speaker constructs a phrase or sentence by choosing from a collection of phonemes, that is, a finite set of mutually exclusive sounds. If speakers could exactly and consistently produce these phoneme sounds, speech would amount to a stream of discrete codes. However, because of many different factors including, for example, accents, gender, and coarticulatory effects, every phoneme has a variety of acoustic manifestations in the course of flowing speech. Thus, from an acoustical point of view, the phoneme actually represents a class of sounds that convey the same meaning.
A central problem in speech recognition is equipping the speech recognition system with the appropriate language constraints. Whether phones, phonemes, syllables, or words are viewed as the basic unit of speech, language (or linguistic) constraints are generally concerned with how these fundamental units may be concatenated, in what order, in what context, and with what intended meaning. For example, if a speaker is asked to voice a phoneme in isolation, the phoneme will be clearly identifiable in the acoustic waveform. However, when spoken in context, phoneme boundaries become difficult to label because of the physical properties of the speech articulators. Since the vocal tract articulators consist of human tissue, their positioning from one phoneme to the next is executed by the movement of the muscles that control them. As such, there is a period of transition between phonemes that can modify the manner in which a phoneme is produced. Therefore, associated with each phoneme is a collection of allophones, or variations on phones, that represent acoustic variations of the basic phoneme unit. Allophones represent the permissible freedom allowed within a particular language in producing a phoneme, and this flexibility is dependent on the phoneme as well as on the phoneme position within an utterance.
Typical modern speech recognition systems operate on the principle that, in some form or another, they maximize the a posteriori probability of some sequence of words W given some acoustic evidence A, where the probability is denoted Pr(W|A). Using Bayes' rule, and noting that Pr(A) does not depend on W, this amounts to maximizing Pr(A|W) × Pr(W), where Pr(A|W) is provided by a specified acoustic model and Pr(W) is provided by a specified language model. It should be noted that this formulation can be extended to other fields, such as handwriting recognition, by changing Pr(A|W) appropriately; the language model component need not change since it characterizes the language itself. Therefore, language modeling plays a central role in the recognition process, where it is typically used to constrain the acoustic analysis, guide the search through various partial text hypotheses, and contribute to the determination of the final transcription.
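The decision rule described above can be illustrated with a minimal sketch. It assumes the acoustic model and the language model have already scored a small set of candidate transcriptions; the function name `map_decode`, the candidate strings, and all probability values are hypothetical and chosen only for illustration. Scores are combined in the log domain, which is equivalent to maximizing the product of the acoustic and language model probabilities.

```python
import math

def map_decode(acoustic_loglik, lm_logprob):
    """Return the candidate W that maximizes log Pr(A|W) + log Pr(W)."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

# Hypothetical acoustic scores Pr(A|W) for two candidate transcriptions
# of the same audio; the second candidate fits the audio slightly better.
acoustic_loglik = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}

# Hypothetical language model scores Pr(W); the first candidate is a far
# more plausible word sequence.
lm_logprob = {
    "recognize speech": math.log(0.01),
    "wreck a nice beach": math.log(0.0001),
}

best = map_decode(acoustic_loglik, lm_logprob)
acoustic_only = max(acoustic_loglik, key=acoustic_loglik.get)
```

In this toy setting the acoustic evidence alone favors the second candidate, while the combined score selects the first, which is the sense in which the language model constrains the acoustic analysis.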
Two statistically-based paradigms have traditionally been exploited as language models to derive the probability Pr(W). The first one, the finite state grammar paradigm, relies on rule-based grammars, while the second one, the n-gram paradigm, involves data-driven n-grams. The finite state grammar paradigm may be based on parsing or other structural a priori knowledge of the application domain, while the n-gram paradigm reflects the probability of occurrence in the language of all possible strings of n words. Consequently, the finite state grammar paradigms are typically used for well-defined, small vocabulary applications such as command and control recognition, while the n-gram paradigms are typically applied to general large vocabulary dictation within some typically broad domain.
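As an illustration of the data-driven paradigm, the following is a minimal sketch of a bigram (n = 2) model estimated by relative frequency from a toy corpus. The function names, the `<s>` and `</s>` boundary markers, and the three training sentences are assumptions made for this example, not part of any particular system.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate Pr(w_i | w_{i-1}) by relative frequency (maximum likelihood)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

def sentence_prob(model, sentence):
    """Pr(W) as the product of bigram probabilities; unseen bigrams score 0."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= model.get((prev, cur), 0.0)
    return prob

# Hypothetical training text.
corpus = ["open the file", "open the window", "close the file"]
model = train_bigram(corpus)

p_seen = sentence_prob(model, "open the file")
p_novel = sentence_prob(model, "close the window")   # never seen as a whole
p_unseen_bigram = sentence_prob(model, "file the open")
```

The sentence "close the window" does not occur in the toy corpus, yet it receives a nonzero probability because each of its bigrams was observed. A string containing an unobserved bigram receives probability zero under this unsmoothed estimate.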
The reason for this dichotomy is well understood. In command and control applications, the number of words used for system command and control is typically limited, as are the scope and complexity of the formulations. Therefore, it is straightforward to build a finite state grammar-based model to constrain the domain accordingly. In contrast, in a dictation application, potentially anything could be uttered, with an arbitrary degree of complexity, making reliance on a finite state grammar-based model impractical. In a dictation application it therefore makes sense to exploit the statistical patterns of the language as a knowledge source, assuming a sufficient amount of training text, or data, is available.
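For comparison, a finite state grammar for a small command and control domain can be sketched as a transition table. The state names, the vocabulary, and the `in_grammar` helper below are hypothetical; the sketch only accepts or rejects an utterance and does not assign probabilities.

```python
# Hypothetical command grammar: state -> {word: next state}.
GRAMMAR = {
    "START": {"open": "OBJECT", "close": "OBJECT"},
    "OBJECT": {"the": "NOUN"},
    "NOUN": {"file": "END", "window": "END"},
}

def in_grammar(utterance, grammar=GRAMMAR, start="START", final="END"):
    """Return True if the word sequence follows a path from start to final."""
    state = start
    for word in utterance.split():
        # A missing state or word means no transition exists.
        state = grammar.get(state, {}).get(word)
        if state is None:
            return False
    return state == final

accepted = in_grammar("close the window")
rejected = in_grammar("open file")
```

With only a handful of words and one sentence pattern, every allowed formulation can be enumerated by hand, which is why this approach suits small, well-defined domains and becomes unwieldy for open-ended dictation.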
While the command and control and the dictation applications cover extreme ends of the speech recognition spectrum, there is an important intermediate case of a large vocabulary interaction, in which the scope and complexity of the utterances are greater than in traditional command and control, while still more constrained, for example, by a dialog model, than in traditional dictation. This situation is likely to become pervasive in future SRS user interfaces because, as the size of the vocabulary increases, finite state grammar-based models become less and less effective. There are several reasons for the decreasing effectiveness of the finite state grammar-based models. First, from a purely algorithmic perspective, the larger the grammar, the fewer constraints it offers, and therefore the lower the accuracy of the speech recognition system. Furthermore, from an SRS user's point of view, the more complex the formulation allowed, the more difficult it is to remember exactly which variations are in-grammar and which are not. As a result, in a typical SRS application that uses a finite state grammar-based model, accuracy degrades significantly if the number of language items is greater than approximately 100. This is an order of magnitude short of what a typical dialog system might require in the near future.
In contrast to finite state grammar-based models, n-gram-based models have been successfully constructed for vocabulary sizes up to approximately 60,000 words. They are typically estimated on large machine-readable text databases, comprising, for example, newspaper or magazine articles in a given broad domain. However, due to the finite size of such databases, many n-word strings are encountered only infrequently, yielding unreliable SRS model parameter values or coefficients. As a result, interest has been generated in fairly sophisticated parameter estimation and smoothing. Unfortunately, it remains extremely challenging to go beyond n ≤ 4 with currently available databases and processing power. Thus, n-gram-based models alone are inadequate to capture large-span constraints present in dialog data, even if a suitable database could be collected, stored, and processed. Consequently, there is a need for a speech recognition system using a language model that integrates a finite state grammar paradigm and an n-gram paradigm into a statistical language modeling framework so as to provide speech recognition in the intermediate case of a large vocabulary interaction.
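To make the role of smoothing concrete, the following sketch applies add-k (Laplace) smoothing to bigram counts from a toy token stream. Add-k smoothing is used here only because it is the simplest scheme to show; it is not one of the more sophisticated estimation methods alluded to above. The function name, the token stream, and the choice k = 1 are assumptions for illustration.

```python
from collections import Counter

def add_k_bigram_prob(prev, cur, bigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k estimate of Pr(cur | prev); unseen bigrams get a small nonzero mass."""
    return (bigram_counts[(prev, cur)] + k) / (context_counts[prev] + k * vocab_size)

# Hypothetical training text, treated as one token stream.
tokens = "open the file close the file open the window".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])   # how often each word appears as a context
vocab = sorted(set(tokens))
vocab_size = len(vocab)

p_seen = add_k_bigram_prob("the", "file", bigram_counts, context_counts, vocab_size)
p_unseen = add_k_bigram_prob("the", "close", bigram_counts, context_counts, vocab_size)

# The smoothed estimates for a given context still sum to one over the vocabulary.
total = sum(
    add_k_bigram_prob("the", w, bigram_counts, context_counts, vocab_size)
    for w in vocab
)
```

The bigram "the close" never occurs in the toy data, so a relative-frequency estimate would assign it zero probability; the smoothed estimate reserves some probability for it at the expense of the observed bigrams.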