In the art of language processing, and particularly in its application, exemplified herein, to systems for recognizing natural language in spoken form, the ultimate objective will normally be the recognition and/or classification of an unknown pattern, such as a speech fragment, through use of a set of decision rules applied to a comparison of the unknown pattern with stored parameters representative of patterns of the same general class. In general, the steps in such a recognition/classification process can be described as:
performing feature extraction for a training set of patterns, thereby providing a parameterized description for each pattern in such a training set, and providing a similar feature extraction for an unknown pattern;
using a set of labelled training set patterns to infer decision rules, thereby defining a mapping from an unknown object (or pattern) to a known pattern in the training set; and
carrying out that mapping to define the recognition or classification of the unknown pattern.
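By way of illustration only, the three steps above can be sketched in Python as follows. The feature extractor (mean and range of a numeric sequence), the nearest-centroid decision rule, and the toy training data are all assumptions made for this sketch; they are not drawn from any particular recognizer.

```python
import math
from collections import defaultdict


def extract_features(pattern):
    """Hypothetical feature extractor: (mean, range) of a numeric sequence."""
    return (sum(pattern) / len(pattern), max(pattern) - min(pattern))


def infer_decision_rule(labelled_patterns):
    """Infer a nearest-centroid rule from (label, pattern) pairs.

    Returns a dict mapping each label to the mean feature vector of its
    training patterns.
    """
    grouped = defaultdict(list)
    for label, pattern in labelled_patterns:
        grouped[label].append(extract_features(pattern))
    return {
        label: tuple(sum(column) / len(column) for column in zip(*features))
        for label, features in grouped.items()
    }


def classify(centroids, unknown_pattern):
    """Map an unknown pattern to the label with the closest centroid."""
    features = extract_features(unknown_pattern)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Toy labelled training set (illustrative values only).
training = [
    ("low", [1.0, 2.0, 1.5]),
    ("low", [0.5, 1.0, 1.5]),
    ("high", [8.0, 9.0, 10.0]),
    ("high", [9.0, 9.5, 11.0]),
]
rule = infer_decision_rule(training)
label = classify(rule, [1.0, 1.2, 0.8])
```

A practical speech recognizer would replace the two-element feature vector with acoustic features and the centroid rule with statistical models, but the division into feature extraction, rule inference, and mapping is the same.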
In a system for recognizing spoken words or phrases from a natural language, a large corpus of training speech is segmented into elemental speech units such as phones, and multiple instances of each phone are used to develop an acoustic model of the phone. Ideally, the corpus of training speech will be sufficiently large that virtually all variants of each phone that may be encountered by the recognizer are included in the model for the phone. Each model will be defined in terms of the previously described parameterized description for the modeled speech sound.
The architecture for implementation of the modeling of sound patterns in such a speech recognizer system has become largely standardized. Specifically, the various levels of linguistic information--i.e., the language (or grammar) model, the word pronunciation (or lexicon) models, and the phone models--are represented by a cascade of networks linked by a single operation of substitution. Each network at a given level accepts sequences of symbols. Each of those symbols is modeled by a network at the level immediately below. For example, the pronunciation of each word in a word sequence accepted by the language model will be modeled by a phone string, or, more generally, by a phonetic network encoding alternative pronunciations. Each phone is itself modeled, typically by a Hidden Markov Model ("HMM"), to represent possible sequences of acoustic observations in realizations of the phone. Thus, to create the recognizer network, a phone string (or phonetic network) will be substituted for the corresponding word label in the language model, and an HMM will be substituted for each phone label in the lexicon model.
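The substitution operation that links the levels of this cascade can be sketched as follows. This is a simplified illustration under stated assumptions: each network is flattened to an explicit list of alternative sequences rather than a finite-state network, each HMM is reduced to a left-to-right list of three state names, and the words, pronunciations, and helper names (`phone_hmm`, `substitute`, `expand_cascade`) are hypothetical.

```python
from itertools import product

# Toy language model: the word sequences it accepts (illustrative only).
LANGUAGE_MODEL = [["call", "home"], ["call", "data"]]

# Toy lexicon: each word maps to one or more alternative phone strings.
LEXICON = {
    "call": [["k", "ao", "l"]],
    "home": [["hh", "ow", "m"]],
    "data": [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]],
}


def phone_hmm(phone, n_states=3):
    """Stand-in for a phone HMM: a left-to-right list of state names."""
    return [f"{phone}_{i}" for i in range(1, n_states + 1)]


def substitute(sequence, models):
    """Replace every symbol in `sequence` by each of its lower-level
    alternatives, returning all resulting lower-level sequences."""
    expansions = []
    for choice in product(*(models[symbol] for symbol in sequence)):
        expansions.append([unit for part in choice for unit in part])
    return expansions


def expand_cascade(language_model, lexicon):
    """Expand word sequences to phone strings, then to HMM state sequences."""
    state_sequences = []
    for words in language_model:
        for phones in substitute(words, lexicon):
            hmms = {phone: [phone_hmm(phone)] for phone in phones}
            state_sequences.extend(substitute(phones, hmms))
    return state_sequences


expanded = expand_cascade(LANGUAGE_MODEL, LEXICON)
```

The same `substitute` function is applied at both levels, which reflects the point made above that a single operation links the language model, the lexicon, and the phone models.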
This architecture allows a wide range of implementations. In principle, the cascaded networks can be expanded in advance into a single network accepting sequences of the lowest-level inputs (e.g., acoustic observations) by recursively applying the substitution of a symbol by a network. However, if the networks involved are large, full expansion is not practical, since the resulting network would be too large to be computationally tractable. Instead, a typical recognizer system will use a hybrid policy in which some levels are fully expanded while others are expanded on demand. In such a recognizer, sequences of hypotheses of units at level i are assembled until they correspond to paths from an initial node to a final node in some model of a unit of level i+1, at which point that higher-level unit is hypothesized.
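The contrast between expansion in advance and expansion on demand can be illustrated with a minimal sketch. The class name `OnDemandExpander`, the three-state phone model, and the toy lexicon are assumptions made for this illustration; an actual recognizer would expand network nodes during search rather than whole words on request.

```python
class OnDemandExpander:
    """Expands a word into HMM state sequences only when first requested."""

    def __init__(self, lexicon, states_per_phone=3):
        self.lexicon = lexicon
        self.states_per_phone = states_per_phone
        self._cache = {}  # word -> list of expanded state sequences

    def expand_word(self, word):
        """Return the state sequences for `word`, expanding it if needed."""
        if word not in self._cache:
            self._cache[word] = [
                [
                    f"{phone}_{i}"
                    for phone in pronunciation
                    for i in range(1, self.states_per_phone + 1)
                ]
                for pronunciation in self.lexicon[word]
            ]
        return self._cache[word]

    def expanded_words(self):
        """Words whose lower-level models have been built so far."""
        return sorted(self._cache)


# Toy lexicon (illustrative only).
lexicon = {
    "call": [["k", "ao", "l"]],
    "home": [["hh", "ow", "m"]],
    "data": [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]],
}
expander = OnDemandExpander(lexicon)
call_states = expander.expand_word("call")
```

Only the words actually hypothesized are ever expanded, so the memory cost tracks the portion of the cascade that the search visits rather than the size of the full network.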
The hybrid arrangement described above works well so long as the combination of modeling levels can be done by substitution alone. However, one improvement in recognizer systems has generally had the effect of limiting the application of on-demand modeling where that improvement is implemented. It has been determined in recent years that the use of context-dependent units at appropriate levels of representation significantly improves the performance of such a recognition system. By its very nature, a context-dependent model, u/c (for unit u in context c), can be substituted for an instance of u only when that instance appears in context c. In prior-art recognizer systems, this constraint is addressed in one of two main ways. If the cascaded networks involved are small enough, the full cascade is expanded in advance down to the level of context-dependent units, using a specialized expansion algorithm that folds in context dependency appropriately. If full expansion is not practical, as, for instance, in large-vocabulary recognition, the standard solution is to use restricted model structures and specialized algorithms to allow on-demand combination of representation levels. A particular problem occurs at word boundaries, where context must be determined with respect to each adjacent word that can appear in that position. Here, because of the multiplicity of possible contexts for a phone at a word boundary position, simple substitution does not work. A common restriction is to allow only one-sided context-dependent models at such word boundaries. But even where full context-dependency may be implemented, the particular context-dependency type (e.g., triphonic, pentaphonic) must be built into the decoder, thereby preventing any other form of context-dependency from being used with such a recognizer system.
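The word-boundary difficulty can be made concrete with a small sketch. It assumes triphonic context, writes each context-dependent unit in the u/c form used above with c given as `left_right`, and uses hypothetical helper names (`to_triphones`, `boundary_variants`), a hypothetical silence label `sil`, and a toy lexicon.

```python
def to_triphones(phones, left_context="sil", right_context="sil"):
    """Label each phone u with its context c as 'u/left_right'."""
    padded = [left_context] + list(phones) + [right_context]
    return [
        f"{padded[i + 1]}/{padded[i]}_{padded[i + 2]}"
        for i in range(len(phones))
    ]


def boundary_variants(word, lexicon, predecessors):
    """Distinct models needed for the first phone of `word`, one for each
    final phone of a word that may precede it."""
    variants = set()
    for previous in predecessors:
        for previous_pron in lexicon[previous]:
            for pron in lexicon[word]:
                variants.add(to_triphones(pron, left_context=previous_pron[-1])[0])
    return sorted(variants)


# Toy lexicon (illustrative only).
lexicon = {
    "call": [["k", "ao", "l"]],
    "go": [["g", "ow"]],
    "home": [["hh", "ow", "m"]],
}

isolated = to_triphones(lexicon["home"][0])
variants = boundary_variants("home", lexicon, ["call", "go"])
```

Within the word, each phone has exactly one context and can be replaced by a single model. The first phone of the word, by contrast, requires one model per possible preceding word ending, so no single substitution for its label is correct.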