1. Field of the Invention
The present invention relates generally to the field of speech recognition and specifically to a method and apparatus for processing speech information to automatically generate phonological rules which may be used, for example, to facilitate the recognition of continuous speech.
2. Description of the Prior Art
Most speech recognition systems operate, at least at a high level of abstraction, in substantially the same manner. Discontinuous spoken words are converted to sampled data electrical signals which are then analyzed to generate a sequence of tokens representing specific sounds. These tokens are analyzed to determine which word or words correspond to the sequence of tokens. The words so determined are provided as the output of the speech recognition system.
Many speech recognition systems of this type analyze only discontinuous speech, that is to say words spoken with interstitial pauses. This limitation makes the systems easier to design since phonological models of words spoken in this manner tend to be more consistent than the models which may apply for the more natural continuous speech. These phonological models are used in the analysis of the sampled electrical signals.
An exemplary system for recognizing discontinuous spoken words is described below in reference to FIGS. 1-4, labeled "prior art." In the described system, each word in a prescribed vocabulary is represented as a sequence of component parts known as phonemes. Each of these sequences is known as a "baseform" and represents an idealized pronunciation of the word. Traditionally, phonetic baseforms have been generated by phoneticians.
In the system described below, each phoneme in a baseform is represented by a statistical model called a "phonemic phone machine." A phonemic phone machine represents a phoneme as a probabilistic combination of sound samples called "fenemes" or, more simply, "labels." Statistics are developed for each phone machine by analyzing a known spoken text. Some known speech recognition systems use baseforms which have fenemes rather than phonemes as their component parts. In this instance, a fenemic phone machine, i.e. one in which each feneme is represented as a probabilistic combination of fenemes, is used to model the pronunciation of the feneme. The exemplary system shown in FIGS. 1-4 uses phonemic baseforms and phonemic phone machines.
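As a rough illustration of how a phone machine may assign a probability to a sequence of labels, the following sketch treats the phone machine as a small hidden Markov model and evaluates a label sequence with the standard forward recursion. It is a simplified sketch, not the structure of the phone machines of FIGS. 1-4: the two-state topology, the label names "f1" and "f2", the choice to emit labels from states, and all probability values are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def label_sequence_probability(
    labels: Sequence[str],
    start: List[float],
    trans: List[List[float]],
    emit: List[Dict[str, float]],
) -> float:
    """Probability that the model produces `labels` (forward recursion).

    start[s]    : probability of beginning in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : label -> probability of producing that label in state s
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    n = len(start)
    # alpha[s] = probability of the labels seen so far, ending in state s
    alpha = [start[s] * emit[s].get(labels[0], 0.0) for s in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s].get(label, 0.0)
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right phone machine (illustrative values only).
START = [1.0, 0.0]
TRANS = [[0.5, 0.5],
         [0.0, 1.0]]
EMIT = [{"f1": 0.8, "f2": 0.2},
        {"f1": 0.1, "f2": 0.9}]

p_forward = label_sequence_probability(["f1", "f2"], START, TRANS, EMIT)
p_reverse = label_sequence_probability(["f2", "f1"], START, TRANS, EMIT)
```

With these assumed statistics, the label sequence "f1", "f2" scores higher than the reversed sequence, since the first state favors "f1" and the second favors "f2". In a real system, the statistics would be estimated from a known spoken text rather than set by hand.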
Once the statistics for the phone machines have been developed, they may be used to analyze the sampled data signal representing fenemes derived from individually uttered words to determine one or more likely sequences of phonemes that correspond to the sampled data signal.
This sequence of phonemes is then compared to selected ones of the baseforms, which incorporate the likely sequences of phonemes, to decide which words from the prescribed vocabulary are most likely to have been spoken.
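The comparison of a decoded phoneme sequence against baseforms may be sketched as follows. The example substitutes a plain edit distance for the statistical match performed by the phone machines, purely to keep the illustration short. The phoneme symbols, the three-word vocabulary, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical baseforms: one idealized phoneme sequence per vocabulary word.
BASEFORMS: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "dog": ["D", "AO", "G"],
}


def rank_words(decoded: Sequence[str],
               baseforms: Dict[str, List[str]] = BASEFORMS) -> List[str]:
    """Vocabulary words ordered from best to worst match (ties broken by name)."""
    return sorted(baseforms,
                  key=lambda w: (edit_distance(decoded, baseforms[w]), w))


ranking = rank_words(["K", "AE", "T"])
```

Here `ranking` lists "cat" first because its baseform matches the decoded sequence exactly. A probabilistic system would rank candidates by likelihood instead of by a count of mismatched phonemes.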
This type of speech recognition system works reasonably well for discontinuous speech because individually spoken words tend to conform to the idealized baseforms. However, in continuous speech, coarticulation among the words tends to reduce the conformance of a spoken word to any idealized model, such as a baseform.
A phonetic baseform may be compensated for coarticulation effects by specifying rules which change the baseform based on the context in which it is pronounced. Typically, these rules are also specified by phoneticians. But, due to the wide variety of coarticulation effects which may occur even in a limited vocabulary, the specification of these modifying rules can be a formidable task.
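By way of illustration only, a context-dependent modifying rule of the kind a phonetician might specify can be sketched as a rewrite applied to the last phoneme of a baseform, conditioned on the first phoneme of the following word. The `Rule` type, the phoneme symbols, and the two palatalization rules shown (a final "T" or "D" before "Y") are hypothetical examples and are not rules taken from this specification.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    target: str                   # word-final phoneme the rule rewrites
    next_context: FrozenSet[str]  # first phonemes of the next word that trigger it
    replacement: Optional[str]    # new phoneme, or None to delete the target


# Hypothetical hand-written rules (illustrative only).
RULES = [
    Rule("T", frozenset({"Y"}), "CH"),  # e.g. "want you" heard as "wanchu"
    Rule("D", frozenset({"Y"}), "JH"),  # e.g. "did you" heard as "didju"
]


def apply_rules(baseform: Sequence[str],
                next_first: Optional[str],
                rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the baseform adjusted for the word that follows it.

    `next_first` is the first phoneme of the following word, or None when
    the word is followed by a pause. Only the first matching rule applies.
    """
    out = list(baseform)
    for rule in rules:
        if out and out[-1] == rule.target and next_first in rule.next_context:
            if rule.replacement is None:
                out.pop()
            else:
                out[-1] = rule.replacement
            break
    return out


in_context = apply_rules(["W", "AA", "N", "T"], "Y")
in_isolation = apply_rules(["W", "AA", "N", "T"], None)
```

Even this toy rule set needs one entry per combination of target phoneme and triggering context, which suggests why writing such rules by hand becomes laborious as the vocabulary and the range of coarticulation effects grow.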
U.S. Pat. No. 4,759,068 to Bahl et al. relates to a method by which a string of tokens derived from spoken words is analyzed to derive a sequence of individual fenemes which most closely corresponds to the spoken words. This patent discloses the structure of a typical speech recognition system in detail.
U.S. Pat. No. 4,559,604 to Ichikawa et al. relates to a pattern recognition system in which an input pattern is compared against a set of standard patterns to define a set of patterns that are more likely than any other patterns to be a match for the input pattern. A specific one of these selected patterns is inferred as the most likely based on one of four preferred criteria of inference.
U.S. Pat. No. 4,363,102 to Holmgren et al. relates to a speaker identification system which develops a plurality of templates corresponding to known words spoken by a corresponding plurality of speakers. An individual speaker is identified as having the smallest probabilistic distance between his spoken words and the templates corresponding to one of the known speakers.