This invention relates to speech recognition and more particularly to speech recognition systems based on hidden Markov models (HMMs) for use in language or speech instruction.
By way of background, an instructive tutorial on hidden Markov modeling processes is found in Rabiner et al., "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, Jan. 1986, pp. 4-16.
Various hidden-Markov-model-based speech recognition systems are known and need not be detailed herein. Such systems typically represent realizations of phonemes with statistical models of phonetic segments (including allophones or, more generically, phones), the parameters of which are estimated from a set of training examples.
Models of words are built as networks of appropriate phone models. A phone is an acoustic realization of a phoneme, and a phoneme is the minimum unit of speech capable of use in distinguishing words. Recognition consists of finding the most likely path through the set of word models for the input speech signal.
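By way of illustration only, the construction of a word model as a network of phone models can be sketched as follows. The lexicon, the phone symbols, the three-states-per-phone topology and the names `LEXICON`, `STATES_PER_PHONE` and `word_network` are assumptions introduced for this sketch and are not taken from any particular recognizer.

```python
# Hypothetical pronunciation lexicon: each word maps to one or more phone
# sequences. The entries and phone symbols are illustrative only.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "the": [["dh", "ah"], ["dh", "iy"]],
}

# Assumed topology: each phone model is a left-to-right chain of three states.
STATES_PER_PHONE = 3


def word_network(word):
    """Return the set of arcs (from_state, to_state) in the word's network.

    Each pronunciation contributes one left-to-right chain of phone states
    running from a shared START node to a shared END node.
    """
    arcs = set()
    for pron_index, phones in enumerate(LEXICON[word]):
        prev = "START"
        for pos, phone in enumerate(phones):
            for k in range(STATES_PER_PHONE):
                state = f"{word}/{pron_index}/{pos}:{phone}.{k}"
                arcs.add((prev, state))   # advance to the next state
                arcs.add((state, state))  # self-loop: remain in the state
                prev = state
        arcs.add((prev, "END"))
    return arcs
```

Under these assumptions, a word with several pronunciations yields parallel chains between the same START and END nodes, so that any one of the pronunciations can account for the input.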
Known hidden Markov model speech recognition systems are based on a model of speech production known as a Markov source. The speech units being modeled are represented by finite state machines. Probability distributions are associated with the transitions leaving each node (state), specifying the probability of taking each transition when visiting the node. A probability distribution over output symbols is associated with each node. The transition probability distributions implicitly model duration. The output symbol distributions are typically used to model speech signal characteristics such as spectra.
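By way of illustration only, a Markov source of the kind just described can be sketched as a small finite state machine with a transition distribution and an output symbol distribution attached to each state. The two-state topology, the symbols "a" and "b", all probability values and the function names are assumptions made for this sketch. The function `expected_duration` shows how a self-loop probability p implicitly models duration: the number of visits to a state is geometrically distributed with mean 1/(1 - p).

```python
import random

# Hypothetical two-state, left-to-right model. All numbers are illustrative.
TRANSITIONS = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s2": 0.7, "END": 0.3},
}
OUTPUTS = {
    "s1": {"a": 0.8, "b": 0.2},
    "s2": {"a": 0.1, "b": 0.9},
}


def draw(dist, rng):
    """Sample one key from a {key: probability} distribution."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate(rng, start="s1"):
    """Run the Markov source once and return (states, symbols)."""
    states, symbols = [], []
    state = start
    while state != "END":
        states.append(state)
        symbols.append(draw(OUTPUTS[state], rng))  # emit from the state
        state = draw(TRANSITIONS[state], rng)      # then take a transition
    return states, symbols


def expected_duration(state):
    """Mean number of consecutive visits implied by the self-loop."""
    return 1.0 / (1.0 - TRANSITIONS[state][state])
```

For example, `generate(random.Random(0))` returns one state sequence together with the symbol sequence it emitted, and `expected_duration("s1")` evaluates to 2.5 for the self-loop probability of 0.6 assumed above.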
The probability distributions for transitions and output symbols are estimated using labeled examples of speech. Recognition consists of determining the path through the Markov network that has the highest probability of generating the observed sequence. For continuous speech, this path will correspond to a sequence of word models.
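By way of illustration only, the search for the highest-probability path is commonly carried out with the Viterbi algorithm, which is sketched below in the log domain. The text above does not name a specific search procedure, so the choice of Viterbi decoding, the function signature and the toy model in the usage note are assumptions made for this sketch.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, best_state_path) for the observations."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log probability of the best path ending in state s at
    # time t, predecessor state on that path)
    first = observations[0]
    best = [{s: (log(start_p[s]) + log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][r][0] + log(trans_p[r][s]), r) for r in states
            )
            column[s] = (score + log(emit_p[s][obs]), prev)
        best.append(column)

    # Trace back from the best final state to recover the whole path.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return best[-1][last][0], path
```

As a usage example with hypothetical numbers, take states `("s1", "s2")`, start probabilities `{"s1": 0.6, "s2": 0.4}`, transitions `{"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}` and output distributions `{"s1": {"lo": 0.5, "mid": 0.4, "hi": 0.1}, "s2": {"lo": 0.1, "mid": 0.3, "hi": 0.6}}`. For the observed sequence `["lo", "mid", "hi"]`, the best path is `["s1", "s1", "s2"]` with probability 0.01512.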
Models are known for accounting for out-of-vocabulary speech, herein called reject phone models but sometimes called "filler" models. Such models are described in Rose et al., "A Hidden Markov Model Based Keyword Recognition System," Proceedings of IEEE ICASSP, 1990.
The specific hidden Markov model recognition system employed in conjunction with the present invention is the Decipher speech recognizer, which is available from SRI International of Menlo Park, Calif. The Decipher system incorporates probabilistic phonological information, a trainer capable of training phonetic models with different levels of context dependence, multiple pronunciations for words, and a recognizer.
The co-inventors, together with others, have published papers and reports on instructional development peripherally related to this invention. Each mentions early versions of question and answer techniques. See, for example, "Automatic Evaluation and Training in English Pronunciation," Proc. ICSLP 90, Nov. 1990, Kobe, Japan; "Toward Commercial Applications of Speaker-Independent Continuous Speech Recognition," Proceedings of Speech Tech 91, Apr. 23, 1991, New York, N.Y.; and "A Voice Interactive Language Instruction System," Proceedings of Eurospeech 91, Genoa, Italy, Sep. 25, 1991. These papers described only what an observer of a demonstration might experience.
Other language training technologies are known. For example, U.S. Pat. No. 4,969,194 to Ezawa et al. discloses a system for simple drilling of a user in pronunciation in a language. The system has no speech recognition capabilities, but it appears to have a signal-based feedback mechanism using a comparator which compares a few acoustic characteristics of speech and the fundamental frequency of the speech with a reference set.
U.S. Pat. No. 4,380,438 to Okamoto discloses a digital controller of an analog tape recorder used for recording and playing back a user's own speech. There are no recognition capabilities.
U.S. Pat. No. 4,860,360 to Boggs discloses a system for evaluating speech in which distortion in a communication channel is analyzed. There is no alignment or recognition of the speech signal against any known vocabulary, as the disclosure relates only to signal analysis and distortion measure computation.
U.S. Pat. No. 4,276,445 to Harbeson describes a speech analysis system which produces little more than an analog pitch display. It is not believed to be relevant to the subject invention.
U.S. Pat. No. 4,641,343 to Holland et al. describes an analog system that extracts formant frequencies, which are fed to a microprocessor for ultimate display to a user. The only feedback is a graphic presentation of a signature which is directly computable from the input signal. There is no element of speech recognition or of any other high-level processing.
U.S. Pat. No. 4,783,803 to Baker et al. discloses a speech recognition apparatus and technique which includes means for determining where among frames to look for the start of speech. The disclosure contains a description of a low-level, acoustically based endpoint detector which processes only acoustic parameters, but it does not include any higher-level, context-sensitive endpoint detection capability.
What is needed is a recognition and feedback system which can interact with a user in a linguistically context-sensitive manner, tracking the user's reading of a script in a quasi-conversational fashion, in order to instruct the user in properly rendered, native-sounding speech.