1. Field of the Invention
This invention relates to a speech recognition apparatus and method. More precisely, this invention relates to a speech recognition apparatus and method useful for learning language skills.
2. Description of the Related Technology
Modern multimedia computers can be an effective tool for learning new language skills. Still and animated visual images, text, and audio can be interwoven in interactive programs. The audio may include both computer-generated (synthesized) speech and large amounts of high quality natural speech, prerecorded and stored efficiently in digital form, and accessed as required for the interactive learning program. The weak link, however, is the ability of a computer to accept and act differentially in response to the speech of a learner in an interactive language learning application.
Even the most powerful computer is still unable to analyze the meaning, syntax, and pronunciation of a user's unrestricted continuous speech in near-real time, as does the human auditory system. Compared to the human capability, the abilities of the best computers (even without the real-time restriction needed for interactive situations) are extremely rudimentary. Various articles have been written that describe the inadequacy of various speech recognition systems, including Richard D. Peacocke & Daryl H. Graf, An Introduction to Speech and Speaker Recognition, COMPUTER Magazine, August 1990, at 26-34; David B. Roe & Jay G. Wilpon, Whither Speech Recognition: The Next 25 Years, IEEE Communications Magazine, November 1993, at 54-62; and Alexander I. Rudnicky, et al., Survey of Current Speech Technology, Communications of the ACM, March 1994, at 52-56.
However, there are many computer-based systems for accepting and acting differentially in response to human speech that emulate enough of the human capability to be useful in specific applications. Among these are "speech understanding" systems, which are directed towards identification of the intended meaning or semantic content of an input segment of speech. Speech understanding systems may be useful, for example, in task-oriented human-computer interactions in which it is not important to identify the actual words spoken or obtain an analysis of the syntax or pronunciation of the speech. See U.S. Pat. No. 5,357,596 and Hirohjuki Tsuboi, et al., A Real-Time Task-Oriented Speech Understanding System Using Keyword Spotting, IEEE Transactions on Acoustical Speech and Signal Processing, September 1992, at 197-200.
There are other applications, such as automated dictation or voice data entry, for which it is important for a computer to identify the actual words spoken. Systems accepting speech input that emphasize the correct identification of particular words or sequences of words are commonly referred to as "speech recognition" systems, or "word recognition" systems if the speech input segments are limited to single words or short phrases. (Unfortunately, the term "speech recognition" is also used loosely as a descriptive label for the totality of systems accepting and selectively acting in response to speech input.) For example, in a system for automated dictation for English, it is of primary importance to obtain the correct orthographic representation of a spoken word, and meaning is secondary. If the spoken sentence is "He's a bear of a man," or "He will bear the burden," the semantic content of "bear" is only important to the extent that it is not transcribed "bare." In "He tore across the room," the word "tore," to a speech understanding system, must be unambiguously understood as "moved quickly," whereas to a speech recognition system for dictation it is merely the word that happens to be the past tense of "tear."
A third type of system accepting and selectively responding to a speech input can be termed a "speech analysis" system. Speech analysis systems are used in applications in which properly and improperly formed or pronounced speech must be differentiated, whereas speech recognition and speech understanding systems are primarily directed, respectively, to determining the intended linguistic segments or the intended semantic content relative to a specific context or task to be performed.
To be maximally useful for language instruction, a system for interactive human-computer speech communication must be able to exhibit all three of these capabilities of a human tutor familiar with the language: understanding, segment recognition, and speech pattern analysis. To accomplish this, there would have to be either a very significant improvement in the capabilities of present speech input technology or the devising of a system for human-computer speech interaction in which a speech input technology that is constructed according to the present state-of-the-art can be used. This invention is directed toward the latter approach.
Among the attempts to create a restricted emulation of this human speech recognition capability, there is presently a technology available for speech input to a computer that is constructed so as to automatically determine which of a pre-specified, finite number of speech segments was spoken. A software package, or a combination of software and hardware, designed to accomplish this task can be referred to as a discrete speech recognition engine ("DSRE"). This type of system has been applied to purposes such as voice data entry and providing voice control over some of the functions of a computer or voice control over the functions of a system operated under the control of a computer. In another application, a DSRE with a large vocabulary can be used for dictation, if the words are spoken separately with a short pause after each word. This capability is often referred to as discrete speech recognition.
A DSRE contains a computer program that accepts a sample digitized speech waveform X, and compares it to a preset number N of waveforms, information about which is stored in computer memory. The DSRE computer program may calculate a number, or set of numbers, that reflects the similarity of the sample waveform X to each of the N internally stored waveforms according to an analysis scheme used by the computer program. This calculated number is often referred to as a distance measure. The "distances" between the sample waveform X and each of the N internal waveforms may then be represented by an array [X, Ai], where i varies from one to N. The computer program can produce all of the distance measures between the sample waveform and the internal waveforms and determine the internal waveform yielding the best match (having the smallest distance measure) to the sample waveform X, or determine that no internal sample was a good enough match, according to criteria set by the computer program or the user.
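The matching step described above can be sketched as follows. This is a minimal illustration, not the analysis scheme of any particular DSRE: it assumes each waveform has already been reduced to a fixed-length feature vector, uses Euclidean distance as a stand-in for the distance measure, and the names `distance`, `best_match`, and `reject_threshold` are hypothetical.

```python
import math

def distance(x, a):
    """Assumed distance measure: Euclidean distance between two
    equal-length feature vectors (a stand-in for a real DSRE scheme)."""
    if len(x) != len(a):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((xi - ai) ** 2 for xi, ai in zip(x, a)))

def best_match(sample, templates, reject_threshold):
    """Compute the distances [X, Ai] for i = 1..N and pick the closest template.

    Returns (index, distances). The index is None when even the closest
    internal template is farther away than reject_threshold, i.e. no
    internal sample was a good enough match.
    """
    distances = [distance(sample, a) for a in templates]
    best = min(range(len(distances)), key=distances.__getitem__)
    if distances[best] > reject_threshold:
        return None, distances
    return best, distances

# Three hypothetical internal templates and one sample.
templates = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
index, dists = best_match([3.0, 3.0], templates, reject_threshold=2.0)
```

Tightening `reject_threshold` trades unrecognized segments for fewer false matches; the text leaves that criterion to the computer program or the user.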
Internal waveforms can be formed so as to enable a DSRE to accept a wide range of voices as matching, thus making the matching process speaker-independent to a greater or lesser degree. Conversely, a DSRE that derives its internal waveforms only from examples of the voice of the user is well suited primarily to that one person and may be highly speaker-dependent. In practice, however, there is no clear line between speaker-dependent and speaker-independent, because a DSRE designed to be speaker-dependent may work well with other speakers having speech characteristics similar to those of the intended user, and a DSRE designed to be speaker-independent may often require optimization for individual speakers having voices or speech characteristics far from the expected norm (such as young children, speakers having a strong regional or cultural dialect or foreign accent, or speakers having a partially dysfunctional voice as a result of illness).
There are a number of modifications of the basic scheme for discrete speech recognition that are intended to increase its scope or capabilities. In an adaptive DSRE, the performance of a speaker-independent system is optimized by having the system automatically incorporate information about the user's speech characteristics while the system is being used. Such a system is described in Paul Bamberg, et al., Phoneme-Based Training for Large-Vocabulary Recognition in Six European Languages, Proc. of the Second Eur. Conf. on Speech Comm. and Tech. (Geneva), September 1991, at 175-81. An adaptive system can be "broadly adaptive" if it only identifies and adapts to a large class of speakers to which the speaker belongs (male vs. female; American English speaker vs. British English speaker) or "narrowly adaptive" if it adapts to the detailed speech characteristics of the speaker.
In another modification of the basic scheme, to reduce the number of internal models that must be stored in the computer as the vocabulary is increased, models for recurring parts of a word, such as the phonemes (letter-sounds, to a first approximation) or short phoneme sequences, are stored, and complete word or phrase models are constructed from these segment models. There is generally a resulting decrease in recognition accuracy, which may be acceptable in some applications. Paul Bamberg, et al., Phoneme-Based Training for Large-Vocabulary Recognition in Six European Languages, Proc. of the Second Eur. Conf. on Speech Comm. and Tech. (Geneva), September 1991, at 175-81; Kai-Fu Lee, Context-Dependent Phonetic Hidden Markov Models for Speaker-Independent Continuous Speech Recognition, IEEE Transactions on Acoustical Speech and Signal Processing, April 1990, at 599-609. Some commercial DSREs include both whole-word/phrase models and constructed models. See U.S. Pat. No. 5,315,689.
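The idea of constructing word models from stored segment models can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the phoneme symbols, the feature frames, the lexicon, and the simple concatenation rule. The cited systems use considerably more elaborate statistical models.

```python
# Hypothetical segment models: each phoneme maps to a short list of
# feature frames. The values are invented for illustration only.
PHONEME_MODELS = {
    "g": [[0.1, 0.9]],
    "n": [[0.2, 0.8]],
    "ow": [[0.7, 0.3], [0.6, 0.2]],
}

# Hypothetical pronunciation lexicon: word -> phoneme sequence.
LEXICON = {
    "go": ["g", "ow"],
    "no": ["n", "ow"],
}

def build_word_model(word, lexicon=LEXICON, phoneme_models=PHONEME_MODELS):
    """Construct a word model by concatenating the stored phoneme models."""
    frames = []
    for phoneme in lexicon[word]:
        frames.extend(phoneme_models[phoneme])
    return frames

go_model = build_word_model("go")
no_model = build_word_model("no")
```

Only three segment models are stored here, yet two word models can be built from them; the saving grows as the vocabulary grows, because the phoneme inventory stays roughly fixed.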
In yet another modification, often referred to as "word-spotting," the DSRE is designed to correctly match an input speech waveform segment to an appropriate internally-stored pattern even when the input waveform segment is embedded in a longer segment of continuous speech. Jay G. Wilpon, et al., Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models, IEEE Transactions on Acoustical Speech and Signal Processing, November 1990, at 1870-78; Hirohjuki Tsuboi, et al., A Real-Time Task-Oriented Speech Understanding System Using Keyword Spotting, IEEE Transactions on Acoustical Speech and Signal Processing, September 1992, at 197-200.
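A deliberately simplified picture of word-spotting is a fixed-length window slid along the longer utterance, with each window scored against the stored pattern. The sketch below assumes frame-level feature vectors and a Euclidean frame distance, and the function names are hypothetical. It is not the hidden Markov model approach of the cited references, which also handles variation in speaking rate.

```python
import math

def frame_distance(a, b):
    """Assumed per-frame distance: Euclidean distance between feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spot_keyword(template, stream):
    """Slide the template over the stream and return (start, score) for the
    window with the smallest mean frame distance. Returns None when the
    template is empty or the stream is shorter than the template."""
    n = len(template)
    if n == 0 or len(stream) < n:
        return None
    best_start, best_score = 0, math.inf
    for start in range(len(stream) - n + 1):
        window = stream[start:start + n]
        score = sum(frame_distance(t, w) for t, w in zip(template, window)) / n
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score

# A three-frame keyword embedded in a longer six-frame utterance.
template = [[1.0], [2.0], [3.0]]
stream = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0]]
result = spot_keyword(template, stream)
```

In practice the returned score would still be compared against a rejection criterion, since the best window in an utterance that does not contain the keyword can be a poor match.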
In describing this invention, the term "discrete speech recognition engine" or "DSRE" shall be identified with the basic whole-word/phrase implementation, although other modifications such as the above could be employed as the technology improves.
A DSRE for which a user can specify some or all of the lexical items (speech segments) which the DSRE must differentiate should also be able to compute (N² - N) distance measures between the N internal models corresponding to these lexical items: {[A1, A2], [A1, A3], . . . , [A2, A3], [A2, A4], . . . , [AN-1, AN]}, since this set of internal distance measures can be used to determine whether the internal waveforms are separated sufficiently by the DSRE analysis scheme. For example, in using a DSRE for control purposes, the user might desire the English commands "go" and "no" as lexical items. However, the analysis scheme used may not distinguish adequately between the internal samples representing these two speech segments; that is, the distance measure [A(go), A(no)] might be very small. The user could then substitute alternate terms with a roughly equivalent meaning, such as "start" or "proceed" for "go" or "negative" for "no."
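The separation check described above can be sketched as follows. The feature vectors, the Euclidean distance, and the `min_separation` criterion are all assumptions made for illustration. Note that when the distance measure is symmetric, each of the N² - N ordered pairs has the same value as its reverse, so only N(N - 1)/2 distances actually need to be computed.

```python
import math
from itertools import combinations

def distance(a, b):
    """Assumed symmetric distance measure between two internal models."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def confusable_pairs(models, min_separation):
    """Return (word_a, word_b, distance) for every pair of internal models
    closer together than min_separation, closest pair first."""
    close = []
    for (wa, ma), (wb, mb) in combinations(models.items(), 2):
        d = distance(ma, mb)
        if d < min_separation:
            close.append((wa, wb, d))
    return sorted(close, key=lambda item: item[2])

# Hypothetical internal models for three candidate command words.
models = {
    "go": [1.0, 0.0],
    "no": [1.1, 0.0],
    "start": [5.0, 5.0],
}
flagged = confusable_pairs(models, min_separation=0.5)
```

Here only the pair ("go", "no") is flagged, which mirrors the example in the text: the user could replace "go" with "start" and rerun the check before committing to the command set.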
In the applications mentioned above, there is at any one time a specified set of allowable utterances which the DSRE is required to differentiate. However, this set of allowable utterances can be made changeable by the computer system in which the DSRE is embedded. For example, at one time the specified set might be based on the voice and selections of a first user while at another time it might be based on a second user. Unlike the system disclosed below, however, in such previous systems it is the case that, given the set active at any specific time, the speaker can choose any utterance from the set, with all such utterances equally valid.
When used for language learning, a conventional DSRE has a number of significant restrictions:
(1) The DSRE can only respond to a predetermined set of isolated words, phrases, or sentences. The computer program considers a sample phrase or sentence as a complete "word," and the program does not analyze linguistically the sample waveform X into component meanings.
(2) The vocabulary, though often programmable by the user to some extent, must be very limited in size to obtain good performance in a speaker-independent application. For example, a vocabulary of 100 words can strain the capabilities of the typical DSRE designed for a personal computer; however, a total vocabulary of that size would be restrictive for language instruction, especially if each complete phrase or sentence to be recognized must be considered a separate "word."
(3) To minimize errors and unrecognized speech segments, speaker-independent DSREs require the speakers to already know the language well enough to produce a voice pattern that matches the voices used to create the internal samples of the computer program. Thus, if a student first learning a new language used a conventional DSRE in a manner consistent with the present state-of-the-art, the computer program would often be unable to properly recognize the student's sample speech waveforms. This inability cannot be readily cured by the use of a speaker-dependent DSRE trained using the unselected speech patterns of the individual student or by the use of a narrowly adaptive DSRE, as defined above. Though strong speaker-dependence would reduce identification errors and unrecognized speech segments, it would not be pedagogically effective, since it would tend to build the mistakes of the student into the teaching program.
(4) To achieve near-real-time performance, it is necessary for a DSRE to ignore many aspects of the input speech waveform that would normally be considered important to good pronunciation. For example, variations in voice pitch are usually largely ignored in DSREs designed for the recognition of spoken English, because these variations are not very important in signaling differences in word meaning for the English language. (This is not true for Mandarin Chinese and certain other languages.) Also, such programs tend to differentiate poorly between certain close-sounding consonants or vowels and, in order to make the DSRE less sensitive to the rate-of-speech, current analysis schemes ignore many durational features of the individual sounds or words, even though such features may be important to the phonology (pronunciation patterns) of the language.
A problem considered by the present apparatus and method is that of employing a DSRE having some or all of the above limitations as a component in an interactive system for teaching or improving oral language skills. In particular, this application presents an apparatus and method for using a DSRE having such limitations to perform speech segment recognition, speech understanding, and speech analysis functions useful for the improvement of oral language skills.