This invention relates to a method and apparatus for interactive language instruction. More particularly, the invention is directed to a multi-media and multi-modal computer application that displays text files for processing, provides features and functions for interactive learning, displays facial animation, and provides a workspace for language building functions. The system includes a stored set of language rules as part of the text-to-speech sub-system, as well as another stored set of rules as applied to the process of learning a language. The method implemented by the system includes digitally converting text to audible speech, providing the audible speech to a user or student (with the aid of an animated image in selected circumstances), prompting the student to replicate the audible speech, comparing the student's replication with the audible speech provided by the system, conducting performance analysis on the speech (utterance) and providing feedback and reinforcement to the student by, for example, selectively recording or playing back the audible speech and the student's replication.
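By way of illustration only, the sequence of steps recited above (converting text to speech, presenting it, prompting the student, comparing the replication, and providing feedback) may be sketched as a simple loop. The sketch below is not the claimed implementation. The names `run_lesson`, `Attempt`, `synthesize`, `play`, `capture`, and `compare`, as well as the pass threshold and retry limit, are hypothetical and are introduced solely for this example; the text-to-speech and speech comparison components are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Attempt:
    """Outcome of one student attempt at one phrase (hypothetical record)."""
    text: str
    score: float
    passed: bool


def run_lesson(
    phrases: Iterable[str],
    synthesize: Callable[[str], Any],    # text -> model audio (assumed TTS hook)
    play: Callable[[Any], None],         # plays audio to the student
    capture: Callable[[str], Any],       # prompts for and records the replication
    compare: Callable[[Any, Any], float],  # (model, replication) -> score in [0, 1]
    threshold: float = 0.7,              # assumed pass mark, not from the source
    max_tries: int = 3,                  # assumed retry limit, not from the source
) -> List[Attempt]:
    """Run the present / prompt / compare / feedback cycle for each phrase."""
    results: List[Attempt] = []
    for text in phrases:
        model = synthesize(text)              # digitally convert text to speech
        for _ in range(max_tries):
            play(model)                       # provide the audible speech
            student = capture(text)           # student replicates the speech
            score = compare(model, student)   # performance analysis
            passed = score >= threshold
            results.append(Attempt(text, score, passed))
            if passed:
                break
            play(student)                     # feedback: play back the replication
    return results
```

Because every component is passed in as a callable, the same loop can be exercised with simple stand-ins (for instance, strings in place of audio) before any speech engine is attached.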
While the invention is particularly directed to the art of interactive language instruction, and will be thus described with specific reference thereto, it will be appreciated that the invention may have usefulness in other fields and applications. For example, the invention may be used to teach general speech skills to individuals with speech challenges or may be used to train singers to enhance vocal skills.
By way of background, interactive language instruction programs are known. For example, U.S. Pat. No. 5,634,086 to Rtischev et al. is directed to a spoken language instruction method and apparatus employing context based speech recognition for instruction and evaluation. However, such known language instruction systems require the use of recorded speech as a model with which to compare a student's attempts to speak a language sought to be learned.
Preparing a lesson based on recorded speech involves considerable work. This work includes preparing a script; recording phrases, words, etc.; creating illustrations, photographs, video, or other media; and linking the sound files with the images and with the content of the lessons. Alternatively, it may require providing large databases of alternative replies in dialogue systems that are designed to replicate interactions with students for context-based lessons.
Moreover, language students may be interested in learning words, phrases, and context of particular interest to them, such as industry-specific terms from their workplace (computer industry, communications, auto repair, etc.). Producing such specialized content is difficult when recorded speech is used for the language lesson, because each new term or phrase must be separately scripted and recorded.
Other difficulties with using recorded speech in this context are numerous. The quality of the recording medium may present problems; for example, an excessive amount of background noise may degrade the quality of the recording. In addition, recorded speech is subject to many other factors that may undesirably enter the speech model. For example, recorded speech may include a speaker's accent resulting from the speaker being a native of a particular geographic area. Likewise, recorded speech may reflect a particular emotional state of the speaker, such as whether the speaker is tired or upset. As a result, in any of these circumstances, as well as others, the shortcomings of recorded speech make it more difficult for a student to learn a language lesson.
A few products exist which allow users to process files of text to be read aloud by synthesized or recorded speech technologies. These products are commonly known as text-to-speech engines. See, for example, U.S. Pat. No. 5,751,907 to Moebius et al. (issued May 12, 1998) and U.S. Pat. No. 5,790,978 to Olive et al. (issued Aug. 4, 1998), both of which are incorporated herein by reference. Some existing products also allow users to add words to a dictionary, make modifications to word pronunciations in the dictionary, or modify the sound created by a text-to-speech engine. See, for example, U.S. application Ser. No. 09/303,057, entitled “Graphical User Interface and Method for Modifying Pronunciations in Text-To-Speech and Speech Recognition Systems” which was filed Apr. 30, 1999 on behalf of August and McNerney (Case Name and No: August 23-7), which is incorporated herein by reference.
Voice or speech recognition systems are also known. These systems use a variety of techniques for recognizing speech patterns including utterance verification or verbal information verification (VIV), for which a variety of patents owned by Lucent Technologies have been applied for and/or issued. Among these commonly assigned patents/applications are U.S. Pat. No. 5,797,123 to Chou et al. (filed Dec. 20, 1996; issued Aug. 18, 1998); U.S. application Ser. No. 896,355 to B. Juang, C. Lee, Q. P. Li, and Q. Zhou filed on Jul. 18, 1997 (and corresponding application EP 892 387 A1 published on Jan. 20, 1999); U.S. application Ser. No. 897,174 to B. Juang, C. Lee, Q. P. Li, and Q. Zhou filed on Jul. 18, 1997 (and corresponding application EP 892 388 A1 published on Jan. 20, 1999); and U.S. Pat. No. 5,649,057 to Lee et al. (filed Jan. 16, 1996; issued Jul. 15, 1997), all of which are incorporated herein by this reference.
It would be desirable to have available an interactive language instruction program that does not rely exclusively on recorded speech and that utilizes reliable speech recognition technology, such as that which incorporates utterance verification or verbal information verification (VIV). It would also be desirable to evaluate a speaker's utterance with predictive models in the absence of a known model. Such a system would provide a confidence measure against any acoustic model, from which a score can be derived. It would also be desirable to have available such a system that selectively incorporates facial animation to assist a student in the learning process.
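As a hedged illustration of how a score might be derived from a confidence measure, the sketch below maps a log-likelihood ratio between a target acoustic model and a competing (anti) model to a confidence between 0 and 1, and then averages the per-segment confidences onto a 0 to 100 scale. This is one commonly used form of confidence measure and is not asserted to be the method of the references cited above. The function names, the logistic mapping, the `slope` parameter, and the 0 to 100 scale are all assumptions made for this example, and the log-likelihoods are assumed to be produced by a recognizer that is not shown.

```python
import math
from typing import Sequence, Tuple


def segment_confidence(target_loglik: float, anti_loglik: float,
                       slope: float = 1.0) -> float:
    """Map a log-likelihood ratio to a confidence in (0, 1).

    target_loglik: log-likelihood of the segment under the expected model.
    anti_loglik:   log-likelihood under a competing (anti) model.
    slope:         assumed tuning parameter for the logistic mapping.
    """
    llr = target_loglik - anti_loglik
    # Clamp the exponent so extreme ratios cannot overflow math.exp.
    z = max(-50.0, min(50.0, slope * llr))
    return 1.0 / (1.0 + math.exp(-z))


def utterance_score(segments: Sequence[Tuple[float, float]],
                    slope: float = 1.0) -> float:
    """Average per-segment confidences and scale to 0-100 (assumed scale)."""
    if not segments:
        raise ValueError("at least one segment is required")
    confidences = [segment_confidence(t, a, slope) for t, a in segments]
    return 100.0 * sum(confidences) / len(confidences)
```

A segment that the target model explains no better than the anti model receives a confidence of 0.5, and the confidence rises toward 1 as the target model becomes the better explanation.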
The present invention contemplates a new and improved interactive language instructor which resolves the above-referenced difficulties and others.