The present invention relates to apparatus and methods for providing language-independent suprasegmental analysis and audio-visual feedback of prosodic features of a user's pronunciation.
The increasing globalization of world economies makes it essential for individuals to be able to communicate successfully in languages other than their own. For individuals to be effective communicators in a new language, it is essential to learn proper pronunciation. Intonation, stress and rhythm are key prosodic features of effective communication and are critical for comprehension. Thus, there is a need for effective pronunciation teaching aids.
Pronunciation training, however, generally is perceived as difficult because it is often hard for a student to pick up the peculiarities of pronunciation of a foreign language. Often, it is difficult for the student even to recognize his or her own mistakes. Moreover, it is often difficult to train teachers to detect errors in pronunciation and therefore provide beneficial feedback to students.
Previously known language training systems generally may be grouped into two categories: (1) systems based on speech analysis (i.e., analysis of how an utterance is pronounced); and (2) systems based on speech recognition (i.e., recognition of what is said). Some systems use a combined approach, in which the speech is partially analyzed and partially recognized.
Commercially available systems in the speech analysis category are: The Speech Viewer, available from IBM Corporation, White Plains, N.Y.; VisiPitch, available from Kay Corporation, Lincoln Park, N.J.; and newer versions of Language Now, available from Transparent Language, Inc., Hollis, N.H. All of these systems extract and visualize the pitch of an utterance. An overview of methods for extracting pitch is provided, for example, at pages 197-208 of Parsons, Voice and Speech Processing, McGraw-Hill Book Company (1987).
A drawback common to all of the foregoing speech analysis methods is that they extract pitch independently of its relevancy to intonation pattern. Thus, for example, such systems extract pitch even for vocal noises. These previously known systems therefore do not address crucial prosodic parameters of speech, such as rhythm, stress and syllabic structure.
Commercially available systems in the speech recognition category are those offered by: The Learning Company, Knoxville, Tenn.; Transparent Language, Inc., Hollis, N.H.; Syracuse Language Systems, Inc., Syracuse, N.Y.; and IMSI, San Rafael, Calif. In addition, several companies offer speech recognition engines, including: Dragon Systems, Newton, Mass.; IBM Corporation, White Plains, N.Y.; and Lernout and Hauspie, Brussels, Belgium.
Most previously known language training systems present a sentence for a student to pronounce, record the student's utterance, and then calculate the distance between the student's utterance and that of a generalized native speaker. The calculated distance is presented to the student in the form of an indicator on a gauge and/or a graphical comparison of a waveform of the student's utterance to the waveform for a native speaker.
A disadvantage common to all of these previously known language training systems is that the grading of the student's utterance is arbitrary and non-specific. In particular, use of just a single parameter (the distance between the student's and native speaker's utterances) provides little useful information. This is because a speech signal represents an intrinsically multi-parametric system, and this richness is not quantifiable using the distance method alone. Additionally, the visual feedback provided by a graphical comparison of the student's unanalyzed waveform conveys little useful information. Finally, all of the foregoing systems are language dependent.
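The limitation of a single distance value can be illustrated with a minimal sketch. It assumes, purely for illustration, that each utterance has been reduced to one feature value per syllable, and it uses a mean absolute difference as a stand-in for the distance measure, which the systems discussed above do not specify. The function name and the sample values are hypothetical.

```python
def mean_abs_distance(reference, attempt):
    """Hypothetical single-number score: mean absolute per-syllable difference."""
    if len(reference) != len(attempt):
        raise ValueError("sequences must have equal length")
    return sum(abs(r - a) for r, a in zip(reference, attempt)) / len(reference)


# Assumed per-syllable feature values for a native speaker (weak-strong-weak-strong).
native = [1.0, 3.0, 1.0, 3.0]

# Attempt A: the stressed syllables are too weak, so the stress pattern is flattened.
flattened_stress = [1.0, 2.0, 1.0, 2.0]

# Attempt B: the pattern is correct, but every value is offset by the same amount.
uniform_offset = [1.5, 3.5, 1.5, 3.5]

score_a = mean_abs_distance(native, flattened_stress)
score_b = mean_abs_distance(native, uniform_offset)
print(score_a, score_b)
```

Both attempts receive the same score even though only the first alters the stress pattern, so the number alone does not tell the student what to correct.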
Indeed, the literature relating to pronunciation training has recognized the shortcomings of available computer training systems for some time. See, e.g., D. M. Chun, "Signal Analysis Software for Teaching Discourse Intonation," Lang. Learning and Tech., 2(1):61-77 (1998); H. Norman, "Speech Recognition: Considerations for use in Language Learning," EuroCALL '98; and T. Van Els and K. de Bot, "The Role of Intonation in Foreign Accent," Modern Lang. J., 71:147-155 (1987).
U.S. Pat. No. 5,799,276 describes a knowledge-based speech recognition system for translating an input speech signal to text. The system described in that patent captures an input speech signal, segments it based on the detection of pitch period, and generates a series of hypothesized acoustic feature vectors that characterizes the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. A largely speaker-independent dictionary, based upon the application of phonological and phonetic/acoustic rules, is used to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices. Local and global syntactic analysis of the word choices is provided to enhance the recognition capability of the system.
In view of the foregoing, it would be desirable to provide a voice and pronunciation training system that gives more meaningful audio-visual feedback than provided by previously known systems, by providing extensive audio-visual feedback pertinent to prosodic training.
It also would be desirable to provide a voice and pronunciation training system that provides easy-to-understand visualization of intonation, stress and rhythm patterns, visualizes syllabic structure of an utterance, and enables a user to pinpoint his or her pronunciation errors.
It further would be desirable to provide a voice and pronunciation training system that is curriculum independent and may be easily customized for different curricula.
It still further would be desirable to provide a voice and pronunciation training system that is language independent, thereby enabling a student to practice intonation, stress and rhythm patterns of a foreign language using his or her native language or free forms like "ta-ta-ta".
It also would be desirable to provide a voice and pronunciation training system that enables suprasegmental analysis and visual feedback of an utterance for deaf and semi-deaf speakers who require visual feedback during speech training to compensate for hearing impairment, and for use by speech pathologists during their work with patients.
In view of the foregoing, it is an object of this invention to provide a voice and pronunciation training system that gives more meaningful audio-visual feedback than provided by previously known systems, by providing extensive audio-visual feedback pertinent to prosodic training.
It is also an object of this invention to provide a voice and pronunciation training system that provides easy-to-understand visualization of intonation, stress and rhythm patterns, visualizes syllabic structure of an utterance, and enables a user to pinpoint his or her pronunciation errors.
It further is an object of the present invention to provide a voice and pronunciation training system that is curriculum independent and may be easily customized for different curricula.
It is another object of this invention to provide a voice and pronunciation training system that is language independent, thereby enabling a student to practice intonation, stress and rhythm patterns of a foreign language using his or her native language or free forms like "ta-ta-ta".
It is a still further object of the present invention to provide a voice and pronunciation training system that enables suprasegmental analysis and visual feedback of an utterance for deaf and semi-deaf speakers who require visual feedback during speech training to compensate for hearing impairment, and for use by speech pathologists during their work with patients.
These and other objects of the present invention are accomplished by providing a pronunciation training system and methods that analyze a user's utterance, extract prosodic features and present metrics corresponding to those features to the user in an easy-to-understand manner that is language and training curriculum independent. The pronunciation training system and methods are provided as a series of programmed routines stored on an item of removable storage media, such as an optical disk or high capacity floppy disk.
In particular, the student selects an utterance to practice from among a plurality of pre-recorded utterances spoken by a native speaker, and selects a graphical presentation mode. The student then records his or her pronunciation of the utterance. A speech analysis engine, such as that described in U.S. Pat. No. 5,799,276, is used to analyze the student's utterance to extract prosodic features. The system then computes metrics for the student's utterance, and displays those metrics for the student's utterance and the native speaker's utterance on a side-by-side basis in the selected graphical presentation mode. The system also permits the student to repeat selected phrases and to monitor improvement by the similarity between the graphical metrics.
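The side-by-side comparison described above may be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: it assumes the speech analysis engine has already segmented each utterance into syllables with a duration, an energy and a pitch value, and the `Syllable` type, the metric definitions and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    # Hypothetical per-syllable output of a speech analysis engine.
    duration_ms: float
    energy: float
    pitch_hz: float


def prosodic_metrics(syllables):
    """Compute illustrative rhythm, stress and intonation metrics."""
    total = sum(s.duration_ms for s in syllables)
    rises = syllables[-1].pitch_hz > syllables[0].pitch_hz
    return {
        # Rhythm: share of the utterance duration taken by each syllable.
        "rhythm": [round(s.duration_ms / total, 2) for s in syllables],
        # Stress: index of the syllable with the highest energy.
        "stress": max(range(len(syllables)), key=lambda i: syllables[i].energy),
        # Intonation: coarse direction of pitch from first to last syllable.
        "intonation": "rising" if rises else "falling or level",
    }


def side_by_side(native, student):
    """Pair each metric for the native speaker with the student's value."""
    n, s = prosodic_metrics(native), prosodic_metrics(student)
    return [(name, n[name], s[name]) for name in n]


native = [Syllable(100, 1.0, 120), Syllable(300, 3.0, 150)]
student = [Syllable(200, 3.0, 150), Syllable(200, 1.0, 120)]

for name, native_value, student_value in side_by_side(native, student):
    print(f"{name:<11} native={native_value}  student={student_value}")
```

Because each prosodic feature is reported separately, the student in this example can see that the rhythm, the stressed syllable and the pitch direction all differ from the native speaker's, which a single distance value would not reveal.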