1. Field of the Invention
This invention relates to a system for teaching spoken languages. More precisely, this invention relates to a system for teaching the prosodic features of a spoken language.
2. Description of the Related Technology
The information in natural spoken languages is transmitted by a combination of articulatory variations and prosodic variations (or prosody). In English and most European and Middle Eastern languages, the articulatory variations, often referred to as the segmental features of speech, generally determine which sounds of the language (phonemes) are being represented, and therefore also determine which words are being represented. In such languages, prosodic variables, sometimes referred to as suprasegmental features, such as sound intensity (loudness), intonation (the variation of voice pitch on the perceptual level, or voice fundamental frequency on a physical level), variations in the rate of speech (usually referred to as the 'duration' variable), and voice source quality, primarily determine the stress patterns in a sentence and convey emotional factors, and secondarily interact with the segmental features to influence meaning. However, in the so-called tone languages such as Thai or Mandarin Chinese, the intonation pattern is also an important determinant of the word being pronounced.
If the patterns of the prosodic features in a new language differ markedly from those of a learner's previous languages, the proper control of prosodic parameters such as intonation and intensity may be difficult to learn, especially for an adult learner. Thus the accent of a foreign speaker often includes intonation and stress pattern errors that affect naturalness and sometimes understandability. In addition, there are many persons who require coaching in their native language to improve their use of prosody during speech or singing.
To aid the learner of a new language, speakers with voice or speech problems related to prosody, or deaf learners of an oral language, a number of systems have been proposed to convert prosodic variables into a feedback display that uses a sense modality other than hearing. The variables of intonation and intensity can be measured readily from the acoustic speech waveform or, in the case of intonation, from a contact microphone on the neck, an airflow measuring mask, or a device known as an electroglottograph. Examples of such devices are described in Martin Rothenberg, Measurement of Airflow in Speech, 20 J. Speech and Hearing Res. 155-76 (1977); Martin Rothenberg, A Multichannel Electroglottograph, 6 J. of Voice 36-43 (1992); and U.S. Pat. Nos. 3,345,979, 4,862,503 and 4,909,261.
Although the tactile sense modality has been used for feedback, especially in speech training for profoundly deaf learners, as in Martin Rothenberg and Richard D. Molitor, Encoding Voice Fundamental Frequency into Vibrotactile Frequency, 66 J. of the Acoustical Soc'y of Am. 1029-38 (1979), the visual sense modality is more commonly employed. Various systems for providing visual feedback for the teaching of intonation are described by Dorothy M. Chun, Teaching Tone and Intonation With Microcomputers, CALICO J., Sept. 1989, at 21-46. On a visual display, the time variable is most often encoded into a spatial variable, usually horizontal distance, and the intonation or the intensity of the voice signal is encoded into a visual variable such as vertical distance, line width, image brightness or image color value. The resulting visual display is usually held on the screen using a memory function, for post-production viewing and analysis by the speaker or a teacher and for possible comparison with a model trace or display. The state of the art may include scrolling of the display if the speech segment is too long for the display, and provisions for time and amplitude normalization between the speaker's trace and the model trace.
However, the transformations between the auditory sensations of pitch and loudness (the primary perceptual correlates of intonation frequency and acoustic intensity or amplitude, respectively) and visual variables representing intonation frequency and time, or amplitude and time, are not easy or natural for many learners. The visual display often highlights details that are irrelevant to auditory perception. These irrelevant details are sometimes referred to as microintonation. G. W. G. Spaai, et al., A Visual Display for the Teaching of Intonation to Deaf Persons: Some Preliminary Findings, 16 J. of Microcomputer Applications 277-286 (1993). In addition, a major problem in the use of such visual displays is the fact that natural speech contains many gaps in the voice stream caused by unvoiced sounds, certain occlusive consonants, and short pauses. Such microintonation and gaps in the stream of voice are readily ignored by the auditory system when judging the patterning of voice pitch or loudness in speech or singing. A number of automated systems have been proposed for simplifying visual displays of intonation in order to circumvent these factors, such as G. W. G. Spaai, et al., A Visual Display for the Teaching of Intonation to Deaf Persons: Some Preliminary Findings, 16 J. of Microcomputer Applications 277-286 (1993); and S. Hiller, et al., SPELL: An Automated System for Computer-Aided Pronunciation Teaching, 13 Speech Comm. 463-73 (1993). However, a visual display constructed according to the present art remains difficult to interpret without special training.
FIG. 1 illustrates this difficulty with a trace of an intonation pattern that might be typical for the English sentence "We put the apple in." The time variable is measured on horizontal axis 50, and the intonation variable is measured on vertical axis 60.
In this system, each prosodic variable that is the subject of instruction is superimposed or modulated onto either a periodic tone waveform, or a noise-like signal, so that during a replay of this modulated tone, or noise, the user can hear the variation of the modulating variable without interference from the articulatory features of the complete voice waveform. For encoding voice intonation, or combined intonation and intensity, the effect might be similar to hearing a hummed or a single-vowel version of the original phrase or sentence. With this modulated tone or noise input, the user's auditory system can provide the user with the sensation of continuous or smooth pitch or loudness variation even in the presence of microintonation or gaps caused by unvoiced sounds, occlusive consonants, or short pauses.
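The modulation described above can be illustrated with a minimal sketch. It is not the implementation of the invention. It assumes that intonation and intensity have already been measured as one value per fixed-length frame, that unvoiced frames are marked with None, and that the carrier is a plain sine tone. The function name synthesize_tone, the 10 ms frame length, and the 8000 Hz sample rate are hypothetical choices made for illustration only.

```python
import math


def synthesize_tone(f0, intensity, frame_s=0.01, sample_rate=8000):
    """Render a sine tone whose pitch follows f0 and whose level follows intensity.

    f0        -- per-frame fundamental frequency in Hz, or None for unvoiced frames
    intensity -- per-frame amplitude in the range 0..1
    Both lists are assumed (hypothetically) to use the same frame spacing.
    """
    samples_per_frame = int(round(frame_s * sample_rate))
    phase = 0.0
    samples = []
    for hz, amp in zip(f0, intensity):
        for _ in range(samples_per_frame):
            if hz is None:
                # Unvoiced frame: leave a silent gap in the tone.
                samples.append(0.0)
                continue
            # Accumulate phase so the tone stays continuous when pitch changes.
            phase += 2.0 * math.pi * hz / sample_rate
            samples.append(amp * math.sin(phase))
    return samples


# Example: four frames with one unvoiced gap in the third frame.
tone = synthesize_tone([200.0, 210.0, None, 220.0], [0.5, 0.5, 0.5, 0.5])
```

Accumulating phase from sample to sample, rather than recomputing it from the time index, avoids clicks at frame boundaries where the pitch value changes. In this sketch the gaps are left silent, on the assumption that the listener perceives the pitch pattern across them as the text describes.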
Also, by replaying this modulated-sound version of the phrase or sentence simultaneously with a visual encoding of the prosodic variable, the user can learn to interpret the visual display correctly. This learning process can be helped by a slower-than-real-time playback in some cases, so that correspondence between the variation of the auditory parameter and the visual display can be followed more easily.
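One possible way to obtain a slower-than-real-time playback is sketched below, under the assumption that the replayed sound is generated from a per-frame contour of the prosodic variable. Stretching the contour in time before the sound is generated lengthens the playback without lowering the pitch. The function name stretch_contour and the use of linear interpolation are illustrative assumptions, and the contour is assumed to contain numeric values only.

```python
def stretch_contour(values, factor):
    """Return the contour resampled to about factor times its original length.

    values -- per-frame numeric values (for example, pitch in Hz)
    factor -- time-stretch factor; 2.0 yields half-speed playback
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(values)
    if n == 0:
        return []
    m = max(1, int(round(n * factor)))
    if n == 1 or m == 1:
        return [values[0]] * m
    out = []
    for j in range(m):
        # Position of output frame j on the original frame axis.
        pos = j * (n - 1) / (m - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        # Linear interpolation between the two neighbouring original frames.
        out.append(values[i] + frac * (values[i + 1] - values[i]))
    return out


# Example: a two-frame rise stretched to four frames for half-speed playback.
slow = stretch_contour([100.0, 200.0], 2.0)
```

The same stretched contour can drive both the sound and the moving visual trace, which keeps the two in step during the slowed playback.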
It is also envisioned that the visually encoded patterns of the user can be displayed in close proximity to one or more model patterns to facilitate visual comparison. Model patterns can be recorded by an instructor or pre-stored in a system memory, or generated from a system of rules that describe the proper variation of that parameter in the language being learned. A rating of the closeness of match of the user's patterns to the model patterns can also be made available.
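The text does not specify how the closeness-of-match rating is computed. The sketch below shows one hypothetical scheme: each intonation contour is converted to semitones relative to its own mean pitch, both contours are resampled to a common length, and the root-mean-square difference is mapped onto a 0 to 100 scale. The function names, the 50-point resampling, and the 6-semitone tolerance are all assumptions made for illustration, and the contours are assumed to contain voiced (positive) frequency values only.

```python
import math


def to_semitones(f0):
    """Express a contour in semitones relative to its own mean log pitch."""
    logs = [12.0 * math.log2(hz) for hz in f0]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def resample(values, m):
    """Linearly resample a contour to m points (time normalization)."""
    n = len(values)
    if n == 1 or m == 1:
        return [values[0]] * m
    out = []
    for j in range(m):
        pos = j * (n - 1) / (m - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append(values[i] + frac * (values[i + 1] - values[i]))
    return out


def match_rating(user_f0, model_f0, points=50, tolerance_st=6.0):
    """Return a 0..100 rating; 100 means the normalized contours coincide."""
    u = resample(to_semitones(user_f0), points)
    m = resample(to_semitones(model_f0), points)
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, m)) / points)
    # An RMS difference of tolerance_st semitones or more is rated 0.
    return max(0.0, 100.0 * (1.0 - rms / tolerance_st))


# Example: the same rise-fall shape spoken an octave higher than the model.
rating = match_rating([200.0, 240.0, 200.0], [100.0, 120.0, 100.0])
```

Measuring each contour relative to its own mean pitch means a user whose voice is higher or lower than the model voice is not penalized for register, only for a difference in the shape of the pattern.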