1. Field of the Invention
The present invention relates to speech processing and analysis and, more particularly, to a method and apparatus for determining the presence and status of predefined articulation parameters used in generating speech data. The invention further relates to a system for displaying a sectional view of anatomical changes occurring during the speech process, based on variations in the articulatory parameters.
2. Background of the Art
The art of creating proper speech in a given language is perhaps the most complex and difficult of learned behaviors or tasks undertaken. How to speak and understand speech occupies a large part of every child's education, whether schooled or not, because speech is such an important aspect of effective communication.
However, many individuals suffer from physical or mental impairments or impediments which make it more difficult than usual to acquire and maintain "good" speaking skills. Some individuals face re-learning speech skills lost as a result of trauma. Others must acquire a new language, which requires learning new skills that often conflict with already-established speech patterns. If any of these individuals cannot acquire the ability to more effectively communicate, they may experience serious difficulty functioning in social, work, or educational situations. Often speech problems reinforce class distinctions or prejudices, and also have grave economic consequences.
It is, therefore, important to be able to assist many individuals in acquiring proper speech skills beyond the typical scholastic approach. It is also important to accomplish speech training in the most efficient or effective manner possible. Efficiency is important because frustration and boredom with training or therapy regimes can inhibit the learning process. In a sense, progress or success depends on the level of frustration. This holds true for all individuals, from inherently inattentive or active children to overly anxious adults.
However, current speech therapy or training tends to rely on techniques that are laborious, uninvolving, or incomprehensible to the student. One primary training technique is the use of static pictures or representations of the exterior of the vocal tract to show vocalization of various sounds or phonemes. Unfortunately, students have difficulty relating such views to the complex internal (and unseen) anatomical manipulations required for speech. This lack of direct correlation between muscular motion or control and sound output makes it difficult to effectively alter speech patterns.
Constant repetitive exercise with a therapist can help but still fails to overcome the correlation problem. A trained therapist relies on subjective and laborious clinical observations of the trainee or student to formulate an explanation of what the student is doing incorrectly, and what needs changing. Aside from the problem of boredom for the patient or subject, direct correlation between generated speech or sound and vocal tract manipulation is not achieved.
A variety of complex signal processing and spectral display devices have also been used by therapists to establish or record spectral patterns for use as articulation indicators. Unfortunately, spectral displays are generally so complex, and signal analysis approaches require such mastery, that the subject receives no useful feedback or information.
Alternate approaches include the use of computerized spectral templates or lookup tables to which speech data is compared to determine its closest fit and, therefore, the probable articulation process employed. Such approaches, however, are speaker dependent and frequently fail to correctly relate the predetermined, stored data with the speech uttered by a subject.
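The closest-fit comparison described above can be illustrated with a minimal sketch. The template labels, the three-band spectral vectors, and the function name `closest_template` are hypothetical and chosen only for illustration; Euclidean distance is assumed as the fit measure, although actual template systems may use other measures.

```python
import math

def closest_template(spectrum, templates):
    """Return (label, distance) of the stored template nearest to `spectrum`."""
    best_label, best_dist = None, math.inf
    for label, reference in templates.items():
        # Euclidean distance between the input and the stored spectral vector.
        dist = math.sqrt(sum((s - r) ** 2 for s, r in zip(spectrum, reference)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical templates recorded from one reference speaker (made-up values).
templates = {
    "ee": [0.9, 0.2, 0.7],
    "ah": [0.3, 0.9, 0.4],
}

# The reference speaker's own "ee" falls close to the stored "ee" template.
same_speaker = closest_template([0.85, 0.25, 0.65], templates)

# A different speaker's "ee" (made-up, spectrally shifted values) lands
# nearer the "ah" template, so the lookup reports the wrong articulation.
other_speaker = closest_template([0.5, 0.6, 0.5], templates)
```

In this contrived case the second utterance is matched to "ah" even though it was intended as "ee", which is the kind of speaker-dependent failure noted above.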
It is believed that people would improve or alter their speech more easily or more effectively if they had a better understanding of both what an ideal articulatory process should be and what they are apparently doing incorrectly when they utter speech. That is, speech training can be far more effective when the subject sees a direct correlation between sounds generated and the physical processes required. For this and other reasons, such as development of speech entry systems, there has been and continues to be a significant amount of research into understanding the speech process.
Much of this research has sought to establish and quantify articulatory parameters for human speech which could be used to generally improve speech therapy and training techniques. Several signal processing techniques such as linear predictive coding and formant tracking have been developed as a result of articulation research.
The linear predictive coding (LPC) approach utilizes an idealized model of a vocal tract and computations of area functions at discrete points along the model to predict anatomical changes required for generating various sounds. However, the idealized model does not correspond with actual vocal anatomy, but is a mathematical construct that can produce anomalous operating characteristics. The LPC approach also fails to account for factors such as nasality and the variation of formant bandwidth from speaker to speaker. Therefore, this approach has proven unreliable in estimating articulator activity even for a single speaker.
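For illustration only, the following sketch shows one common way LPC analysis is carried out: the autocorrelation method with the Levinson-Durbin recursion, followed by conversion of the resulting reflection coefficients into relative section areas of a lossless-tube model. The function names, the test signal, and the starting area of 1.0 are assumptions made here, and the sign convention of the area ratio differs between references. This is not presented as the specific prior-art implementation criticized above.

```python
def autocorr(x, order):
    """Autocorrelation of x for lags 0..order."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1*z^-1 + ... from autocorrelation r.

    Returns (a, reflection_coefficients, residual_energy).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, ks, err

def tube_areas(ks, first_area=1.0):
    """Relative tube-section areas from reflection coefficients.

    Assumes the ratio (1 - k) / (1 + k) between adjacent sections;
    some references use the reciprocal, depending on sign convention.
    """
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas

# Hypothetical test signal: a decaying exponential, i.e. a one-pole system.
signal = [0.5 ** n for n in range(40)]
a, ks, err = levinson_durbin(autocorr(signal, 2), 2)
areas = tube_areas(ks)
```

The areas produced this way describe sections of a mathematical tube, not measured anatomy, which is why the text above treats the model as an idealized construct.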
The formant tracking approach determines articulation parameters based on the position of formants derived from spectral data. However, there are reliability and reproducibility problems associated with tracking formants in continuous speech data. For example, it is extremely difficult to reliably find formant peaks. In manual formant tracking, the marking of formant tracks is often based on subjective criteria, which also affects reproducibility. At present, formant tracking has proven to be too unreliable to support consistent and accurate estimation of articulatory features.
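The difficulty of finding formant peaks can be seen in a minimal peak-picking sketch. The spectral envelope values below are made up for illustration, and the simple local-maximum rule in `pick_peaks` is an assumption; practical formant trackers apply additional smoothing and continuity constraints.

```python
def pick_peaks(envelope):
    """Return indices of local maxima in a sampled spectral envelope."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]]

# Hypothetical smooth envelope with two formant-like peaks.
clean = [0, 2, 5, 2, 1, 3, 6, 3, 1]

# The same envelope with a small ripple on the rising edge of the second peak.
rippled = [0, 2, 5, 2, 1, 3, 2, 6, 3, 1]

clean_peaks = pick_peaks(clean)      # indices 2 and 6
rippled_peaks = pick_peaks(rippled)  # indices 2, 5 and 7
```

A single small ripple adds a spurious candidate at index 5, so a tracker must decide which peaks are true formants, and that decision is where the reliability and reproducibility problems described above arise.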
All of these and other problems have limited the progress of incorporating automatic signal processing into speech training and therapy. What is needed is a method and apparatus for determining the status of articulatory parameters that operate substantially in real time. It is also desirable to have a method of determining the status of articulatory parameters that provides dynamic visual feedback, is not speaker dependent, and can accommodate a large variety of subjects.