1. Technical Field
The present invention relates to methods and apparatus for converting acoustic signals to signals capable of being perceived by tactile sensing. More particularly, the present invention relates to wearable tactile devices for deaf individuals who have essentially no useful input from their own damaged auditory system. The primary utility of the present invention is to provide a means whereby such deaf individuals have access to the world of sound, both for communication purposes and, in the special case of prelingually deafened children (i.e., those either born deaf or losing their hearing before attaining significant language capability), for learning speech and language at as early an age as possible and in as nearly natural a way as possible. As such, it is clear that a great emphasis for devices in this class must be on cosmetics and ease of use, as well as processing methods, since continuous wearing is a requirement if the above goals are to be met.
2. Discussion of the Prior Art
Before discussing the prior art in detail, it is important to note that, in addition to tactile sensing, there is a somewhat related class of devices called cochlear implants. As the name implies, these are devices having a stimulator implanted directly into the inner ear of a deaf individual for the purpose of obtaining some degree of hearing restoration. Generally speaking, an important distinction between tactile devices and cochlear implant devices is that cochlear implants attempt to restore some degree of the lost sense of hearing while tactile devices attempt to substitute the tactile sensory mode for the lost hearing sense. Hence, the transducers for both types of devices differ markedly, and the processing must also differ in order to accommodate the different transducer functions and different sensory modes.
Tactile aid technology dates back to 1927; however, only developments since 1986 are relevant since it is within that time frame that devices with the necessary compactness and convenience to be worn continuously have made their appearance. All such devices presently in use have only one or at most two channels in spite of the fact that it has long been recognized that it is desirable to use multiple channels, each functioning in some way to analyze speech waveforms and other acoustical events according to one of a number of possible paradigms, and to present detailed information about the sound events via a multiplicity of skin exciters. It is generally understood that only in this way can enough detailed information be presented to a user sufficiently clearly and uniquely to enable him to eventually learn to recognize and appreciate the sound events in a manner analogous to that by which the intact hearing system receives, analyzes and appreciates the same sound events.
Although there have been numerous attempts to provide a tactile aid according to the goals described above, a number of limiting factors have prevented success. These factors include: the relative insensitivity of the tactile sensory system, resulting in the need for large battery supplies; the limited bandwidth of the tactile sensory system (i.e., 800 Hz at most) compared to a bandwidth of at least 3500 Hz required to transmit speech data; the very poor ability of the tactile system to differentiate among different frequencies even within its narrow bandwidth range; and the relatively small dynamic range capability of the tactile system compared to a normal ear (40 dB compared to over 100 dB). Recent efforts in the tactile aid field have generally focused on attempts to realize solutions in each of the above-described problem areas. While it may be said that there has been some degree of success, in terms of both hardware developments and processing methods, no one has come up with a sufficiently compact and effective wearable design. The prior art has generally utilized one of the three analysis paradigms described below in designing multiple channel devices.
The first and most straightforward of these analytic techniques involves dividing the spectral band from about 100 Hz to 5000 Hz into between five and sixteen bands, detecting and filtering the output signals from each sub-band, and using the resulting envelopes to control the intensity of a group of low frequency oscillators typically running at approximately 250 Hz. Each of these amplitude modulated signals is used to drive a corresponding skin excitation transducer. While laboratory models embodying this approach have resulted in some success, the method is very inefficient in terms of power requirements. An example of a system employing this method is the Queens Vocoder, developed at Queens University in Ontario. This system is the most commonly studied laboratory instrument, but a design has never been reproduced in wearable form.
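The band-splitting scheme described above can be illustrated with a minimal sketch. It is not the Queens Vocoder or any other prior art circuit; the log-spaced band edges, the second-order band-pass sections, the 25 Hz envelope smoothing cutoff and all function names are assumptions chosen only to show the signal flow: split into bands, detect and filter each band, then amplitude modulate a 250 Hz carrier with each envelope.

```python
import math

def band_edges(f_lo=100.0, f_hi=5000.0, n_bands=8):
    """Log-spaced band edges over f_lo..f_hi (the spacing is an assumption)."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [(f_lo * ratio ** k, f_lo * ratio ** (k + 1)) for k in range(n_bands)]

def bandpass(x, fs, lo, hi):
    """Second-order band-pass section (standard biquad form, unity peak gain)."""
    f0 = math.sqrt(lo * hi)            # geometric center of the band
    q = f0 / (hi - lo)                 # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, xn, y1, yn
        y.append(yn)
    return y

def envelope(x, fs, cutoff=25.0):
    """Detect (full-wave rectify) and filter (one-pole low-pass) a band signal."""
    a = math.exp(-2.0 * math.pi * cutoff / fs)
    env, state = [], 0.0
    for xn in x:
        state = a * state + (1.0 - a) * abs(xn)
        env.append(state)
    return env

def vocoder_channels(x, fs, n_bands=8, carrier_hz=250.0):
    """Return one (envelope, transducer drive) pair per sub-band."""
    channels = []
    for lo, hi in band_edges(n_bands=n_bands):
        env = envelope(bandpass(x, fs, lo, hi), fs)
        # Each envelope sets the intensity of a fixed 250 Hz oscillator.
        drive = [e * math.sin(2.0 * math.pi * carrier_hz * n / fs)
                 for n, e in enumerate(env)]
        channels.append((env, drive))
    return channels

if __name__ == "__main__":
    fs = 16000
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(3200)]
    channels = vocoder_channels(tone, fs)
    strongest = max(range(len(channels)), key=lambda k: channels[k][0][-1])
    print("strongest channel index:", strongest)
```

With eight log-spaced bands, a 1000 Hz test tone falls in the band spanning roughly 707 Hz to 1153 Hz (index 4), so that channel carries the largest envelope. Every channel drives its carrier continuously whenever its band contains energy, which is consistent with the power inefficiency noted above.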
The second analytic approach is to focus only on that portion of speech commonly called "voicing" which, as is typical for this kind of processing, is displayed similarly to the above-described spectral method across some number of skin transducers, typically eight to sixteen. Voicing, in essence, is that part of the speech sound generated by glottal pulses of air from the "voice box" and is only present in voiced sounds. The reasoning behind this second approach is that it is the voicing portion of speech sound that contains information on pitch changes and inflections that are not discernible through lip reading. Hence, this second method is not of itself intended for understanding speech, but instead is an aid to lip reading. Two problems are inherent in this method. First, it is very difficult to obtain accurate information about voicing frequencies in a normal setting with background noise; and second, the two research groups using this approach have not seriously addressed the wearability issue, particularly as regards the important skin transducers. An example of this approach is found in U.S. Pat. No. 4,581,491 (Boothroyd).
The third analytic approach, of which the present invention is an improvement, has potential for greater efficiency and, depending on the realization technique, has greater resolution accuracy than the other two approaches. It is generically termed "feature recognition". In this approach, the processor recognizes a number of specific known characteristics of speech which are then properly formatted and tactually displayed on the skin transducers. The general method of display is broadly similar to the other two methods in that the signals are detected, encoded and displayed as fixed frequency amplitude modulated excitations of each of the transducers; however, the meaning of the display is somewhat different. In the present invention, the features on which the processor is focused are: (1) formants, characterizing and defining the different vowel sounds of speech; (2) glottal pulse rate, representing pitch information and distinguishing between so-called "voiced" and "unvoiced" sounds; and (3) dynamic temporal characteristics of the speech envelope, defining and differentiating between consonants as well as providing additional information about vowel sounds. Depending on implementation details, an additional feature, relating to frequencies in the signal that have lower energy content than the formant frequencies and are not directly related to them, can be encoded and presented in the output display in a manner to provide additional information to the user.
Although feature recognition has not been successfully embodied in tactile aids prior to the present invention, some aspects of feature recognition processing are employed in the field of cochlear implants. An example of this is found in U.S. Pat. No. 4,515,158 (Patrick et al.).
The following brief description of each of the speech features recognized in the third approach will be helpful for a proper understanding of the present invention.
Glottal pulses have been discussed briefly above. They represent the pitch of a sound and often are called pitch rate. An example of their effect on speech and of their meaning is found in the rising inflection characterizing the end of a question. In some languages, such as Chinese, inflections completely change the meaning of an otherwise identical phrase, and while this occasionally occurs in English, it is not the general rule. An additional function of glottal pulses in speech is to distinguish between certain consonant sounds that are otherwise similar except for the presence or absence of glottal pulses. An example would be the difference between the unvoiced "t" as in tin and the voiced "d" as in din.
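By way of illustration only, one common way to estimate a glottal pulse rate and to make the voiced/unvoiced distinction is to look for a strong peak in the autocorrelation of a short frame. The sketch below assumes that technique; it is not stated to be the method of the present invention, and the function name, the 60 Hz to 400 Hz search range and the 0.5 voicing threshold are hypothetical choices.

```python
def glottal_rate(frame, fs, f_min=60.0, f_max=400.0, voiced_threshold=0.5):
    """Estimate the pulse rate of a frame in Hz, or return None if it looks unvoiced.

    The frame is compared with delayed copies of itself; a voiced sound
    repeats once per glottal pulse, so the delay giving the strongest
    normalized match is taken as the pulse period.
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return None                      # silence: nothing to measure
    lag_min = int(fs / f_max)            # shortest period searched
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag]
                for n in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_r < voiced_threshold:
        return None                      # no clear periodicity: treat as unvoiced
    return fs / best_lag

if __name__ == "__main__":
    fs = 8000
    # Idealized pulse train with one pulse every 64 samples (125 pulses per second).
    pulses = [1.0 if n % 64 == 0 else 0.0 for n in range(512)]
    print(glottal_rate(pulses, fs))
```

Under these assumptions a voiced "d" would yield a rate while the otherwise similar unvoiced "t" would yield None. The difficulty noted earlier regarding background noise applies to a simple estimator of this kind as well.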
Formants are frequencies at which maximum energies are found in response to certain speech sounds. Generally they are associated with vowels and diphthongs and are generally three in number, although the first two (i.e., the lowest frequency two) are the most important ones. The formants occur because the glottal pulse excites certain parts of the mouth cavities formed by the teeth, tongue, larynx and nasal cavities. A great deal of work in the speech field has shown that formants, particularly the first two, can be used to classify the vowel sounds, although not perfectly. It is known to make a plot of the first two formants, plotting the first formant (F1) along the X-axis and the second formant (F2) along the Y-axis (see FIG. 7). The plane of the plot is called "vowel space". Although representation of vowel sounds in this manner results in considerable overlap, and although the vowels of different individuals vary, the definition obtained by this technique is quite good and can be used dynamically, either by computers or by the best computer of all, namely the human central nervous system, to recognize vowel sounds.
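The use of vowel space for classification can be sketched as a nearest-center lookup in the plane of the first two formants. The sketch is an illustration under stated assumptions and does not reproduce FIG. 7: the three formant centers are approximate, commonly cited adult male averages supplied here for illustration only, and the dictionary, the function name and the log-frequency distance measure are hypothetical.

```python
import math

# Approximate (first formant, second formant) centers in Hz.
# Illustrative values only; real speakers vary and vowel regions overlap.
VOWEL_CENTERS = {
    "ee": (270.0, 2290.0),   # as in "beet"
    "ah": (730.0, 1090.0),   # as in "father"
    "oo": (300.0, 870.0),    # as in "boot"
}

def classify_vowel(f1, f2, centers=VOWEL_CENTERS):
    """Return the label of the vowel center nearest to the point (f1, f2).

    Distance is measured on a log-frequency scale so that a given
    percentage change counts equally along both formant axes.
    """
    def distance(center):
        c1, c2 = center
        return math.hypot(math.log(f1 / c1), math.log(f2 / c2))
    return min(centers, key=lambda name: distance(centers[name]))

if __name__ == "__main__":
    print(classify_vowel(280.0, 2250.0))
    print(classify_vowel(700.0, 1100.0))
```

Because of the overlap between vowel regions, a hard nearest-center decision of this kind will sometimes err; it serves only to make concrete the idea of locating a sound in vowel space.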
The dynamic temporal characteristics of speech envelopes, specifically amplitude, duration, attack time and release time, are often the major defining features of consonants and function to frame utterances, providing important rhythm cues.
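The four envelope characteristics named above can be measured from a sampled envelope as in the following minimal sketch. The 10 percent and 90 percent thresholds used to define attack and release, the function name and the dictionary keys are assumptions made for illustration; no particular definition is prescribed here.

```python
def envelope_features(env, fs, lo_frac=0.1, hi_frac=0.9):
    """Measure amplitude, duration, attack time and release time of one envelope.

    env is a list of non-negative envelope samples taken at fs samples
    per second. Attack is the time to rise from lo_frac to hi_frac of the
    peak; release is the time to fall back from hi_frac to lo_frac.
    """
    peak = max(env)
    if peak <= 0.0:
        return None                      # no sound event in this envelope
    lo, hi = lo_frac * peak, hi_frac * peak
    last = len(env) - 1
    first_lo = next(i for i, v in enumerate(env) if v >= lo)
    first_hi = next(i for i, v in enumerate(env) if v >= hi)
    last_hi = last - next(i for i, v in enumerate(reversed(env)) if v >= hi)
    last_lo = last - next(i for i, v in enumerate(reversed(env)) if v >= lo)
    return {
        "amplitude": peak,
        "duration_s": (last_lo - first_lo) / fs,
        "attack_s": (first_hi - first_lo) / fs,
        "release_s": (last_lo - last_hi) / fs,
    }

if __name__ == "__main__":
    # Synthetic envelope: fast rise, sustained level, slower decay.
    env = ([0.0] * 10 + [float(v) for v in range(1, 20, 2)] + [20.0] * 50
           + [19.5 - k for k in range(20)] + [0.0] * 10)
    print(envelope_features(env, fs=1000))
```

In the synthetic example the release is longer than the attack; contrasts of this kind between abrupt and gradual onsets and decays are the sort of cue that distinguishes consonants.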
The lower energy frequency elements in speech function to shape the overall speech waveform. If information about these elements is appropriately encoded (as described hereinbelow), it can introduce changes in the output display that respond to small changes in the utterances. The net effect of this is to increase the effective resolution of the display compared to what it would be if only the formant information were extracted.
Finally, a few observations on skin transducers are in order. Initially, the only transducers available were the so-called "bone receivers" used with hearing aids to bypass defective middle ear structures. These devices are notably inefficient, particularly at the lower frequencies used in tactile aids, and also are excessively heavy and large. A transducer design developed by Audiological Engineering Corp. of Somerville, Mass., (assignee of the present invention) for skin excitation at the lower frequencies and at the lower force levels required for tactile applications resulted in a partial solution to this problem and allowed the introduction of a wearable two channel device (Tactaid II). The present invention improves on that design type and adds a number of features including a unique method for mounting a multiplicity of transducers which has never before been addressed in this field.
From the foregoing, it will be appreciated that the present invention is intended to improve on those characteristics of previously described multiple channel wearable tactile devices in several aspects, including: processing methods, power efficiency, transducer design and wearability.