The present invention relates generally to methods and apparatus for processing audio, and more particularly to a method and apparatus for processing audio to improve the listening experience for a broad range of listeners, including hearing impaired listeners.
Over the course of one's life, many factors, such as age, genetics, disease, and environmental effects, compromise one's hearing. Usually, the deterioration is specific to certain frequency ranges.
In addition to permanent hearing impairments, one may experience temporary hearing impairments due to exposure to particularly high sound levels. For example, after target shooting or attending a rock concert one may have a temporary hearing impairment that improves somewhat, but such impairments may accumulate over time into a permanent hearing impairment. Even lower but longer-lasting sound levels, such as those experienced while working in a factory or teaching in an elementary school, may have temporary impacts on one's hearing.
Typically, one compensates for hearing loss or impairment by increasing the volume of the audio. However, this simply increases the volume of all audible frequencies in the total signal. The resulting increase in total signal volume provides little or no improvement in speech intelligibility, particularly for those whose hearing impairment is frequency dependent.
While hearing impairment increases generally with age, many hearing impaired individuals refuse to admit that they are hard of hearing, and therefore avoid the use of devices that may improve the quality of their hearing. While many elderly people begin wearing glasses as they age, a significantly smaller number of these individuals wear hearing aids, despite the significant advances in the reduction of the size of hearing aids. This phenomenon is indicative of the apparent societal stigma associated with hearing aids and/or hearing impairments. Consequently, it is desirable to provide a technique for improving the listening experience of a hearing impaired listener in a way that avoids the apparent associated societal stigma.
Most audio programming, be it television audio, movie audio, or music, can be divided into two distinct components: the foreground and the background. In general, the foreground sounds are the ones intended to capture the audience's attention and retain their focus, whereas the background sounds are supporting, but not of primary interest to the audience. One example of this can be seen in television programming for a “sitcom,” in which the main characters' voices deliver and develop the plot of the story while sound effects, audience laughter, and music fill the gaps.
Currently, the listening audience for all types of audio media is restricted to the mixture decided upon by the audio engineer during production. The audio engineer mixes the background components with the foreground sounds at levels that the audio engineer prefers, or that the audio engineer understands to have some historical basis. This mixture is then sent to the end user as either a single (mono) signal or in some cases as a stereo (left and right) signal, without any means for adjusting the level of the foreground relative to the background.
The lack of this ability to adjust foreground relative to background sounds is particularly difficult for the hearing impaired. In many cases, programming is difficult to understand (at best) due to background audio masking the foreground signals.
There are many new digital audio formats available. Some of these have attempted to provide capability for the hearing impaired. For example, Dolby Digital, also referred to as AC-3 (or Audio Codec version 3), is a compression technique for digital audio that packs more data into a smaller space. The future of digital audio is in spatial positioning, which is accomplished by providing 5.1 separate audio channels: Center, Left and Right, and Left and Right Surround. The sixth channel, referred to as the 0.1 channel, is a limited bandwidth low frequency effects (LFE) channel that is mostly non-directional due to its low frequencies. Since there are 5.1 audio channels to transmit, compression is necessary to ensure that both video and audio stay within certain bandwidth constraints. These constraints (imposed by the FCC) are currently stricter for terrestrial transmission than for DVD. There is more than enough space on a DVD to provide the user with uncompressed audio (much more desirable from a listening standpoint). Video data is most commonly compressed through techniques developed by MPEG (Moving Picture Experts Group), which has also developed an audio compression technique very similar to Dolby's.
The DVD industry has adopted Dolby Digital (DD) as its compression technique of choice. Most DVDs are produced using DD. The ATSC (Advanced Television Systems Committee) has also chosen AC-3 as its audio compression scheme for American digital TV. This choice has spread to many other countries around the world. This means that production studios (movie and television) must encode their audio in DD for broadcast or recording.
There are many features, in addition to the strict encoding and decoding scheme, that are frequently discussed in conjunction with Dolby Digital. Some of these features are part of DD and some are not. Along with the compressed bitstream, DD sends information about the bitstream called metadata, or “data about the data.” It is basically zeros and ones indicating the existence of options available to the end user. Three of these options that are relevant to HEC are dialnorm (dialog normalization), dynrng (dynamic range), and bsmod (bit stream mode that controls the main and associated audio services). The first two are an integral part of DD already, since many decoders handle these variables, giving end users the ability to adjust them. The third parameter, bsmod, is described in detail in ATSC document A/54 (not a Dolby publication) but also exists as part of the DD bitstream. The value of bsmod alerts the decoder about the nature of the incoming audio service, including the presence of any associated audio service. At this time, no known manufacturers are utilizing this parameter. Multiple language DVD performances are provided via multiple complete main audio programs, each carried on one of the eight available audio tracks on the DVD.
The dialnorm parameter is designed to allow the listener to normalize all audio programs relative to a constant voice level. Between channels, and between program and commercial, overall audio levels fluctuate wildly. In the future, producers will be asked to insert the dialnorm parameter, which indicates the level (SPL) at which the dialog has been recorded. If this value is set as 80 dB for a program but 90 dB for a commercial, the television will decode that information, examine the level the user has entered as desirable (say 85 dB), and adjust the program up 5 dB and the commercial down 5 dB. This is a total volume level adjustment that is based on what the producer enters as the dialnorm bit value.
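By way of illustration only, the level arithmetic in the example above can be sketched as follows. The function names and the treatment of levels as plain decibel figures are assumptions made for this sketch; it is not the decoding procedure of document A/52.

```python
def dialnorm_gain_db(dialog_level_db: float, target_level_db: float) -> float:
    """Gain in dB that moves the signalled dialog level to the listener's target."""
    return target_level_db - dialog_level_db


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude scale factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain_db):
    """Scale every sample by the same factor (a total volume adjustment)."""
    scale = db_to_linear(gain_db)
    return [s * scale for s in samples]


target_db = 85.0                                       # level entered by the listener
program_gain = dialnorm_gain_db(80.0, target_db)       # program is raised
commercial_gain = dialnorm_gain_db(90.0, target_db)    # commercial is lowered
```

Because `apply_gain` scales every sample alike, dialog and background move together, which is the sense in which the adjustment is a total volume level adjustment.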
A section from the AC-3 description (from document A/52) provides the best description of the dynrng parameter. “The dynrng values typically indicate gain reduction during the loudest signal passages, and gain increase during the quiet passages. For the listener, it is desirable to bring the loudest sounds down in level towards the dialog level, and the quiet sounds up in level, again towards dialog level. Sounds which are at the same loudness as the normal spoken dialogue will typically not have their gain changed.”
The dynrng variable provides the user with an adjustable parameter that will control the amount of compression occurring on the total volume with respect to the dialog level. This essentially limits the dynamic range of the total audio program about the mean dialog level. This does not, however, provide any way to adjust the dialog level independently of the remaining audio level.
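The effect described above may be pictured with the following minimal sketch, which pulls passage levels toward the dialog level by a listener-adjustable amount. The linear rule, the `amount` parameter, and the function name are assumptions chosen for illustration and are not taken from document A/52.

```python
def compressed_level_db(level_db: float, dialog_level_db: float, amount: float) -> float:
    """Move a passage level toward the dialog level.

    amount = 0.0 leaves the level unchanged; amount = 1.0 moves it all the
    way to the dialog level (hypothetical control, for illustration only).
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must lie between 0.0 and 1.0")
    return level_db - amount * (level_db - dialog_level_db)


dialog_db = 80.0
loud_passage = compressed_level_db(100.0, dialog_db, 0.5)   # brought down toward dialog
quiet_passage = compressed_level_db(60.0, dialog_db, 0.5)   # brought up toward dialog
dialog_passage = compressed_level_db(80.0, dialog_db, 0.5)  # left where it is
```

In this sketch the dialog level is only a reference point. No setting of `amount` raises or lowers the dialog relative to sounds that are at the same level.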
One attempt to improve the listening experience of hearing impaired listeners is provided for in the ATSC Digital Television Standard (Annex B). Section 6 of Annex B of the ATSC standard describes the main audio services and the associated audio services. An AC-3 elementary stream contains the encoded representation of a single audio service. Multiple audio services are provided by multiple elementary streams. Each elementary stream is conveyed by the transport multiplex with a unique PID. There are a number of audio service types which may be individually coded into each elementary stream. One of the audio service types is called the complete main audio service (CM). The CM type of main audio service contains a complete audio program (complete with dialogue, music and effects). The CM service may contain from 1 to 5.1 audio channels. The CM service may be further enhanced by means of the other services. Another audio service type is the hearing impaired service (HI). The HI associated service typically contains only dialogue which is intended to be reproduced simultaneously with the CM service. In this case, the HI service is a single audio channel. As stated therein, this dialogue may be processed for improved intelligibility by hearing impaired listeners. Simultaneous reproduction of both the CM and HI services allows the hearing impaired listener to hear a mix of the CM and HI services in order to emphasize the dialogue while still providing some music and effects. Besides providing the HI service as a single dialogue channel, the HI service may be provided as a complete program mix containing music, effects, and dialogue with enhanced intelligibility. In this case, the service may be coded using any number of channels (up to 5.1). While this service may improve the listening experience for some hearing impaired individuals, it certainly will not do so for those who do not employ the prescribed receiver for fear of being stigmatized as hearing impaired.
Finally, any processing of the dialogue for hearing impaired individuals prevents the use of this channel in creating an audio program for non-hearing impaired individuals. Moreover, the relationship between the HI service and the CM service set forth in Annex B remains undefined with respect to the relative signal levels of each used to create a channel for the hearing impaired.
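Since the relative levels are left undefined, any simultaneous reproduction of the two services requires the receiver to choose a mixing gain. The following sketch is offered only as an illustration under that assumption. The `hi_gain_db` parameter and the function name are hypothetical, and each service is reduced to a single channel of samples.

```python
def mix_services(cm_samples, hi_samples, hi_gain_db):
    """Add a single-channel HI dialogue service to a single-channel CM service.

    hi_gain_db is a hypothetical receiver setting; Annex B, as characterized
    above, does not say what value it should take.
    """
    if len(cm_samples) != len(hi_samples):
        raise ValueError("services must cover the same number of samples")
    hi_scale = 10.0 ** (hi_gain_db / 20.0)
    return [cm + hi_scale * hi for cm, hi in zip(cm_samples, hi_samples)]


cm = [0.10, 0.20, -0.10]      # complete main mix: dialogue, music and effects
hi = [0.05, 0.05, -0.05]      # dialogue-only associated service
emphasized = mix_services(cm, hi, hi_gain_db=6.0)
```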
Other techniques have been employed to attempt to improve the intelligibility of audio. For example, U.S. Pat. No. 4,024,344 discloses a method of creating a “center channel” for dialogue in cinema sound. The technique disclosed therein correlates left and right stereophonic channels and adjusts the gain on either the combined and/or the separate left or right channel depending on the degree of correlation between the left and right channel. The assumption is that a strong correlation between the left and right channels indicates the presence of dialogue. The center channel, which is the filtered summation of the left and right channels, is amplified or attenuated depending on the degree of correlation between the left and right channels. The problem with this approach is that it does not discriminate between meaningful dialogue and simple correlated sound, nor does it address unwanted voice information within the voice band. Therefore, it cannot improve the intelligibility of all audio for all hearing impaired individuals.
The separation of voice from background audio in television signals is discussed by Shiraki in U.S. Pat. No. 5,197,100. The technique employed by Shiraki involves the use of band pass filtering in combination with summing and subtracting circuits to form a “voice channel” that would be differentiated from the rest of the audio programming. The limitation of this approach is that the band pass filter only discriminates frequencies within a predetermined range, in this case 200 Hz to 500 Hz. It cannot discriminate between voice and background audio that may happen to fall within the band pass frequency. Furthermore, the application of band pass filtering cannot distinguish between relevant and irrelevant speech components within an audio signal.
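The limitation noted above can be illustrated with a simplified sketch of correlation-driven center-channel gain. This is not the circuit of U.S. Pat. No. 4,024,344: the filtering stage is omitted, the gain law is a hypothetical one, and sine tones stand in for dialogue and music.

```python
import math


def correlation(left, right):
    """Normalized zero-lag correlation of two equal-length frames, in [-1, 1]."""
    if len(left) != len(right):
        raise ValueError("frames must have equal length")
    num = sum(a * b for a, b in zip(left, right))
    den = math.sqrt(sum(a * a for a in left) * sum(b * b for b in right))
    return num / den if den > 0.0 else 0.0


def center_channel(left, right):
    """Sum left and right, scaled by a gain that rises with their correlation."""
    gain = max(0.0, correlation(left, right))  # hypothetical gain law
    return [gain * 0.5 * (a + b) for a, b in zip(left, right)]


def tone(cycles, n=64):
    """One frame holding a whole number of sine cycles."""
    return [math.sin(2.0 * math.pi * cycles * i / n) for i in range(n)]


dialogue = tone(3)   # stand-in for dialogue present equally in both channels
music = tone(5)      # stand-in for music present equally in both channels

# Each signal is identical in left and right, so each is fully correlated
# and each reaches the center channel at full gain.
dialogue_center = center_channel(dialogue, dialogue)
music_center = center_channel(music, music)

# Left and right content that is uncorrelated is suppressed instead.
wide_center = center_channel(dialogue, music)
```

The correlation measure responds to how similar the two channels are, not to what they carry, so correlated music is passed to the center channel exactly as correlated dialogue is.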
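The band pass limitation can be illustrated with an idealized digital filter. The sketch below models only the band pass stage, not Shiraki's summing and subtracting circuits, and the sample rate, frame length, and tone frequencies are assumptions chosen so that each tone falls on an exact frequency bin.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def band_pass(x, sample_rate, low_hz, high_hz):
    """Idealized band pass: zero every DFT bin outside [low_hz, high_hz]."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold negative-frequency bins
        if not low_hz <= freq <= high_hz:
            spectrum[k] = 0j
    # Inverse DFT; the input is real, so only the real part is kept.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def tone(freq_hz, sample_rate, n):
    """A sine tone of unit amplitude."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


SAMPLE_RATE = 2000  # Hz, hypothetical
N = 200             # samples, giving 10 Hz per DFT bin

voice = tone(300, SAMPLE_RATE, N)    # stand-in for a voice component
music = tone(400, SAMPLE_RATE, N)    # background that shares the pass band
rumble = tone(50, SAMPLE_RATE, N)    # background outside the pass band

mixture = [v + m + r for v, m, r in zip(voice, music, rumble)]
voice_channel = band_pass(mixture, SAMPLE_RATE, 200, 500)
```

In this sketch the 50 Hz component is removed, but the 400 Hz background component survives alongside the 300 Hz voice stand-in, because the filter selects by frequency alone.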
Means of reducing background noise in audio frames have been discussed by Solve et al. in U.S. Pat. No. 5,485,522 which discloses a speech detector and a noise estimator used to adaptively adjust attenuation to each frame of an audio signal. This and other forms of Adaptive Noise Filtering cannot distinguish between voice and other non-stationary audio in the voice band, such as music or irrelevant voice information. Consequently, the improvement in intelligibility is less than optimum.
Attempts to improve the listening experience by modifying the signal level in the face of large noise variations have been made. For example, U.S. Pat. No. 5,434,922 to Miller et al. discloses a method and system for sound optimization which measures both the music and noise in a vehicle. Miller et al. uses analog to digital conversions and adaptive filtering with algorithms to compensate for the ambient noise background by enhancing the sound signal automatically. This technique cannot compensate for a preferred audio signal that is overwhelmed by the remaining audio signal for a particular listener. In the system of Miller et al., the system merely increases the total signal level in an attempt to overcome the presence of road and engine noise. In most cases, audio that was not intelligible to a particular listener does not become intelligible by merely increasing the signal level.
In general, prior art techniques employing band pass filtering or selective equalization will not remove voice band background or noise within the voice band range from speech components of the audio program. The previously cited inventions of Dolby, Shiraki and Miller et al. have all attempted to modify some content of the audio signal through various signal processing hardware or algorithms, but those methods do not satisfy the individual needs or preferences of different listeners. In sum, all of these techniques provide a less than optimum listening experience for hearing impaired individuals as well as non-hearing impaired individuals.
Finally, in the case of studio recordings, vocals are usually recorded separately and are later mixed with the instrumentals and placed on a single recording track. The end user can therefore adjust only the volume, tone and balance (in the case of stereo), but not the relative signal levels of the voice component and the background component.
The present invention is therefore directed to the problem of developing a system and method for processing audio signals that optimizes the listening experience for hearing impaired listeners, as well as non-hearing impaired listeners, without forcing hearing impaired individuals to feel stigmatized by requiring them to employ special hearing-impaired equipment.