1. Field of the Invention
The invention relates to systems and methods for improving intelligibility of human speech (e.g., dialog) determined by a multi-channel audio signal. In some embodiments, the invention is a method and system for filtering an audio signal having a speech channel and a non-speech channel to improve intelligibility of speech determined by the signal, by determining at least one attenuation control value indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by the non-speech channel, and attenuating the non-speech channel in response to the attenuation control value.
2. Background of the Invention
Throughout this disclosure including in the claims, the term “speech” is used in a broad sense to denote human speech. Thus, “speech” determined by an audio signal is audio content of the signal that is perceived as human speech (e.g., dialog, monologue, singing, or other human speech) upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer). In accordance with typical embodiments of the invention, the audibility of speech determined by an audio signal is improved relative to other audio content (e.g., instrumental music or non-speech sound effects) determined by the signal, thereby improving the intelligibility (e.g., clarity or ease of understanding) of the speech.
Throughout this disclosure including in the claims, the expression “speech-enhancing content” of a channel of a multi-channel audio signal is content (determined by the channel) that enhances the intelligibility or other perceived quality of speech content determined by another channel (e.g., a speech channel) of the signal.
Typical embodiments of the invention assume that the majority of speech determined by a multi-channel input audio signal is determined by the signal's center channel. This assumption is consistent with the convention in surround sound production according to which the majority of speech is usually placed into only one channel (the Center channel), and the majority of music, ambient sound, and sound effects is usually mixed into all the channels (e.g., the Left, Right, Left Surround and Right Surround channels as well as the Center channel).
Thus, the center channel of a multi-channel audio signal will sometimes be referred to herein as the “speech” channel and all other channels (e.g., Left, Right, Left Surround, and Right Surround channels) of the signal will sometimes be referred to herein as “non-speech” channels. Similarly, a “center” channel generated by summing the left and right channels of a stereo signal whose speech is center panned will sometimes be referred to herein as a “speech” channel, and a “side” channel generated by subtracting such a center channel from the stereo signal's left (or right) channel will sometimes be referred to herein as a “non-speech” channel.
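By way of illustration only, the derivation of a “center” channel and a “side” channel from a stereo signal may be sketched as follows. The function name center_and_side, the sample values, and the scaling of the sum by 0.5 are assumptions made for this sketch and are not specified above; the scaling is chosen here so that center-panned content keeps its original level.

```python
def center_and_side(left, right):
    """Derive a 'center' (speech) and a 'side' (non-speech) channel from stereo samples."""
    # Assumption: the sum of left and right is scaled by 0.5 (not stated in the text).
    center = [0.5 * (l + r) for l, r in zip(left, right)]
    # The side channel is the left channel minus the derived center channel.
    side = [l - c for l, c in zip(left, center)]
    return center, side


# Hypothetical input: center-panned speech plus music present only in the left channel.
speech = [0.5, -0.25, 0.125]
music = [0.25, 0.25, -0.5]
left = [s + m for s, m in zip(speech, music)]
right = list(speech)

center, side = center_and_side(left, right)
```

In this sketch the center-panned speech cancels out of the side channel, which retains only a scaled copy of the music; the center channel still contains the speech together with a scaled copy of the music.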
Throughout this disclosure including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout the disclosure including in the claims, the expression “ratio” of a first value (“A”) to a second value (“B”) is used in a broad sense to denote A/B, or B/A, or a ratio of a scaled or offset version of one of A and B to a scaled or offset version of the other one of A and B (e.g., (A+x)/(B+y), where x and y are offset values).
Throughout the disclosure including in the claims, the expression “reproduction” of signals by sound-emitting transducers (e.g., speakers) denotes causing the transducers to produce sound in response to the signals, including by performing any required amplification and/or other processing of the signals.
When speech is heard in the presence of competing sounds (such as listening to a friend over the noise of a crowd in a restaurant), a portion of the acoustic features that signal the phonemic content of the speech (speech cues) are masked by the competing sounds and are no longer available to the listener to decode the message. As the level of the competing sound increases relative to the level of the speech, the number of speech cues that are received correctly diminishes and speech perception becomes progressively more cumbersome until, at some level of competing sound, the speech perception process breaks down. While this relation holds true for all listeners, the level of competing sound that can be tolerated for any speech level is not the same for all listeners. Some listeners, e.g., those with hearing loss due to aging (presbyacusis) or those listening to a language that they acquired after puberty, are less capable of tolerating competing sounds than are listeners with good hearing or those operating in their native language.
The fact that listeners differ in their ability to understand speech in the presence of competing sounds has implications for the level at which ambient sounds and background music in news or entertainment audio are mixed with speech. Listeners with hearing loss or those operating in a foreign language often prefer a lower relative level of non-speech audio than that provided by the content creator.
To accommodate these special needs, it is known to apply attenuation (ducking) to non-speech channels of a multi-channel audio signal, but less (or no) attenuation to the signal's speech channel, to improve intelligibility of speech determined by the signal.
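A minimal sketch of such ducking, offered for illustration only, applies a fixed gain to every channel except the speech channel. The function name duck_non_speech, the channel labels, and the default attenuation of 6 dB are assumptions of this sketch; known ducking systems may instead vary the attenuation over time or frequency.

```python
def duck_non_speech(channels, speech_channel="C", attenuation_db=6.0):
    """Attenuate every channel except the speech channel by attenuation_db decibels."""
    # Convert the attenuation in decibels to a linear gain below 1.0.
    gain = 10.0 ** (-attenuation_db / 20.0)
    ducked = {}
    for name, samples in channels.items():
        if name == speech_channel:
            # The speech channel is passed through without attenuation.
            ducked[name] = list(samples)
        else:
            ducked[name] = [gain * x for x in samples]
    return ducked


# Hypothetical three-channel frame with two samples per channel.
frame = {"L": [0.5, -0.5], "C": [0.25, 0.75], "R": [1.0, 0.0]}
out = duck_non_speech(frame, attenuation_db=20.0)
```

With an attenuation of 20 dB the non-speech samples are scaled by a linear gain of about 0.1, while the center (speech) samples are unchanged.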
For example, PCT International Application Publication Number WO 2010/011377, naming Hannes Muesch as inventor and assigned to Dolby Laboratories Licensing Corporation (published Jan. 28, 2010), discloses that non-speech channels (e.g., left and right channels) of a multi-channel audio signal may mask speech in the signal's speech channel (e.g., center channel) to the point that a desired level of speech intelligibility is no longer met. WO 2010/011377 describes how to determine an attenuation function to be applied by ducking circuitry to the non-speech channels in an attempt to unmask the speech in the speech channel while preserving as much of the content creator's intent as possible. The technique described in WO 2010/011377 is based on the assumption that content in a non-speech channel never enhances the intelligibility (or other perceived quality) of speech content determined by the speech channel.
The present invention is based in part on the recognition that, while this assumption is correct for the vast majority of multi-channel audio content, it is not always valid. The inventor has recognized that when at least one non-speech channel of a multi-channel audio signal does include content that enhances the intelligibility (or other perceived quality) of speech content determined by the signal's speech channel, filtering of the signal in accordance with the method of WO 2010/011377 can negatively affect the entertainment experience of one listening to the reproduced filtered signal. In accordance with typical embodiments of the present invention, application of the method described in WO 2010/011377 is suspended or modified during times when content does not conform to the assumptions underlying the method of WO 2010/011377.
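The idea of suspending or modifying the attenuation when a non-speech channel carries content similar to that of the speech channel may be sketched as follows, for illustration only. The similarity measure used here (a normalized cross-correlation magnitude), the linear mapping from similarity to attenuation, and the names similarity and controlled_attenuation_db are all assumptions of this sketch and are not taken from WO 2010/011377 or from the text above.

```python
import math


def similarity(speech, non_speech):
    """Assumed similarity measure: normalized cross-correlation magnitude in [0, 1]."""
    dot = sum(s * n for s, n in zip(speech, non_speech))
    energy_s = sum(s * s for s in speech)
    energy_n = sum(n * n for n in non_speech)
    if energy_s == 0.0 or energy_n == 0.0:
        return 0.0  # A silent channel is treated as entirely dissimilar.
    return min(1.0, abs(dot) / math.sqrt(energy_s * energy_n))


def controlled_attenuation_db(speech, non_speech, max_attenuation_db=6.0):
    """Reduce the nominal ducking amount as the two channels become more similar."""
    # Assumed mapping: full ducking at similarity 0, no ducking at similarity 1.
    return max_attenuation_db * (1.0 - similarity(speech, non_speech))


# Hypothetical frames: one non-speech frame identical to the speech frame,
# and one that is uncorrelated with it.
speech_frame = [1.0, 0.0, -1.0, 0.0]
similar_frame = [1.0, 0.0, -1.0, 0.0]
unrelated_frame = [0.0, 1.0, 0.0, -1.0]

att_similar = controlled_attenuation_db(speech_frame, similar_frame)
att_unrelated = controlled_attenuation_db(speech_frame, unrelated_frame)
```

In this sketch a non-speech frame that matches the speech frame receives no attenuation, whereas an uncorrelated frame receives the full nominal attenuation.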
There is a need for a method and system for filtering a multi-channel audio signal to improve speech intelligibility in the case that at least one non-speech channel of the audio signal includes content that enhances the intelligibility of speech content in the audio signal's speech channel.