1. Field of the Invention
The invention relates to systems and methods for improving clarity and intelligibility of human speech (dialog) determined by an audio (stereo or multi-channel) signal. In a class of embodiments, the invention is a method and system for improving clarity and/or intelligibility of dialog determined by a stereo input signal by analyzing the input signal to generate filter control values, upmixing the input signal to generate a speech (center) channel and non-speech channels, filtering the speech channel in a peaking filter (steered by at least one of the control values) and attenuating the non-speech channels in a manner also steered by at least some of the control values. Preferably, the control values are generated without use of feedback in a manner including determination of power ratios for pairs of the speech and non-speech channels.
2. Background of the Invention
Throughout this disclosure including in the claims, the term “dialog” is used in a broad sense to denote human speech. Thus, “dialog” determined by an audio signal is audio content of the signal that is perceived as human speech (e.g., dialog, monologue, singing, or other human speech) upon reproduction of the signal by a speaker. In accordance with typical embodiments of the invention, the clarity and/or intelligibility of dialog (determined by an audio signal) is improved relative to other audio content (e.g., instrumental music or non-speech sound effects) determined by the signal.
Typical embodiments of the invention assume that the majority of dialog determined by an input audio signal is either center panned (in the case of a stereo input signal) or determined by the signal's center channel (in the case of a multi-channel input signal). This assumption is consistent with the convention in surround sound production according to which the majority of dialog is usually placed into only one channel (the Center channel), and the majority of music, ambient sound, and sound effects is usually mixed into all the channels (e.g., the Left, Right, Left Surround and Right Surround channels as well as the Center channel).
Thus, the center channel of a multi-channel audio signal will sometimes be referred to herein as the “speech” channel and all other channels (e.g., Left, Right, Left Surround, and Right Surround channels) of the signal will sometimes be referred to herein as “non-speech” channels. Similarly, a “center” channel generated by summing the left and right channels of a stereo signal whose dialog is center panned will sometimes be referred to herein as a “speech” channel, and a “side” channel generated by subtracting such a center channel from the stereo signal's left (or right) channel will sometimes be referred to herein as a “non-speech” channel.
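As a non-limiting illustration, the generation of a “center” (speech) channel and “side” (non-speech) channels from a stereo signal may be sketched as follows. The function name upmix_stereo and the scale factor center_gain (here 0.5) are assumptions introduced only for this sketch and are not specified by the disclosure.

```python
def upmix_stereo(left, right, center_gain=0.5):
    """Derive a center (speech) channel and two side (non-speech) channels.

    center_gain is a hypothetical scale factor applied to the sum L + R;
    the disclosure only states that the channels are summed.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    # Center channel: (scaled) sum of the left and right channels.
    center = [center_gain * (l + r) for l, r in zip(left, right)]
    # Side channels: center channel subtracted from left (or right).
    left_side = [l - c for l, c in zip(left, center)]
    right_side = [r - c for r, c in zip(right, center)]
    return center, left_side, right_side


# Center-panned content (equal in both inputs) lands in the center channel;
# content that differs between the inputs remains in the side channels.
center, left_side, right_side = upmix_stereo([1.0, 0.5], [1.0, -0.5])
```

With the assumed gain of 0.5, the first sample (identical in both inputs) appears only in the center channel, while the second sample (opposite in sign between the inputs) appears only in the side channels.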
Throughout this disclosure including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout the disclosure including in the claims, the expression “ratio” of a first value (“A”) to a second value (“B”) is used in a broad sense to denote A/B, or B/A, or a ratio of a scaled or offset version of one of A and B to a scaled or offset version of the other one of A and B (e.g., (A+x)/(B+y), where x and y are offset values).
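As a non-limiting illustration of a “ratio” in this broad sense, the sketch below computes (A+x)/(B+y) where A and B are mean powers of two channels. The helper names mean_power and offset_ratio, and the particular offset values, are assumptions made for this example only; one plausible use of a small offset y is to keep the ratio defined when the second channel is silent.

```python
def mean_power(samples):
    """Mean of the squared sample values of one channel."""
    return sum(s * s for s in samples) / len(samples)


def offset_ratio(a, b, x=0.0, y=0.0):
    """Ratio of offset values, (A + x) / (B + y)."""
    return (a + x) / (b + y)


# Hypothetical speech and non-speech channel excerpts.
speech = [2.0, -2.0, 2.0, -2.0]
non_speech = [1.0, -1.0, 1.0, -1.0]

# Plain ratio A/B (offsets of zero).
plain = offset_ratio(mean_power(speech), mean_power(non_speech))

# Offset ratio: y > 0 keeps the result finite even if B is zero.
guarded = offset_ratio(mean_power(speech), 0.0, y=1.0)
```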
Throughout the disclosure including in the claims, the expression “reproduction” of signals by speakers denotes causing the speakers to produce sound in response to the signals, including by performing any required amplification and/or other processing of the signals.
Human speech consists of perceived cues. As air is expelled from the lungs, the vocal cords vibrate. As the air escapes, the larynx, mouth and nose modify the acoustic energy to produce a variety of sounds. Vowels have regions of strong harmonic energy with unimpeded airflow. Approximants, fricatives and stops increasingly restrict airflow and have higher-frequency content but weaker energy than do vowels.
As people age, they often lose high-frequency sensitivity in their hearing. Persons with mild hearing loss typically hear better in the lower-frequency ranges (vowels) while hearing with difficulty in the higher-frequency ranges. They may have difficulty differentiating between words that begin with approximants, fricatives and stops. Also for persons with mild hearing loss, hearing speech in the presence of other sound (noise) becomes an issue as hearing loss generally tends to reduce the ability to localize and filter out background noise.
Various methods for processing audio signals to improve speech intelligibility are known. For example, the paper by Villchur, E., entitled “Signal Processing to Improve Speech Intelligibility for the Hearing Impaired”, 99th Audio Engineering Society Convention, September 1995, discusses a common process for compensating for mild hearing loss: boosting higher frequencies with an equalizer or shelving filter, as well as wideband compressing the speech to bring it above the threshold of hearing. As another example, Thomas, I. and Niederjohn, R., in “Preprocessing of Speech for Added Intelligibility in High Ambient Noise”, 34th Audio Engineering Society Convention, March 1968, discuss a shelving or equalization filter to assist non-impaired listeners when the speech is in the presence of noise or when listening at low levels.
It is known to filter an audio signal with a kind of equalization filter known as a peaking filter, to emphasize frequency components of the signal in a frequency range critical to intelligibility of speech, relative to frequency components of the signal outside this frequency range. For example, it is known to use a peaking filter to emphasize frequency components of an audio signal in a range centered on the 3rd formant of speech (F3) relative to frequency components outside such range. F3 can vary from approximately 2300 Hz to 3000 Hz in normal human speech.
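As a non-limiting illustration, a peaking filter of the kind described may be sketched as a second-order (biquad) section. The coefficient formulas below follow the widely used “Audio EQ Cookbook” peaking design, which is one possible choice and is not stated in this disclosure to be the design employed; the center frequency (2650 Hz, within the stated F3 range), gain (6 dB), Q (1.0) and sample rate (48 kHz) are likewise assumptions chosen only for the example.

```python
import cmath
import math


def peaking_coefficients(center_hz, gain_db, q, sample_rate):
    """Return normalized biquad coefficients (b, a) for a peaking filter."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / amp
    cos_w0 = math.cos(w0)
    b = [(1.0 + alpha * amp) / a0, -2.0 * cos_w0 / a0, (1.0 - alpha * amp) / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha / amp) / a0]
    return b, a


def apply_biquad(b, a, samples):
    """Filter samples with a direct-form I biquad (zero initial state)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def magnitude_db(b, a, freq_hz, sample_rate):
    """Magnitude response of the biquad, in dB, at one frequency."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate)
    z2 = z1 * z1
    h = (b[0] + b[1] * z1 + b[2] * z2) / (a[0] + a[1] * z1 + a[2] * z2)
    return 20.0 * math.log10(abs(h))


# Hypothetical settings: emphasize a band centered near F3.
b, a = peaking_coefficients(2650.0, 6.0, 1.0, 48000.0)
boost_at_center = magnitude_db(b, a, 2650.0, 48000.0)  # close to 6 dB
gain_at_dc = magnitude_db(b, a, 0.0, 48000.0)          # close to 0 dB
```

In this design the response equals the requested gain at the center frequency and returns to unity (0 dB) far from it, so frequency components in the emphasized range are raised relative to components outside that range.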
It is also known to apply attenuation (ducking) to non-speech channels of a multi-channel audio signal, but less (or no) attenuation to the signal's speech channel.
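As a non-limiting illustration, such ducking may be sketched as applying a gain of less than unity to each non-speech channel while passing the speech channel unchanged. The function name duck_non_speech, the channel labels, and the fixed attenuation value are assumptions introduced only for this sketch; in practice the amount of attenuation may vary over time.

```python
def duck_non_speech(channels, speech_channel="C", attenuation_db=6.0):
    """Attenuate every channel except the speech channel.

    channels maps a channel label to a list of samples. attenuation_db is
    a hypothetical fixed amount of ducking, expressed in decibels.
    """
    gain = 10.0 ** (-attenuation_db / 20.0)
    ducked = {}
    for name, samples in channels.items():
        if name == speech_channel:
            # Speech channel: no attenuation applied.
            ducked[name] = list(samples)
        else:
            # Non-speech channel: scaled down by the ducking gain.
            ducked[name] = [gain * s for s in samples]
    return ducked


# Hypothetical five-channel frame with one sample per channel.
frame = {"L": [1.0], "R": [1.0], "C": [1.0], "Ls": [1.0], "Rs": [1.0]}
ducked = duck_non_speech(frame, speech_channel="C", attenuation_db=20.0)
```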
There is a need for a method and system for filtering an audio signal to improve dialog intelligibility in an efficient manner, and in a manner implementable with low processor speed (e.g., low MIPS) requirements. Typical embodiments of the present invention achieve improved dialog intelligibility with reduced computational requirements relative to conventional methods and systems designed to improve dialog intelligibility.