In movies and on television, dialog and narrative are often presented together with other, non-speech audio, such as music, effects, or ambiance from sporting events. In many cases the speech and non-speech sounds are captured separately and mixed together under the control of a sound engineer. The sound engineer selects the level of the speech in relation to the level of the non-speech in a way that is appropriate for the majority of listeners. However, some listeners, e.g., those with a hearing impairment, experience difficulties understanding the speech content of audio programs (having engineer-determined speech-to-non-speech mixing ratios) and would prefer if the speech were mixed at a higher relative level.
There exists a problem to be solved in allowing these listeners to increase the audibility of audio program speech content relative to that of non-speech audio content.
One current approach is to provide listeners with two high-quality audio streams. One stream carries primary content audio (mainly speech) and the other carries secondary content audio (the remaining audio program, which excludes speech) and the user is given control over the mixing process. Unfortunately, this scheme is impractical because it does not build on the current practice of transmitting a fully mixed audio program. In addition, it requires approximately twice the bandwidth of current broadcast practice because two independent audio streams, each of broadcast quality, must be delivered to the user.
Another speech enhancement method (to be referred to herein as “waveform-coded” enhancement) is described in US Patent Application Publication No. 2010/0106507 A1, published on Apr. 29, 2010, assigned to Dolby Laboratories, Inc. and naming Hannes Muesch as inventor. In waveform-coded enhancement, the speech to background (non-speech) ratio of an original audio mix of speech and non-speech content (sometimes referred to as a main mix) is increased by adding to the main mix a reduced quality version (low quality copy) of the clean speech signal which has been sent to the receiver alongside the main mix. To reduce bandwidth overhead, the low quality copy is typically coded at a very low bitrate. Because of the low bitrate coding, the low quality copy contains coding artifacts that are clearly audible, and objectionable, when the copy is rendered and auditioned in isolation. Waveform-coded enhancement attempts to hide these coding artifacts by adding the low quality copy to the main mix only during times when the level of the non-speech components is high, so that the coding artifacts are masked by the non-speech components. As will be detailed later, limitations of this approach include the following: the amount of speech enhancement typically cannot be constant over time, and audio artifacts may become audible when the background (non-speech) components of the main mix are weak or when their frequency-amplitude spectrum differs drastically from that of the coding noise.
In accordance with waveform-coded enhancement, an audio program (for delivery to a decoder for decoding and subsequent rendering) is encoded as a bitstream which includes the low quality speech copy (or an encoded version thereof) as a sidestream of the main mix. The bitstream may include metadata indicative of a scaling parameter which determines the amount of waveform-coded speech enhancement to be performed (i.e., the scaling parameter determines a scaling factor to be applied to the low quality speech copy before the scaled, low quality speech copy is combined with the main mix, or a maximum value of such a scaling factor which will ensure masking of coding artifacts). When the current value of the scaling factor is zero, the decoder does not perform speech enhancement on the corresponding segment of the main mix. The current value of the scaling parameter (or the current maximum value that it may attain) is typically determined in the encoder (since it is typically generated by a computationally intensive psychoacoustic model), but it could be generated in the decoder. In the latter case, no metadata indicative of the scaling parameter would need to be sent from the encoder to the decoder, and the decoder instead could determine from the main mix a ratio of power of the mix's speech content to power of the mix and implement a model to determine the current value of the scaling parameter in response to the current value of the power ratio.
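The decoder-side combination described above can be illustrated with a minimal sketch. It assumes time-domain samples held in plain Python lists, one scaling factor per fixed-length segment, and hypothetical names (`waveform_coded_enhance`, `scale_factors`, `segment_len`) that do not come from the referenced publication. The psychoacoustic model that would produce the scaling factors is not modeled here; the factors are simply taken as given.

```python
from typing import List, Sequence


def waveform_coded_enhance(
    main_mix: Sequence[float],
    speech_copy: Sequence[float],
    scale_factors: Sequence[float],
    segment_len: int,
) -> List[float]:
    """Add a scaled low quality speech copy to the main mix, segment by segment.

    All names are hypothetical. scale_factors[k] stands in for the scaling
    factor that the metadata indicates for segment k.
    """
    if len(main_mix) != len(speech_copy):
        raise ValueError("main mix and speech copy must have the same length")
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    num_segments = (len(main_mix) + segment_len - 1) // segment_len
    if len(scale_factors) < num_segments:
        raise ValueError("need one scaling factor per segment")

    enhanced = []
    for n, (mix_sample, speech_sample) in enumerate(zip(main_mix, speech_copy)):
        g = scale_factors[n // segment_len]
        # A factor of zero leaves the segment untouched (no speech enhancement).
        enhanced.append(mix_sample + g * speech_sample)
    return enhanced


main_mix = [0.5, -0.2, 0.1, 0.4]
speech_copy = [0.3, -0.1, 0.05, 0.2]
# Segment 0 is enhanced; segment 1 is passed through unchanged.
print(waveform_coded_enhance(main_mix, speech_copy, [1.0, 0.0], segment_len=2))
```

In this sketch the second segment is returned unchanged because its scaling factor is zero, which mirrors the case in which the decoder performs no speech enhancement on the corresponding segment of the main mix.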
Another method (to be referred to herein as “parametric-coded” enhancement) for enhancing the intelligibility of speech in the presence of competing audio (background) is to segment the original audio program (typically a soundtrack) into time/frequency tiles and boost the tiles according to the ratio of the power (or level) of their speech and background content, to achieve a boost of the speech component relative to the background. The underlying idea of this approach is akin to that of guided spectral-subtraction noise suppression. An extreme example of this approach, in which all tiles with SNR (i.e., ratio of power, or level, of the speech component to that of the competing sound content) below a predetermined threshold are completely suppressed, has been shown to provide robust speech intelligibility enhancements. In the application of this method to broadcasting, the speech to background ratio (SNR) may be inferred by comparing the original audio mix (of speech and non-speech content) to the speech component of the mix. The inferred SNR may then be transformed into a suitable set of enhancement parameters which are transmitted alongside the original audio mix. At the receiver, these parameters may (optionally) be applied to the original audio mix to derive a signal indicative of enhanced speech. As will be detailed later, parametric-coded enhancement functions best when the speech signal (the speech component of the mix) dominates the background signal (the non-speech component of the mix).
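The extreme example above, in which tiles below an SNR threshold are completely suppressed, can be sketched as follows. The sketch assumes that the power of the mix and of its speech component is already available per time/frequency tile as nested lists, and that speech and background are uncorrelated, so that the background power can be approximated as mix power minus speech power. The function names, the 0 dB default threshold, and the small constant `eps` are illustrative assumptions rather than values taken from the text.

```python
import math
from typing import List

Tiles = List[List[float]]  # tiles[t][f]: power in time frame t, frequency band f


def tile_snr_db(mix_power: float, speech_power: float, eps: float = 1e-12) -> float:
    """Infer a tile's speech to background ratio by comparing mix and speech power.

    Assumes uncorrelated speech and background, so background ~= mix - speech.
    """
    background_power = max(mix_power - speech_power, eps)
    return 10.0 * math.log10(max(speech_power, eps) / background_power)


def binary_tile_gains(mix: Tiles, speech: Tiles, threshold_db: float = 0.0) -> Tiles:
    """Encoder side: gain 1 for tiles at or above the threshold, 0 otherwise."""
    return [
        [1.0 if tile_snr_db(m, s) >= threshold_db else 0.0
         for m, s in zip(mix_row, speech_row)]
        for mix_row, speech_row in zip(mix, speech)
    ]


def apply_tile_gains(mix: Tiles, gains: Tiles) -> Tiles:
    """Receiver side: scale each tile's power by the square of its amplitude gain."""
    return [
        [m * g * g for m, g in zip(mix_row, gain_row)]
        for mix_row, gain_row in zip(mix, gains)
    ]


# One time frame with two bands: speech dominates the first, background the second.
mix_tiles = [[1.0, 1.0]]
speech_tiles = [[0.9, 0.1]]
gains = binary_tile_gains(mix_tiles, speech_tiles)
print(gains)
print(apply_tile_gains(mix_tiles, gains))
```

Because each gain scales a whole tile of the mix, any background energy in a retained tile is passed along with the speech, and any speech energy in a suppressed tile is lost. A less extreme variant would map the inferred SNR to intermediate gains instead of a hard 0 or 1 decision.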
Waveform-coded enhancement requires that a low quality copy of the speech component of a delivered audio program is available at the receiver. To limit the data overhead incurred in transmitting that copy alongside the main audio mix, this copy is coded at a very low bitrate and exhibits coding distortions. These coding distortions are likely to be masked by the original audio when the level of the non-speech components is high. When the coding distortions are masked the resulting quality of the enhanced audio is very good.
Parametric-coded enhancement is based on the parsing of the main audio mix signal into time/frequency tiles and the application of suitable gains/attenuations to each of these tiles. The data rate needed to relay these gains to the receiver is low when compared to that of waveform-coded enhancement. However, due to limited temporal-spectral resolution of the parameters, speech, when mixed with non-speech audio, cannot be manipulated without also affecting the non-speech audio. Parametric-coded enhancement of the speech content of an audio mix thus introduces modulation in the non-speech content of the mix, and this modulation (“background modulation”) may become objectionable upon playback of the speech-enhanced mix. Background modulations are most likely to be objectionable when the speech to background ratio is very low.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.