The present invention relates to audio coding/decoding and, particularly, to audio coding/decoding in the context of bandwidth extension (BWE). A well-known implementation of BWE is spectral band replication (SBR), which has been standardized within MPEG (Moving Picture Experts Group).
WO 00/45378 discloses an efficient spectral envelope coding using variable time/frequency resolution and time/frequency switching. An analogue input signal is fed to an A/D converter, forming a digital signal. The digital audio signal is fed to a perceptual audio encoder, where source coding is performed. In addition, the digital signal is fed to a transient detector and to an analysis filter bank, which splits the signal into its spectral representation (subband signals). The transient detector operates either on the subband signals from the analysis filter bank or directly on the digital time domain samples. The transient detector divides the signal into granules and determines whether subgranules within the granules are to be flagged as transient. This information is sent to an envelope grouping block, which specifies the time/frequency grid to be used for the current granule. According to the grid, the block combines uniformly sampled subband signals in order to obtain non-uniformly sampled envelope values. These values might be the average or, alternatively, the maximum energy of the subband samples that have been combined. The envelope values are, together with the grouping information, fed to the envelope encoder block. This block decides in which direction (time or frequency) to encode the envelope values. The resulting signals, the output from the audio encoder, the wideband envelope information, and the control signals are fed to a multiplexer, forming a serial bitstream that is transmitted or stored.
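The grouping step described above can be illustrated by the following minimal sketch. It is not taken from WO 00/45378; the function name group_envelope, the representation of the grid as two lists of borders, and the half-open border convention are assumptions made only for illustration.

```python
from typing import List, Sequence


def group_envelope(
    energies: Sequence[Sequence[float]],
    time_borders: Sequence[int],
    band_borders: Sequence[int],
    mode: str = "mean",
) -> List[List[float]]:
    """Combine uniformly sampled subband energies into a coarser grid.

    energies[t][k] is the energy of subband k in time slot t.
    Borders are assumed half-open: cell i covers [borders[i], borders[i+1]).
    """
    envelope = []
    for t0, t1 in zip(time_borders, time_borders[1:]):
        row = []
        for k0, k1 in zip(band_borders, band_borders[1:]):
            cell = [energies[t][k] for t in range(t0, t1) for k in range(k0, k1)]
            # Either the maximum or the average energy of the combined samples.
            row.append(max(cell) if mode == "max" else sum(cell) / len(cell))
        envelope.append(row)
    return envelope


# Four time slots, four subbands; the upper two subbands jump from 1 to 9
# at time slot 1.
energies = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 9.0, 9.0],
    [1.0, 1.0, 9.0, 9.0],
    [1.0, 1.0, 9.0, 9.0],
]

# A grid with a time border at the jump keeps both energy levels apart.
print(group_envelope(energies, [0, 1, 4], [0, 2, 4]))  # [[1.0, 1.0], [1.0, 9.0]]
# A single time segment smears the jump into one averaged value per band group.
print(group_envelope(energies, [0, 4], [0, 2, 4]))     # [[1.0, 7.0]]
```

In this sketch, the choice of the time borders alone determines whether the energy jump survives in the envelope values.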
On the decoder side, a de-multiplexer restores the signals and feeds the output of the perceptual audio encoder to an audio decoder, which produces a lowband digital audio signal. The envelope information is fed from the de-multiplexer to the envelope decoding block, which, by use of control data, determines in which direction the current envelope is coded and decodes the data. The lowband signal from the audio decoder is routed to a transposition module, which generates an estimate of the original highband signal consisting of one or several harmonics from the lowband signal. The highband signal is fed to an analysis filter bank, which is of the same type as on the encoder side. The subband signals are combined in a scale factor grouping unit. By use of control data from the de-multiplexer, the same type of combination and time/frequency distribution of the subband samples is adopted as on the encoder side. The envelope information from the de-multiplexer and the information from the scale factor grouping unit are processed in a gain control module. The module computes gain factors to be applied to the subband samples prior to reconstruction using a synthesis filter bank block. The output of the synthesis filter bank is thus an envelope-adjusted highband audio signal. The signal is added to the output of a delay unit, which is fed with the lowband audio signal. The delay compensates for the processing time of the highband signal. Finally, the obtained digital wideband signal is converted to an analogue audio signal in a digital-to-analogue converter.
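The gain control step can be sketched as follows. The cited document does not give a formula at this point; the square-root energy ratio used here is a common choice and is an assumption, as are the function names envelope_gains and apply_gains.

```python
import math
from typing import List, Sequence


def envelope_gains(
    target: Sequence[float], measured: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Amplitude gain per band so that the measured energy meets the target.

    target:   transmitted envelope energies (one per band).
    measured: energies of the transposed highband, grouped the same way.
    The sqrt(target / measured) rule is an assumption, not quoted from the source.
    """
    return [math.sqrt(t / max(m, eps)) for t, m in zip(target, measured)]


def apply_gains(
    samples: Sequence[Sequence[float]], gains: Sequence[float]
) -> List[List[float]]:
    """Scale the subband samples of each band by that band's gain factor."""
    return [[g * x for x in band] for g, band in zip(gains, samples)]


gains = envelope_gains(target=[4.0, 1.0], measured=[1.0, 4.0])
print(gains)  # [2.0, 0.5]
print(apply_gains([[1.0, -1.0], [2.0, 2.0]], gains))  # [[2.0, -2.0], [1.0, 1.0]]
```

After scaling, the mean energy per band in this example is 4.0 and 1.0, i.e. equal to the target envelope.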
When sustained chords are combined with sharp transients having mainly high-frequency content, the chords have high energy in the lowband and the transient energy is low, whereas the opposite is true in the highband. The envelope data that is generated during time intervals where transients are present is dominated by the high intermittent transient energy. Typical coders operate on a block basis, where every block represents a fixed time interval. Transient detector lookahead is employed on the encoder side so that envelope data spanning block borders can be processed. This enables a more flexible selection of time/frequency resolutions.
The international standard ISO/IEC 14496-3 discloses a time/frequency grid in Section 4.6.18.3.3, which describes the number of SBR envelopes and noise floors as well as the time segment associated with each SBR envelope and noise floor. Each time segment is defined by a start time border and a stop time border. The time slot indicated by the start time border is included in the time segment, whereas the time slot indicated by the stop time border is excluded from the time segment. The stop time border of a segment equals the start time border of the next segment in the sequence of segments. Thus, time borders of SBR envelopes within an SBR frame are decodable on a decoder side. The corresponding time/frequency grid is determined by the encoder.
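The border convention described above (start border included, stop border excluded, consecutive segments sharing a border) can be sketched as follows. The function name segments_from_borders and the example border values are hypothetical and are not taken from the standard.

```python
from typing import List, Sequence


def segments_from_borders(borders: Sequence[int]) -> List[range]:
    """Turn a list of time borders into half-open time segments.

    Segment i covers the time slots borders[i] .. borders[i+1] - 1, so the
    stop border of one segment is the start border of the next one.
    """
    if any(b >= c for b, c in zip(borders, borders[1:])):
        raise ValueError("time borders must be strictly increasing")
    return [range(start, stop) for start, stop in zip(borders, borders[1:])]


# Illustrative frame of 16 time slots split into three envelopes.
segments = segments_from_borders([0, 4, 6, 16])
print([list(s) for s in segments])
# [[0, 1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
```

Because the stop border is excluded, every time slot of the frame belongs to exactly one segment.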
U.S. Pat. No. 6,453,282 B1 discloses a method and device for detecting a transient in a discrete-time audio signal. An encoder comprises a time/frequency transform device, a quantization/coding device and a bitstream formatting device. The quantization/coding stage is controlled by a psycho-acoustic model stage. The time/frequency transform stage is controlled by a transient detector, where the time/frequency transform is controlled to switch over from a long window to a short window in case of a detected transient. The transient detector performs one of two comparisons. In the first, the energy of a filtered discrete-time audio signal in the current segment is compared with the energy of the filtered discrete-time audio signal in a preceding segment. In the second, a current relationship between the energy of the filtered discrete-time audio signal in the current segment and the energy of the unfiltered discrete-time audio signal in the current segment is formed, and this current relationship is compared with a preceding corresponding relationship. Whether a transient is present in the discrete-time audio signal is detected using one and/or the other of these comparisons.
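The two comparisons can be sketched as follows. This is an illustration under stated assumptions and not the detector of U.S. Pat. No. 6,453,282 B1: the filter is replaced by a simple first difference, and the thresholds energy_factor and ratio_factor as well as the function name detect_transients are hypothetical.

```python
from typing import List, Sequence


def energy(x: Sequence[float]) -> float:
    return sum(v * v for v in x)


def detect_transients(
    signal: Sequence[float],
    seg_len: int,
    energy_factor: float = 4.0,  # hypothetical threshold
    ratio_factor: float = 4.0,   # hypothetical threshold
    eps: float = 1e-12,
) -> List[bool]:
    """Flag segments whose filtered energy, or whose filtered/unfiltered
    energy relationship, rises sharply against the preceding segment."""
    # First difference as a stand-in for the filter (an assumption).
    filtered = [b - a for a, b in zip([0.0] + list(signal), signal)]
    flags: List[bool] = []
    prev_energy = prev_ratio = None
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        e_filt = energy(filtered[start:start + seg_len])
        e_unfilt = energy(signal[start:start + seg_len])
        ratio = e_filt / (e_unfilt + eps)
        if prev_energy is None:
            flags.append(False)  # nothing to compare the first segment with
        else:
            by_energy = e_filt > energy_factor * (prev_energy + eps)
            by_ratio = ratio > ratio_factor * (prev_ratio + eps)
            flags.append(by_energy or by_ratio)
        prev_energy, prev_ratio = e_filt, ratio
    return flags


quiet = [0.1] * 8
click = [0.1, 0.1, 0.1, 1.0, -1.0, 0.1, 0.1, 0.1]
print(detect_transients(quiet + click + quiet, seg_len=8))  # [False, True, False]
```

Only the segment containing the click is flagged; a detector of this kind would then trigger the switch from a long window to a short window.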
The coding of speech signals is particularly demanding because speech comprises not only vowels, which have a predominantly harmonic content in which the majority of the overall energy is concentrated in the lower part of the spectrum, but also a significant amount of sibilants. A sibilant is a type of fricative or affricate consonant, made by directing a jet of air through a narrow channel in the vocal tract towards the sharp edge of the teeth. The term sibilant is often taken to be synonymous with the term strident. The term sibilant tends to have an articulatory or aerodynamic definition involving the production of aperiodic noise at an obstacle. Strident refers to the perceptual quality of intensity as determined by amplitude and frequency characteristics of the resulting sound (i.e. an auditory or possibly acoustic definition).
Sibilants are louder than their non-sibilant counterparts, and most of their acoustic energy occurs at higher frequencies than that of non-sibilant fricatives. [s] has the most acoustic strength at around 8,000 Hz, but can reach as high as 10,000 Hz. [ʃ] has the bulk of its acoustic energy at around 4,000 Hz, but can extend up to around 8,000 Hz. IPA symbols exist for the sibilants, among them the alveolar and post-alveolar sibilants. There also exist whistled sibilants and, depending on the corresponding language, other related sounds.
All these sibilant consonants in speech have in common that, if immediately preceded by a vowel, a strong shift of energy from the low frequency part into the high frequency part takes place. A transient detector which is directed to the detection of an energy increase over time might not be in a position to detect this energy shift. This, however, may not be too problematic in baseband audio coding, in which e.g. a bandwidth extension is not applied, since sibilants normally have a duration which is longer than that of transient events occurring in a very short time context. In baseband coding such as AAC coding, the whole spectrum is encoded with a high frequency resolution. Therefore, an energy shift from the low frequency portion to the high frequency portion need not necessarily be detected due to the comparatively stationary nature of sibilants in speech signals, when the length of a sibilant such as an [s] in the word “sister” is compared to the frame length of a long window function. Furthermore, the high frequency part is encoded with a high bitrate anyway.
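Why a detector looking only for an energy increase can miss such a shift is illustrated by the following sketch. The band energies, the threshold factor and the function name energy_increase_detected are illustrative assumptions and not measured data.

```python
from typing import Sequence


def energy_increase_detected(
    previous: Sequence[float], current: Sequence[float], factor: float = 2.0
) -> bool:
    """Return True if the total energy over all bands rose by more than
    `factor` (a hypothetical threshold) from one segment to the next."""
    return sum(current) > factor * sum(previous)


# Energies as [lowband, highband]; the values are made up for illustration.
vowel = [8.0, 1.0]     # energy concentrated in the lowband
sibilant = [1.0, 8.0]  # same total energy, now concentrated in the highband

print(energy_increase_detected(vowel, sibilant))  # False: total is 9.0 in both
print(sibilant[1] / vowel[1])                     # 8.0: highband energy rose eightfold
```

The total energy is unchanged, so the detector stays silent, although the distribution of the energy over frequency has changed substantially.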
The situation, however, becomes problematic when sibilants occur in the context of bandwidth extension. In bandwidth extension, the low frequency portion is encoded with a high resolution/high bitrate using a baseband coder such as an AAC encoder, and the highband is encoded with a low resolution/low bitrate, typically using only certain parameters such as a spectral envelope whose spectral envelope values have a frequency resolution much lower than the frequency resolution of the baseband spectrum. To state it differently, the spectral distance between two spectral envelope parameters will be larger (e.g. at least ten times) than the spectral distance between the spectral values in the lowband spectrum.
On the decoder side, a bandwidth extension is performed, in which the lowband spectrum is used to regenerate the highband spectrum. When, in such a context, an energy shift from the lowband portion to the highband portion takes place, i.e., when a sibilant occurs, this energy shift will significantly influence the accuracy/quality of the reconstructed audio signal. However, a transient detector looking for an increase (or decrease) in energy will not detect this energy shift, so that spectral envelope data for a spectral envelope frame which covers a time portion before or after the sibilant will be affected by the energy shift within the spectrum. On the decoder side, the result will be that, due to the lack of time resolution, the whole frame will be reconstructed with an average energy in the high frequency portion, i.e., not with the low energy before the sibilant and the high energy after the sibilant. This will result in a decrease of quality of the estimated signal.
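The averaging effect can be made concrete with a small numerical sketch. The highband energies per time slot, the frame length of 16 time slots and the function name segment_means are illustrative assumptions.

```python
from typing import List, Sequence


def segment_means(values: Sequence[float], borders: Sequence[int]) -> List[float]:
    """Average the values over each half-open segment [borders[i], borders[i+1])."""
    return [
        sum(values[start:stop]) / (stop - start)
        for start, stop in zip(borders, borders[1:])
    ]


# Highband energy per time slot: low during the vowel, high during the sibilant.
highband = [0.25] * 8 + [4.0] * 8

# One envelope for the whole frame: the onset of the sibilant is not resolved.
print(segment_means(highband, [0, 16]))     # [2.125]
# A time border at the onset preserves both energy levels.
print(segment_means(highband, [0, 8, 16]))  # [0.25, 4.0]
```

With a single envelope, the reconstructed highband energy in this example is 8.5 times too high before the sibilant and roughly half the true value during the sibilant.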