This invention relates to a method of coding signals and to apparatus for storing, transmitting, receiving or reproducing signals.
A common method of storing audio signals is to use parametric coding, especially at very low bit rates, typically in the region from 6 kbps to 90 kbps. Examples of parametric coding used in this way are given in “Low bit rate high quality audio coding with combined harmonic and wavelet representation” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 2, pp 1045–1048, 1996; “Advances in Parametric Audio Coding” in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp W99-1–W99-4, 1999; and “A 6 kbps to 85 kbps scalable audio coder” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume II, pp 877–880, 2000. Each of these examples describes a parametric audio coder, in which an audio signal is represented by a model, with parameters of the model being estimated and encoded. These examples use a parametric representation of an audio signal based on decomposition of an original signal into three components: a transient component, a tonal (sinusoidal) component, and a noise component. Each component is represented by a corresponding set of parameters, as described in the three documents above. A transient component of an audio signal can be characterized as an isolated element of the audio signal which is relatively short lived, and is represented by a sharp increase in energy of the audio signal.
It has been found that a dedicated model for the transient component of an audio signal is beneficial for parts of audio signals with sharp attacks, because sinusoidal and noise models cannot easily represent such perceptually important events and poor modeling can result in audible artifacts such as a pre-echo. A pre-echo occurs when the modeling error spreads the transient event over the samples before the beginning of the transient and the resulting distortion is large enough to become audible. This spreading of the modeling error over the samples before the beginning of the transient results from the segment-by-segment analysis of an input signal in an audio coder. If a transient occurs in the middle of an analysis segment, then either many coding resources are required in order to model the transient accurately, or the modeling error is distributed over the whole analysis segment. Modeling error in the samples preceding a transient is typically perceptually more apparent than in the samples after the transient, because of weaker masking from the transient event itself.
In “Residual modeling in music analysis-synthesis” from Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 2, pp 1005–1008, 1996 it is shown that transient components cannot satisfactorily be represented by sinusoidal and noise models alone.
It has been shown previously in “Robust exponential modeling of audio signals” from Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 6, pp 3581–3584, 1998, that transients can be modeled efficiently using sinusoids with exponentially modulated amplitudes (referred to below as damped sinusoids). In the text below damping coefficients can be any real number, and positive values correspond to increasing amplitudes rather than to truly decreasing amplitudes. In “Robust exponential modeling of audio signals” (see above) an audio signal was analyzed on a segment-by-segment basis and each segment was represented as a sum of damped sinusoids. A problem arises with this type of coding when a transient starts in the middle of a given segment. Compared to the case where a transient starts at the beginning of a segment, the number of damped sinusoids needed to model the transient well increases considerably. If a transient is not modeled properly, the modeling error is distributed over the whole of a given segment, resulting in audible pre-echoes.
In the MPEG-1 Layer III audio coding algorithm, as described in “ISO-MPEG-1 Audio: a generic standard for coding of high-quality digital audio” in the Journal of the Audio Engineering Society, Volume 42, pp 780–792, October 1994, the segmentation is defined simply by the lengths of the long and short windows.
It is an object of the present invention to address the above mentioned disadvantages. To this end the invention provides a method of coding and an apparatus for coding as defined in the independent claims. Advantageous embodiments are defined in the dependent claims.
According to a first aspect of the present invention the coding of an input signal comprises:
    estimating a location of at least one transient in a time segment of the input signal;
    modifying the location of the transient so that the or each transient occurs at a specified location on a predetermined time scale to obtain a modified signal; and
    modeling the modified signal.
The use of restricted time segmentation in the form of a specified location on a predetermined time scale to provide the only locations for the transients advantageously reduces the number of bits needed to describe the segmentation. Also the modification procedure has lower computational cost compared to a full precision segmentation procedure.
Each transient is preferably re-located to a nearest specified location of a plurality of possible locations on the predetermined time scale.
The specified locations on the predetermined time scale may be defined by integer multiples of a predetermined minimum time segment size. The predetermined minimum time segment size may have a length in the range of approximately 1 millisecond (ms) to approximately 9 ms, most preferably in the range of approximately 4 ms to approximately 6 ms.
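The relocation to the nearest specified location can be illustrated by the following minimal sketch. The function name snap_to_grid, its parameters, the default minimum time segment size of 5 ms and the choice to move a transient lying exactly midway between two locations to the later one are assumptions made for illustration only; they are not taken from the described embodiments.

```python
def snap_to_grid(transient_start, sample_rate_hz, min_segment_ms=5.0):
    """Return the grid location (in samples) nearest to transient_start.

    Grid locations are integer multiples of the minimum time segment size.
    All names and defaults here are illustrative assumptions.
    """
    # Grid spacing in whole samples.
    step = round(sample_rate_hz * min_segment_ms / 1000.0)
    if step < 1:
        raise ValueError("minimum segment size is shorter than one sample")
    # Integer arithmetic; a transient exactly midway moves to the later location.
    index = (transient_start + step // 2) // step
    return index * step


# At 48 kHz a 5 ms grid has a spacing of 240 samples.
print(snap_to_grid(1000, 48000))  # nearest multiple of 240 is 960
print(snap_to_grid(1100, 48000))  # nearest multiple of 240 is 1200
```

With such a grid, only the index of the location needs to be described rather than a full-precision sample position.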
The use of a restricted time segmentation as described advantageously simplifies the modeling procedure significantly, if rate-distortion control is used to distribute coding resources between transient, sinusoidal and noise components of the input signal being modeled.
The modeling preferably uses damped sinusoids.
Where the input signal is an audio signal, it is preferably sampled at a rate of approximately 5 to 50 kHz, most preferably 8, 16, 32, 44.1 or 48 kHz. Where the input signal is a video signal, it is preferably sampled at a rate of approximately 5 to 20 MHz.
The restricted time segmentation may also be applied to tonal and/or noise components of an input signal.
The estimation of the location of transients may be carried out using an energy-based approach, preferably with a moving window method, most preferably using two sliding windows.
The use of an energy-based approach allows the advantageous estimation of both very short transients and longer transients.
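One possible reading of an energy-based estimate with two sliding windows is sketched below. The arrangement assumed here, in which the energy of a window starting at a sample is compared with the energy of a window ending at that sample, together with the function name detect_onsets, the window length, the ratio threshold and the energy floor, is hypothetical and is not taken from the described embodiments. Only the beginning of each transient is estimated in this sketch.

```python
def detect_onsets(signal, window=64, ratio_threshold=8.0, floor=1e-6):
    """Estimate transient beginnings from the ratio of two window energies.

    For each sample n, the energy of signal[n:n+window] is divided by the
    energy of signal[n-window:n]. A beginning is reported where this ratio
    exceeds ratio_threshold and is the largest within +/- window samples.
    """
    ratios = {}
    for n in range(window, len(signal) - window + 1):
        before = sum(x * x for x in signal[n - window:n]) + floor
        after = sum(x * x for x in signal[n:n + window]) + floor
        ratios[n] = after / before

    onsets = []
    for n in sorted(ratios):
        r = ratios[n]
        if r <= ratio_threshold:
            continue
        # Keep only the local maximum of the ratio.
        is_peak = all(r >= ratios.get(m, 0.0)
                      for m in range(n - window, n + window + 1))
        # Report one attack only once.
        far_enough = not onsets or n - onsets[-1] > window
        if is_peak and far_enough:
            onsets.append(n)
    return onsets


# Silence, then a burst beginning at sample 200, then silence again.
test_signal = [0.0] * 200 + [1.0] * 100 + [0.0] * 200
print(detect_onsets(test_signal))
```

Shorter windows respond to very short transients, while longer windows favour longer ones; the window length in this sketch is a free parameter.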
Locating the transients may involve locating a beginning and an end of each transient.
Preferably each located transient is moved by a cut and paste method from its original location to begin at a location on the predetermined time scale.
The cut and paste method simply removes that part of the input signal identified as a transient and moves it to the new location. Thus the step is very simple to implement.
A remaining section of the input signal between two located and modified transients is preferably time-warped to fill the gap remaining following the relocation. The time-warp may be a lengthening or a shortening of said remaining section.
Because it makes use of knowledge of sound perception, including pitch perception and temporal masking effects, the time-warping is a simple method with which to restore the remaining signal after modification of the transients.
The time-warping preferably preserves the amplitudes of edge-points of the modified signal, preferably by a band limited interpolation method.
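The preservation of the edge-point amplitudes during time-warping can be illustrated as follows. In this sketch linear interpolation is used as a simple stand-in for the band limited interpolation preferred above, and the function name time_warp_linear is an assumption made for illustration.

```python
def time_warp_linear(section, new_length):
    """Resample section to new_length samples, keeping both edge points.

    Linear interpolation is used here only as a stand-in for band limited
    interpolation; the first and last output samples equal the first and
    last input samples.
    """
    if len(section) < 2 or new_length < 2:
        raise ValueError("need at least two input and two output samples")
    # Spacing of the output samples measured in input samples.
    scale = (len(section) - 1) / (new_length - 1)
    out = []
    for i in range(new_length - 1):
        pos = i * scale
        k = min(int(pos), len(section) - 2)
        frac = pos - k
        out.append((1.0 - frac) * section[k] + frac * section[k + 1])
    # Copy the final edge point exactly to avoid rounding error.
    out.append(section[-1])
    return out


# Lengthening a four-sample ramp to seven samples.
print(time_warp_linear([0.0, 1.0, 2.0, 3.0], 7))
```

The same function shortens a section when new_length is smaller than the original length.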
The time-warp is preferably carried out by interpolation where the change in the fundamental frequency, f0, of the remaining section is less than approximately 0.3%, most preferably less than approximately 0.2%.
Otherwise, the remaining section is preferably split into a first length immediately after the modified transient and a second length. Preferably, the first length is approximately 8 ms to 12 ms, most preferably approximately 10 ms. The first length is preferably interpolated if the change of fundamental frequency caused is no more than approximately 1.6% to 2.4%, most preferably no more than approximately 2%. For the second length, the change of fundamental frequency is preferably not more than about 0.16% to 0.24%, most preferably approximately 0.2%.
Where the interpolation is insufficient to fill a gap in the remaining section an overlap-add procedure is preferably used.
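The choice between interpolation of the whole remaining section, interpolation of the two lengths, and the overlap-add procedure can be sketched as below. The sketch relies on the fact that resampling a section from N samples to M samples scales its fundamental frequency by N/M. The function names, the returned labels and the way in which the capacity of each length is computed are assumptions made for illustration; the default limits are the most preferred values given above.

```python
import math


def f0_change(old_len, new_len):
    """Relative f0 change when old_len samples are resampled to new_len."""
    return abs(old_len / new_len - 1.0)


def capacity(length, limit, lengthen):
    """Most samples a part can gain or lose with f0 change within limit."""
    if lengthen:
        return math.floor(length / (1.0 - limit)) - length
    return length - math.ceil(length / (1.0 + limit))


def plan_time_warp(section_len, new_len, sample_rate_hz,
                   whole_limit=0.002, first_ms=10.0,
                   first_limit=0.02, second_limit=0.002):
    """Return a label naming the procedure assumed to apply (illustrative)."""
    if section_len < 1 or new_len < 1:
        raise ValueError("lengths must be positive")
    # Whole section: f0 change must be less than the limit.
    if f0_change(section_len, new_len) < whole_limit:
        return "interpolate"
    lengthen = new_len > section_len
    # First length directly after the transient, then the second length.
    first = min(section_len, round(sample_rate_hz * first_ms / 1000.0))
    second = section_len - first
    room = (capacity(first, first_limit, lengthen)
            + capacity(second, second_limit, lengthen))
    if abs(new_len - section_len) <= room:
        return "split_interpolate"
    return "overlap_add"


# A one-second section at 48 kHz lengthened by 50, 100 and 120 samples.
for extra in (50, 100, 120):
    print(extra, plan_time_warp(48000, 48000 + extra, 48000))
```

In this example the first length of 480 samples can absorb 9 samples and the second length of 47520 samples can absorb 95 samples, so a lengthening by 100 samples is still handled by interpolation while a lengthening by 120 samples is not.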
The modification of the location of the or each transient may be performed using a transformation into a frequency domain, preferably with a discrete cosine transform. The resulting sinusoidal representation may then be analyzed for transient locations using a Hanning window. Preferably, the Hanning window has a length of approximately 512 samples (where a sample has a length of one divided by a sampling frequency of the input signal), preferably with an overlap between Hanning windows of 256 samples.
The input signal is preferably processed by dividing the input signal into a plurality of time segments. The time segments may have a length in the range of approximately 0.5 s to 2 s, preferably a length of approximately 1 s.
Adjacent time segments are preferably arranged to overlap, preferably by approximately 5% to approximately 15% of their length, more preferably by approximately 10% of the time segment length, which overlap may be approximately 0.1 s. Where a transient is located in an overlap of the adjacent time segments, the transient location is modified in the time segment in which the transient is most centrally located.
The provision of an overlap in adjacent time segments advantageously allows the selection of the time segment in which the transient is most centrally located, or more importantly furthest from the beginning or end of the time segment.
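The selection of the time segment in which a transient is most centrally located can be sketched as follows. The function name choose_segment and the representation of each time segment as a pair of start and end times in seconds are assumptions made for illustration.

```python
def choose_segment(transient_time, segments):
    """Return the index of the segment in which the transient is furthest
    from the nearer segment edge, or None if no segment contains it.

    segments is a list of (start, end) times in seconds (illustrative).
    """
    best_index = None
    best_margin = -1.0
    for index, (start, end) in enumerate(segments):
        if start <= transient_time <= end:
            # Distance to the nearer of the two segment edges.
            margin = min(transient_time - start, end - transient_time)
            if margin > best_margin:
                best_index = index
                best_margin = margin
    return best_index


# Two segments of 1 s that overlap by 0.1 s.
segments = [(0.0, 1.0), (0.9, 1.9)]
print(choose_segment(0.93, segments))  # nearer the start of the overlap
print(choose_segment(0.97, segments))  # nearer the end of the overlap
```

A transient at 0.93 s lies 0.07 s from the end of the first segment but only 0.03 s from the beginning of the second, so the first segment is selected; for a transient at 0.97 s the second segment is selected.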
The invention extends to decoding audio or video signals coded according to the coding of the first aspect.
An apparatus according to an embodiment of the invention may be an audio device, e.g. a solid state audio device.
All of the features disclosed herein can be combined with any of the above aspects, in any combination.
Preferred embodiments of the invention provide coding of signals which has a simpler analysis procedure than has previously been described, which has a lower computational cost than equivalent methods, and which results in a reduction of the number of bits needed to describe a segmented signal.
Additional side information may be included in the bitstream to dewarp the signal at the decoder side. With the appropriate dewarping, temporal misalignment of stereo signals can be avoided.