This invention relates to audio coding systems and methods and in particular, but not exclusively, to such systems and methods for coding audio signals at low bit rates.
In a wide range of applications it is desirable to provide a facility for the efficient storage of audio signals at a low bit rate so that they do not occupy large amounts of memory, for example in computers, portable dictation equipment, personal computer appliances, etc. Equally, where an audio signal is to be transmitted, for example to allow video conferencing, audio streaming, or telephone communication via the Internet, etc., a low bit rate is highly desirable. In both cases, however, high intelligibility and quality are important and this invention is concerned with a solution to the problem of providing coding at very low bit rates whilst preserving a high level of intelligibility and quality, and also of providing a coding system which operates well at low bit rates with both speech and music.
In order to achieve a very low bit rate with speech signals it is generally recognised that a parametric coder or “vocoder” should be used rather than a waveform coder. A vocoder encodes only parameters of the waveform, and not the waveform itself, and produces a signal that sounds like speech but with a potentially very different waveform.
A typical example is the LPC10 vocoder (Federal Standard 1015), as described in T. E. Tremaine, “The Government Standard Linear Predictive Coding Algorithm: LPC10”, Speech Technology, pp 40-49, 1982, superseded by a similar algorithm LPC10e, the contents of both of which are incorporated herein by reference. LPC10 and other vocoders have historically operated in the telephony bandwidth (0-4 kHz) as this bandwidth is thought to contain all the information necessary to make speech intelligible. However, we have found that the quality and intelligibility of speech coded at bit rates as low as 2.4 Kbit/s in this way is not adequate for many current commercial applications.
The problem is that to improve the quality, more parameters are needed in the speech model, but encoding these extra parameters means fewer bits are available for the existing parameters. Various enhancements to the LPC10e model have been proposed, for example in A. V. McCree and T. P. Barnwell III, “A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding”, IEEE Trans. Speech and Audio Processing, Vol. 3, No. 4, July 1995, but even with all these the quality is barely adequate.
In an attempt to further enhance the model we looked at encoding a wider bandwidth (0-8 kHz). This has never been considered for vocoders because the extra bits needed to encode the upper band would appear to vastly outweigh any benefit in encoding it. Wideband encoding is normally only considered for good quality coders, where it is used to add greater naturalness to the speech rather than to increase intelligibility, and requires a lot of extra bits.
One common way of implementing a wideband system is to split the signal into lower and upper sub-bands, to allow the upper sub-band to be encoded with fewer bits. The two bands are decoded separately and then added together as described in the ITU Standard G722 (X. Maitre, “7 kHz audio coding within 64 kbit/s”, IEEE Journal on Selected Areas in Comm., vol.6, No.2, pp283-298, Feb 1988). Applying this approach to a vocoder suggested that the upper band should be analysed with a lower order LPC than the lower band (we found second order adequate). We found it needed a separate energy value, but no pitch and voicing decision, as the ones from the lower band can be used. Unfortunately the recombination of the two synthesized bands produced artifacts which we deduced were caused by phase mismatch between the two bands. We overcame this problem in the decoder by combining the LPC and energy parameters of each band to produce a single, high-order wideband filter, and driving this with a wideband excitation signal.
Surprisingly, the intelligibility of the wideband LPC vocoder for clean speech was significantly higher than that of the telephone bandwidth version at the same bit rate, producing a DRT score (as described in W. D. Voiers, ‘Diagnostic evaluation of speech intelligibility’, in Speech Intelligibility and Speaker Recognition (M. E. Hawley, ed.) pp. 374-387, Dowden, Hutchinson and Ross, Inc., 1977) of 86.8 as opposed to 84.4 for the narrowband coder.
However, for speech with even a small amount of background noise, the synthesised signal sounded buzzy and contained artifacts in the upper band. Our analysis showed that this was because the encoded upper band energy was being boosted by the background noise, which during the synthesis of voiced speech boosted the upper-band harmonics, creating a buzzy effect.
On further detailed investigation we found that the increase in intelligibility was mainly a result of better encoding of the unvoiced fricatives and plosives, not the voiced sections. This led us to a different approach in the decoding of the upper band, where we synthesized only noise, restricting the harmonics of the voiced speech to the lower band only. This removed the buzz, but could instead add hiss if the encoded upper band energy was high, because of upper band harmonics in the input signal. This could be overcome by using the voicing decision, but we found the most reliable way was to divide the upper band input signal into noise and harmonic (periodic) components, and encode only the energy of the noise component.
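By way of illustration only, one simple way of estimating the energy of the noise component is sketched below in Python. The specification does not state at this point how the division into noise and harmonic components is performed, so the method shown is an assumption: it supposes that the harmonic component repeats exactly after one pitch period and that the noise is uncorrelated over that interval, in which case the difference between the signal and a copy delayed by one pitch period contains twice the noise energy. The function name `noise_energy_estimate` and the `pitch_period` argument (in samples) are hypothetical.

```python
def noise_energy_estimate(signal, pitch_period):
    """Estimate the mean-square level of the non-periodic component.

    Assumption: signal = periodic part + uncorrelated noise, so that
    x[n] - x[n - T] cancels the periodic part and has twice the noise energy.
    """
    diffs = [signal[n] - signal[n - pitch_period]
             for n in range(pitch_period, len(signal))]
    if not diffs:
        return 0.0
    return 0.5 * sum(d * d for d in diffs) / len(diffs)
```

Under these assumptions a purely periodic input yields an estimate of zero, so upper band harmonics in the input no longer raise the encoded energy.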
This approach has two unexpected benefits, which greatly enhance the power of the technique. Firstly, as the upper band contains only noise there are no longer problems matching the phase of the upper and lower bands, which means that they can be synthesized completely separately even for a vocoder. In fact the coder for the lower band can be totally separate, and even be an off-the-shelf component. Secondly, the upper band encoding is no longer speech specific, as any signal can be broken down into noise and harmonic components, and can benefit from reproduction of the noise component where otherwise that frequency band would not be reproduced at all. This is particularly true for rock music, which has a strong percussive element to it.
The system takes a fundamentally different approach from other wideband extension techniques, which are based on waveform encoding, as in McElroy et al., “Wideband Speech Coding in 7.2 KB/s”, ICASSP 93, pp II-620-II-623. The problem with waveform encoding is that it either requires a large number of bits, as in G722 (supra), or else poorly reproduces the upper band signal (McElroy et al), adding a lot of quantisation noise to the harmonic components.
In this specification, the term “vocoder” is used broadly to define a speech coder which codes selected model parameters and in which there is no explicit coding of the residual waveform, and the term includes coders such as multi-band excitation coders (MBE) in which the coding is done by splitting the speech spectrum into a number of bands and extracting a basic set of parameters for each band.
The term vocoder analysis is used to describe a process which determines vocoder coefficients including at least LPC coefficients and an energy value. In addition, for a lower sub-band the vocoder coefficients may also include a voicing decision and for voiced speech a pitch value.
According to one aspect of this invention there is provided an audio coding system for encoding and decoding an audio signal, said system including an encoder and a decoder, said encoder comprising:
means for decomposing said audio signal into an upper and a lower sub-band signal;
lower sub-band coding means for encoding said lower sub-band signal;
upper sub-band coding means for encoding at least the non-periodic component of said upper sub-band signal according to a source-filter model;
said decoder means comprising means for decoding said encoded lower sub-band signal and said encoded upper sub-band signal, and for reconstructing therefrom an audio output signal,
wherein said decoding means comprises filter means, and excitation means for generating an excitation signal for being passed by said filter means to produce a synthesised audio signal, said excitation means being operable to generate an excitation signal which includes a substantial component of synthesised noise in a frequency band corresponding to the upper sub-band of said audio signal.
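By way of illustration only, the upper sub-band part of such a decoder might be sketched in Python as follows. Synthesised white noise serves as the excitation and is passed through an all-pole filter built from the decoded upper sub-band LPC coefficients. The function name `synthesise_upper_band`, the treatment of the energy value as a mean-square level per sample, and the gain normalisation applied after filtering are assumptions made for this sketch and are not taken from the specification.

```python
import math
import random

def synthesise_upper_band(lpc, energy, num_samples, seed=0):
    """Generate noise-excited output for the upper sub-band.

    lpc    : predictor coefficients a[1..p], with y[n] = e[n] + sum a[k] * y[n-k]
    energy : target mean-square level of the output frame (assumed convention)
    """
    rng = random.Random(seed)
    excitation = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    # Scale the filtered noise so that the frame has the decoded energy.
    mean_sq = sum(v * v for v in out) / num_samples
    gain = math.sqrt(energy / mean_sq) if mean_sq > 0.0 else 0.0
    return [gain * v for v in out]
```

In a complete decoder the resulting samples would be combined with the separately decoded lower sub-band signal; that step is omitted here.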
Although the decoder means may comprise a single decoding means covering both the upper and lower sub-bands of the encoder, it is preferred for the decoder means to comprise lower sub-band decoding means and upper sub-band decoding means, for receiving and decoding the encoded lower and upper sub-band signals respectively.
In a particular preferred embodiment, said upper frequency band of said excitation signal substantially wholly comprises a synthesised noise signal, although in other embodiments the excitation signal may comprise a mixture of a synthesised noise component and a further component corresponding to one or more harmonics of said lower sub-band audio signal.
Conveniently, the upper sub-band coding means comprises means for analysing and encoding said upper sub-band signal to obtain an upper sub-band energy or gain value and one or more upper sub-band spectral parameters. The one or more upper sub-band spectral parameters preferably comprise second order LPC coefficients.
Preferably, said encoder means includes means for measuring the noise energy in said upper sub-band thereby to deduce said upper sub-band energy or gain value. Alternatively, said encoder means may include means for measuring the whole energy in said upper sub-band signal thereby to deduce said upper sub-band energy or gain value.
To save unnecessary usage of the bit rate, the system preferably includes means for monitoring said energy in said upper sub-band signal and for comparing this with a threshold derived from at least one of the upper and lower sub-band energies, and for causing said upper sub-band encoding means to provide a minimum code output if said monitored energy is below said threshold.
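By way of illustration only, the threshold test might be sketched in Python as follows. The threshold is here derived from the lower sub-band energy alone, using a hypothetical ratio of 0.01, and the "minimum code output" is represented by a record carrying a single inactive flag. The function name, the ratio and the record layout are all assumptions made for this sketch.

```python
def encode_upper_band(upper_energy, lower_energy, lpc, ratio=0.01):
    """Return a full upper-band code, or a minimal one if the band is negligible."""
    threshold = ratio * lower_energy  # assumed derivation of the threshold
    if upper_energy < threshold:
        # Minimum code output: no energy or spectral parameters are sent.
        return {"active": False}
    return {"active": True, "energy": upper_energy, "lpc": list(lpc)}
```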
In arrangements intended primarily for speech coding, said lower sub-band coding means may comprise a speech coder, including means for providing a voicing decision. In these cases, said decoder means may include means responsive to the energy in said upper band encoded signal and said voicing decision to adjust the noise energy in said excitation signal dependent on whether the audio signal is voiced or unvoiced.
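By way of illustration only, such an adjustment might be sketched in Python as follows. The sketch assumes that the noise energy is reduced for voiced frames, in keeping with the observation above that upper band harmonics can otherwise be reproduced as hiss. The function name and the attenuation factor of 0.5 are hypothetical.

```python
def upper_band_noise_energy(decoded_energy, voiced, voiced_attenuation=0.5):
    """Return the noise energy to use in the excitation for this frame."""
    if voiced:
        # Assumed behaviour: attenuate the noise during voiced speech.
        return decoded_energy * voiced_attenuation
    return decoded_energy
```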
Where the system is intended primarily for music, said lower sub-band coding means may comprise any of a number of suitable waveform coders, for example an MPEG audio coder.
The division between the upper and lower sub-bands may be selected according to the particular requirements, thus it may be about 2.75 kHz, about 4 kHz, about 5.5 kHz, etc.
Said upper sub-band coding means preferably encodes said noise component with a very low bit rate of less than 800 bps and preferably of about 300 bps.
Where the upper sub-band is analysed to obtain an energy or gain value and one or more spectral parameters, said upper sub-band signal is preferably analysed with relatively long frame periods to determine said spectral parameters and with relatively short frame periods to determine said energy or gain value.
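By way of illustration only, the two frame periods might be arranged as in the following Python sketch, in which each long frame (from which the spectral parameters would be derived) is subdivided into several short frames, each giving its own energy value. The function name and the frame lengths used in any call are assumptions; the specification does not fix them at this point.

```python
def frame_energies(signal, long_len, shorts_per_long):
    """Split a signal into long frames and compute short-frame energies.

    Returns a list of (long_frame, energies) pairs, where energies holds the
    mean-square level of each short frame inside that long frame.
    """
    short_len = long_len // shorts_per_long
    result = []
    for start in range(0, len(signal) - long_len + 1, long_len):
        long_frame = signal[start:start + long_len]
        energies = []
        for s in range(shorts_per_long):
            seg = long_frame[s * short_len:(s + 1) * short_len]
            energies.append(sum(v * v for v in seg) / short_len)
        result.append((long_frame, energies))
    return result
```

Updating the energy more often than the spectral parameters allows rapid level changes, such as plosives, to be tracked while the spectral shape is sent only once per long frame.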
In another aspect, the invention provides a system and associated method for very low bit rate coding in which the input signal is split into sub-bands, respective vocoder coefficients are obtained for each sub-band, and the coefficients are then recombined into a single LPC filter.
Accordingly in this aspect, the invention provides a vocoder system for compressing a signal at a bit rate of less than 4.8 Kbit/s and for resynthesizing said signal, said system comprising encoder means and decoder means, said encoder means including:
filter means for decomposing said speech signal into lower and upper sub-bands together defining a bandwidth of at least 5.5 kHz;
lower sub-band vocoder analysis means for performing a relatively high order vocoder analysis on said lower sub-band to obtain vocoder coefficients representative of said lower sub-band;
upper sub-band vocoder analysis means for performing a relatively low order vocoder analysis on said upper sub-band to obtain vocoder coefficients representative of said upper sub-band;
coding means for coding vocoder parameters including said lower and upper sub-band coefficients to provide a compressed signal for storage and/or transmission, and
said decoder means including:
decoding means for decoding said compressed signal to obtain vocoder parameters including said lower and upper sub-band vocoder coefficients;
synthesising means for constructing an LPC filter from the vocoder parameters for said upper and lower sub-bands and re-synthesising said speech signal from said filter and from an excitation signal.
Preferably said lower sub-band analysis means applies tenth order LPC analysis and said upper sub-band analysis means applies second order LPC analysis.
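By way of illustration only, LPC analysis of a given order might be sketched in Python using the autocorrelation method with the Levinson-Durbin recursion, which is a standard technique and is assumed here rather than taken from the specification. The function names are hypothetical, and the energy value is taken to be the mean-square level of the frame.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of a frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor a[1..order], where x[n] is predicted as
    sum over k of a[k] * x[n-k]. Returns (coefficients, residual_energy)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_analysis(frame, order):
    """Return (LPC coefficients, energy value) for one frame."""
    r = autocorrelation(frame, order)
    coeffs, _ = levinson_durbin(r, order)
    return coeffs, r[0] / len(frame)
```

Under these assumptions the lower sub-band would be analysed with `lpc_analysis(lower_frame, 10)` and the upper sub-band with `lpc_analysis(upper_frame, 2)`, where `lower_frame` and `upper_frame` stand for the sub-band frames.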
The invention also extends to audio encoders and audio decoders for use with the above systems, and to corresponding methods.
Whilst the invention has been described above, it extends to any inventive combination of the features set out above or in the following description.