This invention relates to the field of processing audio signals, such as speech signals that are compressed or encoded with a digital signal processing technique. More specifically, the invention relates to an improved method and an apparatus for coding speech signals that can be particularly useful in the field of wireless communications.
In communication applications where channel bandwidth is at a premium, it is essential to use the smallest possible portion of a transmission channel in order to transmit a voice signal. A common solution is to process the voice signal with an apparatus called a speech codec before it is transmitted on an RF channel.
Speech codecs, which include an encoding stage and a decoding stage, compress the digital signals at the source and decompress them at the reception point in order to optimize the use of transmission channels. By encoding only the necessary characteristics of a speech signal, a codec transmits fewer bits than would be required to reproduce the original waveform, without significantly degrading the speech quality. With fewer bits required, lower bit rate transmission can be achieved.
Most state-of-the-art codecs are based on the original CELP model proposed by Schroeder and Atal in "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Proceedings of ICASSP, pp. 937-940, 1985. This document is hereby incorporated by reference. This basic codec model has been improved in many respects to achieve bit rates of approximately 8 kbits/sec and even lower, but the voice quality of codecs operating at the lower bit rates may not be acceptable for telephony applications. An example of an 8 kbits/sec codec is fully described in version 5.0 of the International Telecommunication Union Telecommunications Standardization Sector (ITU-TSS) Draft Recommendation G.729, "Coding of speech at 8 kbits/s using Conjugate-Structure Algebraic-Code-Excited Linear-Predictive (CS-ACELP) coding", dated Jun. 8, 1995. This document is hereby incorporated by reference.
Considering that lower bit rates at acceptable speech quality levels provide great economic advantages, there exists a need in the industry to provide an improved speech coding apparatus and method particularly well suited for telecommunications applications.
A general object of the invention is to provide an improved audio signal coding device, such as a Linear Predictive (LP) encoder, that achieves audio coding at low bit rates while maintaining audio quality at a level acceptable for communication applications.
In this specification, the term "filter coefficients" is intended to refer to any set of coefficients that uniquely defines a filter function that models the spectral characteristics of an audio signal. In conventional audio signal encoders, several different types of coefficients are known, including linear prediction coefficients, reflection coefficients, arcsines of the reflection coefficients, line spectrum pairs, and log area ratios, among others. These different types of coefficients are usually related by mathematical transformations and have different properties that suit them to different applications. Thus, the term "filter coefficients" is intended to encompass any of these types of coefficients.
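By way of illustration only, the transformations that relate some of these coefficient types can be sketched as follows. The sketch assumes the prediction filter is written A(z) = 1 + a1 z^-1 + ... + ap z^-p and that a log area ratio is defined as ln((1 + k) / (1 - k)); both are sign conventions chosen for this example, other texts use the opposite signs, and the function names are hypothetical.

```python
import math

def reflection_to_lpc(k):
    """Step-up recursion: reflection coefficients -> [1, a1, ..., ap]."""
    a = [1.0]
    for km in k:
        m = len(a)            # order of the filter after this step
        prev = a + [0.0]      # previous-order coefficients, zero-padded
        a = [prev[i] + km * prev[m - i] for i in range(m + 1)]
    return a

def lpc_to_reflection(a):
    """Step-down recursion: [1, a1, ..., ap] -> reflection coefficients."""
    a = list(a)
    k = []
    while len(a) > 1:
        m = len(a) - 1
        km = a[m]             # the last coefficient is the reflection coefficient
        k.append(km)
        denom = 1.0 - km * km # requires |km| < 1 (stable filter)
        a = [(a[i] - km * a[m - i]) / denom for i in range(m)]
    k.reverse()
    return k

def reflection_to_lar(k):
    """Log area ratios under the convention assumed in this example."""
    return [math.log((1.0 + km) / (1.0 - km)) for km in k]

def lar_to_reflection(lar):
    """Inverse of reflection_to_lar: k = tanh(lar / 2)."""
    return [math.tanh(v / 2.0) for v in lar]

def reflection_to_arcsine(k):
    """Arcsines of the reflection coefficients."""
    return [math.asin(km) for km in k]
```

Each representation can be recovered from the others, which is why the term can cover all of them; for example, lpc_to_reflection(reflection_to_lpc(k)) returns k up to rounding error.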
In this specification, the term "excitation segment" is defined as information that needs to be combined with the filter coefficients in order to provide a complete representation of the audio signal. Such an excitation segment may include parametric information describing the periodicity of the speech signal, a residual (often referred to as the "excitation signal") as computed by the encoder of a vocoder, speech framing control information to ensure synchronous framing in the decoder associated with the remote vocoder, pitch periods, pitch lags, gains and relative gains, among others.
In this specification, the term "sample" refers to the amplitude value of a signal at one specific instant in time. PCM (Pulse Code Modulation) is a form of coding of an analog signal that produces a plurality of samples, each sample representing the amplitude of the waveform at a certain time.
The term "audio signal sub-frame" refers to a set of samples that represent a portion of an audio signal such as speech. For example, in an embodiment of this invention, sub-frames of 40 samples were used. Also, "audio signal frames" are defined as a plurality of sample sets, each set being representative of a sub-frame. In a specific example, an audio signal frame has four sub-frames.
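As a minimal sketch of the framing described above, a sequence of samples can be grouped into frames of four sub-frames of 40 samples each. The function name is hypothetical, and the handling of a trailing partial frame (it is simply dropped here) is an assumption made for this example only.

```python
SUBFRAME_LEN = 40          # samples per sub-frame, as in the example embodiment
SUBFRAMES_PER_FRAME = 4    # sub-frames per frame, as in the specific example

def split_into_frames(samples):
    """Group samples into frames, each a list of SUBFRAMES_PER_FRAME sub-frames.

    Samples left over after the last complete frame are dropped
    (an assumption of this sketch).
    """
    frame_len = SUBFRAME_LEN * SUBFRAMES_PER_FRAME
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([frame[i:i + SUBFRAME_LEN]
                       for i in range(0, frame_len, SUBFRAME_LEN)])
    return frames
```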
In a most preferred embodiment, the audio signal encoding device encodes an audio signal, such as a speech signal, differently in dependence upon the voiced/unvoiced characteristics of the signal. In this embodiment, the audio signal encoding device includes two signal synthesis stages, one better suited for unvoiced signals and one better suited for voiced signals. In operation, each signal synthesis stage generates a synthesized speech signal based on a set of parameters, such as filter coefficients and an excitation segment, computed to best approximate the input speech signal sub-frame. The two synthesized signals are compared, and the one that manifests less error with respect to the input speech signal is selected as the best match. The parameters previously computed for the selected synthesized signal are the ones used to form the compressed or encoded audio signal sub-frame.
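The comparison between the two synthesized signals can be sketched as follows. The error measure is not specified in this passage; the sketch assumes a plain sum of squared differences, whereas a practical codec might use a perceptually weighted error. The function names and the layout of the candidates are hypothetical.

```python
def squared_error(target, candidate):
    """Sum of squared sample differences (assumed error measure)."""
    return sum((t - c) ** 2 for t, c in zip(target, candidate))

def select_best_stage(target, candidates):
    """Pick the synthesis stage whose output is closest to the input sub-frame.

    `candidates` maps a stage name to a (parameters, synthesized_subframe)
    pair. Returns the winning stage name, its parameters and its error.
    """
    best_name = min(
        candidates,
        key=lambda name: squared_error(target, candidates[name][1]),
    )
    parameters, synthesized = candidates[best_name]
    return best_name, parameters, squared_error(target, synthesized)
```

Only the parameters of the winning stage would then be placed in the encoded sub-frame; the synthesized samples themselves are not transmitted.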
The major difference between the signals produced by the voiced signal synthesis stage and the unvoiced signal synthesis stage resides in the periodicity or pitch of the signals. The synthesized voiced signal manifests a higher periodicity than the synthesized unvoiced signal.
In a specific example, the voiced signal synthesis stage comprises an adaptive codebook containing prior knowledge entries that are past audio signal sub-frames. The output of this codebook provides the periodic component of the signal generated by the voiced signal synthesis stage. Selecting an entry from a pulse stochastic codebook and passing this entry into a synthesis filter produces the aperiodic component.
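A minimal sketch of the voiced signal synthesis stage follows. It assumes that the periodic and aperiodic components are each scaled by a gain and summed, that the adaptive codebook is addressed by a pitch lag counted back from the most recent sample, that an entry is repeated periodically when the lag is shorter than the sub-frame, and that the synthesis filter is the all-pole filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p. None of these details is stated in the passage above, and all names are hypothetical.

```python
def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z); `a` holds a1..ap (leading 1 omitted)."""
    p = len(a)
    hist = [0.0] * p                      # hist[0] is y[n-1], hist[1] is y[n-2], ...
    out = []
    for x in excitation:
        y = x - sum(a[k] * hist[k] for k in range(p))
        out.append(y)
        hist = ([y] + hist)[:p]           # shift the new output into the history
    return out

def adaptive_entry(past_signal, lag, length):
    """Read `length` samples starting `lag` samples back in the past signal.

    When lag < length, the segment is repeated periodically (assumption).
    Requires 1 <= lag <= len(past_signal).
    """
    start = len(past_signal) - lag
    return [past_signal[start + (n % lag)] for n in range(length)]

def synthesize_voiced(past_signal, lag, pitch_gain, pulse_entry, pulse_gain, a):
    """Periodic component from the adaptive codebook plus a filtered pulse entry."""
    periodic = adaptive_entry(past_signal, lag, len(pulse_entry))
    aperiodic = synthesis_filter([pulse_gain * v for v in pulse_entry], a)
    return [pitch_gain * p + q for p, q in zip(periodic, aperiodic)]
```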
The unvoiced signal synthesis stage comprises a noise stochastic codebook that issues a sample noise signal used as input to a synthesis filter. The output of the synthesis filter is the synthetic unvoiced audio signal.
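The unvoiced signal synthesis stage can be sketched in the same spirit. The sketch assumes a codebook of pseudo-random Gaussian sequences, an exhaustive search over its entries, a gain chosen in closed form to minimize the squared error, and an all-pole synthesis filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p. The passage above does not specify how the codebook is populated or searched, so these choices and all names are hypothetical.

```python
import random

def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z); `a` holds a1..ap (leading 1 omitted)."""
    p = len(a)
    hist = [0.0] * p
    out = []
    for x in excitation:
        y = x - sum(a[k] * hist[k] for k in range(p))
        out.append(y)
        hist = ([y] + hist)[:p]
    return out

def build_noise_codebook(size, length, seed=0):
    """Fixed table of Gaussian noise sequences (hypothetical codebook contents)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]

def search_noise_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry that best matches `target`."""
    best = None
    for index, entry in enumerate(codebook):
        shape = synthesis_filter(entry, a)        # synthetic unvoiced signal, unit gain
        energy = sum(v * v for v in shape)
        if energy == 0.0:
            continue                              # an all-zero entry cannot match anything
        gain = sum(t * v for t, v in zip(target, shape)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, shape))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Because the decoder can hold the same codebook, only the selected index and gain would need to be transmitted for the sub-frame, together with the filter coefficients.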
In accordance with a broad aspect, the invention provides an audio signal encoding device, including an input for receiving a sub-frame of an audio signal to be encoded, an adaptive codebook and a processing unit. The adaptive codebook stores at least one prior knowledge entry, the prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame. The processing unit is in operative relationship with the input and with the adaptive codebook and generates a set of parameters from which a certain synthesized audio signal sub-frame can be generated, on the basis of at least the sub-frame of the audio signal received at the input and the data element in the adaptive codebook.
In accordance with another broad aspect, the invention provides an audio signal decoding device for synthesizing a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame. The audio signal decoding device includes an input for receiving the set of parameters derived from the original audio signal sub-frame, an adaptive codebook and a processing unit. The adaptive codebook stores at least one prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame synthesized by the audio signal decoding device. The processing unit is in operative relationship with the input and with the adaptive codebook and synthesizes the certain audio signal sub-frame on a basis of at least the set of parameters received at the input and the data element in the adaptive codebook.
In accordance with another broad aspect, the invention provides a method for synthesizing a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame. The set of parameters derived from the original audio signal sub-frame is received. An adaptive codebook is provided in which at least one prior knowledge entry is stored, the prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame. The certain audio signal sub-frame is synthesized on a basis of at least the set of parameters received and the data element in the adaptive codebook.
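The role of the adaptive codebook in the synthesizing method can be sketched as follows. The sketch assumes that the codebook is a bounded history of previously synthesized samples, that the received parameters carry a pitch lag, a pitch gain and an already filtered aperiodic component (called "innovation" here), and that the codebook is updated with each newly synthesized sub-frame. The parameter layout, the history length and all names are hypothetical.

```python
from collections import deque

class AdaptiveCodebook:
    """Bounded history of previously synthesized samples."""

    def __init__(self, max_len=160):      # history length is an assumption
        self.history = deque([0.0] * max_len, maxlen=max_len)

    def entry(self, lag, length):
        """Read `length` samples starting `lag` samples back.

        Requires 1 <= lag <= len(self.history). The segment is repeated
        periodically when lag < length (assumption).
        """
        past = list(self.history)
        start = len(past) - lag
        return [past[start + (n % lag)] for n in range(length)]

    def update(self, subframe):
        """Append a synthesized sub-frame; the oldest samples fall out."""
        self.history.extend(subframe)

def decode_subframe(params, codebook):
    """Synthesize one sub-frame from received parameters and the codebook."""
    innovation = params["innovation"]
    periodic = codebook.entry(params["lag"], len(innovation))
    out = [params["pitch_gain"] * p + v for p, v in zip(periodic, innovation)]
    codebook.update(out)                  # the new sub-frame becomes prior knowledge
    return out
```

If the encoding device applies the same update to its own adaptive codebook, both ends hold identical prior knowledge entries without those samples ever being transmitted.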