Demand for efficient digital narrowband and wideband speech coding techniques with a good trade-off between subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, speech coding applications have mainly used the telephone bandwidth, limited to the range 200-3400 Hz (signal sampled at 8 kHz). However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. In wideband services the input signal is sampled at 16 kHz and the encoded bandwidth is in the range 50-7000 Hz. This bandwidth has been found sufficient for delivering good quality that gives an impression of nearly face-to-face communication. Further quality improvement is achieved with so-called super-wideband, in which the signal is sampled at 32 kHz and the encoded bandwidth is in the range 50-15000 Hz. For speech signals this provides face-to-face quality since almost all energy in human speech is below 14000 Hz. This bandwidth also gives a significant quality improvement with general audio signals including music (wideband is equivalent to AM radio and super-wideband is equivalent to FM radio). Still higher bandwidth has been used for general audio signals, namely the full band 20-20000 Hz (CD quality, sampled at 44.1 kHz or 48 kHz).
A sound encoder converts a sound signal (speech or audio) into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The sound signal is digitized, that is, sampled and quantized, usually with 16 bits per sample. The role of the sound encoder is to represent these digital samples with a smaller number of bits while maintaining good subjective quality. The sound decoder operates on the transmitted or stored bit stream and converts it back to a sound signal.
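By way of a non-limiting illustration, the amount of reduction asked of the encoder can be estimated from the sampling rate and the sample resolution. The following sketch computes the bit rate of the uncompressed digitized signal and its ratio to a coded bit rate. The function names are hypothetical, and the 16-bit default is taken from the usual case mentioned above.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample=16):
    """Bit rate, in bit/s, of the uncompressed digitized signal."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(sample_rate_hz, coded_bit_rate, bits_per_sample=16):
    """Ratio between the uncompressed bit rate and the coded bit rate."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample) / coded_bit_rate


if __name__ == "__main__":
    # Narrowband, wideband and super-wideband sampling rates.
    for rate_hz in (8000, 16000, 32000):
        print(rate_hz, "Hz ->", pcm_bit_rate(rate_hz), "bit/s uncompressed")
```

For example, a narrowband signal sampled at 8 kHz with 16 bits per sample occupies 128 kbit/s, so coding it at 8 kbit/s corresponds to a reduction by a factor of 16.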
Code-Excited Linear Prediction (CELP) coding is one of the best prior techniques for achieving a good compromise between the subjective quality and bit rate. This coding technique is a basis of several speech coding standards both in wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The L-sample frame is divided into smaller blocks called subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
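As a simplified, non-limiting sketch of the decoder side described above, the excitation of one subframe can be formed as a gain-scaled adaptive codebook vector (taken from the past excitation at a pitch lag) plus a gain-scaled fixed-codebook vector, and then passed through the LP synthesis filter 1/A(z). All names, gains and coefficient values below are hypothetical. The periodic extension used when the lag is shorter than the subframe is one common simplification and is not taken from any particular standard.

```python
def adaptive_vector(past_exc, pitch_lag, n):
    """Adaptive codebook vector: past excitation repeated at the pitch lag."""
    buf = list(past_exc)
    vec = []
    for _ in range(n):
        sample = buf[-pitch_lag]  # the sample one pitch period back
        vec.append(sample)
        buf.append(sample)        # periodic extension for short lags
    return vec


def build_excitation(past_exc, pitch_lag, pitch_gain, code_vector, code_gain):
    """Excitation of one subframe: adaptive part plus fixed-codebook part."""
    adaptive = adaptive_vector(past_exc, pitch_lag, len(code_vector))
    return [pitch_gain * a + code_gain * c
            for a, c in zip(adaptive, code_vector)]


def lp_synthesis(exc, lp_coeffs, memory):
    """Filter exc through 1/A(z), with A(z) = 1 + sum_k a_k z^-k.

    memory holds at least len(lp_coeffs) past output samples, newest last.
    """
    mem = list(memory)
    out = []
    for e in exc:
        s = e - sum(a * mem[-k] for k, a in enumerate(lp_coeffs, start=1))
        out.append(s)
        mem.append(s)
    return out


if __name__ == "__main__":
    exc = build_excitation([1.0, 0.0], 2, 0.5, [0.0, 1.0, 0.0, 0.0], 2.0)
    print(lp_synthesis(exc, [-0.5], [0.0]))
```

In an actual codec the lag, the gains, the codebook index and the LP coefficients would be decoded from the received bit stream. Here they are passed in directly to keep the example self-contained.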
The use of source-controlled variable bit rate (VBR) speech coding significantly improves the system capacity. In source-controlled VBR coding, the codec uses a signal classification module, and each speech frame is encoded with a coding model optimized for the nature of that frame (e.g. voiced, unvoiced, transient, background noise). Further, different bit rates can be used for each class. The simplest form of source-controlled VBR coding is to use voice activity detection (VAD) and encode the inactive speech frames (background noise) at a very low bit rate. Discontinuous transmission (DTX) can further be used, in which no data is transmitted in the case of stable background noise. The decoder uses comfort noise generation (CNG) to generate the background noise characteristics. VAD/DTX/CNG results in a significant reduction in the average bit rate, and in packet-switched applications it significantly reduces the number of routed packets. VAD algorithms work well with speech signals but may result in severe problems in the case of music signals. Segments of music signals can be classified as unvoiced signals and consequently may be encoded with an unvoiced-optimized model, which severely affects the music quality. Moreover, some segments of stable music signals may be classified as stable background noise, and this may trigger the update of the background noise in the VAD algorithm, which degrades the performance of the algorithm. Therefore, it would be advantageous to extend the VAD algorithm to better discriminate music signals. In the present disclosure, this algorithm will be referred to as a Sound Activity Detection (SAD) algorithm, where sound could be speech or music or any useful signal. The present disclosure also describes a method for tonal stability detection used to improve the performance of the SAD algorithm in the case of music signals.
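The simplest VAD/DTX arrangement mentioned above can be sketched as follows, purely for illustration. The energy threshold detector, the periodic noise-description update (labelled "SID" here), the update interval and the bit counts are all assumptions made for the example and are not taken from the present disclosure or from any standard. The SAD algorithm and the tonal stability detection are not represented.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def classify_frames(frames, threshold):
    """Toy VAD: a frame is active when its energy exceeds the threshold."""
    return [frame_energy(f) > threshold for f in frames]


def dtx_schedule(active_flags, sid_interval=8):
    """Per-frame transmission decision: 'SPEECH', 'SID' or 'NO_DATA'.

    A noise description (SID) is sent on the first inactive frame and then
    once every sid_interval inactive frames; nothing is sent in between.
    """
    schedule = []
    since_sid = None  # inactive frames elapsed since the last SID
    for active in active_flags:
        if active:
            schedule.append("SPEECH")
            since_sid = None
        elif since_sid is None or since_sid + 1 >= sid_interval:
            schedule.append("SID")
            since_sid = 0
        else:
            schedule.append("NO_DATA")
            since_sid += 1
    return schedule


def average_bit_rate(schedule, bits_per_type, frame_ms):
    """Average bit rate in bit/s over the scheduled frames."""
    total_bits = sum(bits_per_type[t] for t in schedule)
    return total_bits * 1000.0 / (len(schedule) * frame_ms)


if __name__ == "__main__":
    flags = [True, False, False, False, False, True]
    plan = dtx_schedule(flags, sid_interval=3)
    bits = {"SPEECH": 160, "SID": 40, "NO_DATA": 0}  # hypothetical sizes
    print(plan, average_bit_rate(plan, bits, frame_ms=20))
```

With these hypothetical figures the active frames alone would run at 8 kbit/s, while the average over the six frames shown falls well below that, which illustrates the reduction in average bit rate and in the number of transmitted packets.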
Another aspect of speech and audio coding is the concept of embedded coding, also known as layered coding. In embedded coding, the signal is encoded in a first layer to produce a first bit stream, and then the error between the original signal and the encoded signal from the first layer is further encoded to produce a second bit stream. This can be repeated for more layers by encoding the error between the original signal and the coded signal from all preceding layers. The bit streams of all layers are concatenated for transmission. The advantage of layered coding is that parts of the bit stream (corresponding to upper layers) can be dropped in the network (e.g. in case of congestion), and the receiver is still able to decode the signal at a quality that depends on the number of received layers. Layered encoding is also useful in multicast applications where the encoder produces the bit stream of all layers and the network decides to send different bit rates to different end points depending on the available bit rate in each link.
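The layered principle described above can be illustrated with a minimal sketch in which each layer is a plain uniform quantizer with a finer step than the previous one. The quantizer stands in for the actual per-layer coding techniques and is an assumption made only for the example, as are the function names and step sizes.

```python
def quantize(values, step):
    """Uniform quantizer, used as a stand-in for a real layer codec."""
    return [step * round(v / step) for v in values]


def encode_layers(signal, steps):
    """Each layer codes the error left by all preceding layers."""
    layers = []
    recon = [0.0] * len(signal)
    for step in steps:
        error = [s - r for s, r in zip(signal, recon)]
        layer = quantize(error, step)
        layers.append(layer)
        recon = [r + q for r, q in zip(recon, layer)]
    return layers


def decode_layers(received_layers, length):
    """Sum whichever layers arrived; dropped upper layers are left out."""
    recon = [0.0] * length
    for layer in received_layers:
        recon = [r + q for r, q in zip(recon, layer)]
    return recon


def max_abs_error(signal, recon):
    """Largest absolute reconstruction error."""
    return max(abs(s - r) for s, r in zip(signal, recon))


if __name__ == "__main__":
    signal = [0.9, -0.35, 0.1]
    layers = encode_layers(signal, [0.5, 0.125, 0.03125])
    for kept in range(1, len(layers) + 1):
        recon = decode_layers(layers[:kept], len(signal))
        print(kept, "layer(s): max error", max_abs_error(signal, recon))
```

Decoding only `layers[:1]` corresponds to a receiver that obtained the core layer alone; each additional layer that arrives reduces the remaining error.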
Embedded or layered coding can also be useful for improving the quality of widely used existing codecs while still maintaining interoperability with these codecs. Adding more layers to the standard codec core layer can improve the quality and even increase the encoded audio signal bandwidth. One example is the recently standardized ITU-T Recommendation G.729.1, in which the core layer is interoperable with the widely used G.729 narrowband standard at 8 kbit/s and the upper layers produce bit rates up to 32 kbit/s (with a wideband signal starting from 16 kbit/s). Current standardization work aims at adding more layers to produce a super-wideband codec (14 kHz bandwidth) and stereo extensions. Another example is ITU-T Recommendation G.718 for encoding wideband signals at 8, 12, 16, 24 and 32 kbit/s. That codec is also being extended to encode super-wideband and stereo signals at higher bit rates.
The requirements for embedded codecs usually call for good quality with both speech and audio signals. Since speech can be encoded at a relatively low bit rate using a model-based approach, the first layer (or the first two layers) is encoded using a speech-specific technique, and the error signal for the upper layers is encoded using a more generic audio encoding technique. This delivers good speech quality at low bit rates and good audio quality as the bit rate is increased. In G.718 and G.729.1, the first two layers are based on the ACELP (Algebraic Code-Excited Linear Prediction) technique, which is suitable for encoding speech signals. In the upper layers, transform-based encoding suitable for audio signals is used to encode the error signal (the difference between the original signal and the output from the first two layers). The well-known MDCT (Modified Discrete Cosine Transform) is used, whereby the error signal is transformed into the frequency domain. In the super-wideband layers, the signal above 7 kHz is encoded using a generic coding model or a tonal coding model. The above-mentioned tonal stability detection can also be used to select the proper coding model to be used.
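For illustration only, the transform step applied to the error signal can be sketched with a direct-form MDCT and its inverse. The sine window, the 50 percent overlap, the 2/N scaling of the inverse and the function names are assumptions chosen for the example; the windows and fast algorithms of G.718 and G.729.1 are not reproduced. Quantization of the coefficients, which is where the actual coding takes place, is omitted so that the overlap-add reconstruction can be checked.

```python
import math


def sine_window(n_half):
    """Sine window of length 2N, satisfying w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half))
            for n in range(2 * n_half)]


def mdct(frame):
    """Direct O(N^2) MDCT: 2N input samples to N coefficients."""
    n_half = len(frame) // 2
    n0 = 0.5 + n_half / 2
    return [sum(x * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for n, x in enumerate(frame))
            for k in range(n_half)]


def imdct(coeffs):
    """Direct inverse MDCT: N coefficients to 2N time-aliased samples."""
    n_half = len(coeffs)
    n0 = 0.5 + n_half / 2
    return [(2.0 / n_half) *
            sum(c * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for k, c in enumerate(coeffs))
            for n in range(2 * n_half)]


def analyze_synthesize(error_signal, n_half):
    """Windowed MDCT and inverse MDCT with 50% overlap-add."""
    win = sine_window(n_half)
    out = [0.0] * len(error_signal)
    last_start = len(error_signal) - 2 * n_half
    for start in range(0, last_start + 1, n_half):
        block = error_signal[start:start + 2 * n_half]
        coeffs = mdct([w * s for w, s in zip(win, block)])
        # A real upper layer would quantize and transmit coeffs here.
        rec = imdct(coeffs)
        for n in range(2 * n_half):
            out[start + n] += win[n] * rec[n]
    return out


if __name__ == "__main__":
    err = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
    rec = analyze_synthesize(err, 4)
    print(max(abs(a - b) for a, b in zip(err[4:12], rec[4:12])))
```

Only the samples covered by two overlapping frames are reconstructed exactly; the first and last half-frames of the toy signal have no neighbouring frame to cancel their time-domain aliasing.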