Field
This disclosure relates to encoding of audio signals.
Background
Transmission of audio information, such as speech and/or music, by digital techniques has become widespread, particularly in long distance telephony, packet-switched telephony such as Voice over IP (also called VoIP, where IP denotes Internet Protocol), and digital radio telephony such as cellular telephony. Such proliferation has created interest in reducing the amount of information used to transfer a voice communication over a transmission channel while maintaining the perceived quality of the reconstructed speech. For example, it is desirable to make efficient use of available system bandwidth (especially in wireless systems). One way to use system bandwidth efficiently is to employ signal compression techniques. For systems that carry speech signals, speech compression (or “speech coding”) techniques are commonly employed for this purpose.
Devices that are configured to compress speech by extracting parameters that relate to a model of human speech generation are often called audio coders, voice coders, codecs, vocoders, or speech coders, and the description that follows uses these terms interchangeably. An audio coder generally includes an encoder and a decoder. The encoder typically receives a digital audio signal as a series of blocks of samples called “frames,” analyzes each frame to extract certain relevant parameters, and quantizes the parameters to produce a corresponding series of encoded frames. The encoded frames are transmitted over a transmission channel (i.e., a wired or wireless network connection) to a receiver that includes a decoder. Alternatively, the encoded audio signal may be stored for retrieval and decoding at a later time. The decoder receives and processes encoded frames, dequantizes them to produce the parameters, and recreates speech frames using the dequantized parameters.
Code-excited linear prediction (CELP) is a coding scheme that attempts to match the waveform of the original audio signal. It may be desirable to encode frames of a speech signal, especially voiced frames, using a variant of CELP that is called relaxed CELP (“RCELP”). In an RCELP coding scheme, the waveform-matching constraints are relaxed. An RCELP coding scheme is a pitch-regularizing (PR) coding scheme, in that the variation among pitch periods of the signal (also called the “delay contour”) is regularized, typically by changing the relative positions of the pitch pulses to match or approximate a smoother, synthetic delay contour. Pitch regularization typically allows the pitch information to be encoded in fewer bits with little to no decrease in perceptual quality. Typically, no information specifying the regularization amounts is transmitted to the decoder. The following documents describe coding systems that include an RCELP coding scheme: the Third Generation Partnership Project 2 (3GPP2) document C.S0030-0, v3.0, entitled “Selectable Mode Vocoder (SMV) Service Option for Wideband Spread Spectrum Communication Systems,” January 2004; and the 3GPP2 document C.S0014-C, v1.0, entitled “Enhanced Variable Rate Codec, Speech Service Options 3, 68, and 70 for Wideband Spread Spectrum Digital Systems,” January 2007. Other coding schemes for voiced frames, including prototype waveform interpolation (PWI) schemes such as prototype pitch period (PPP), may also be implemented as PR (e.g., as described in part 4.2.4.3 of the 3GPP2 document C.S0014-C referenced above). Common ranges of pitch frequency for male speakers include 50 or 70 to 150 or 200 Hz, and common ranges of pitch frequency for female speakers include 120 or 140 to 300 or 400 Hz.*
Audio communications over the public switched telephone network (“PSTN”) have traditionally been limited in bandwidth to the frequency range of 300-3400 kilohertz (kHz). More recent networks for audio communications, such as networks that use cellular telephony and/or VoIP, may not have the same bandwidth limits, and it may be desirable for apparatus using such networks to have the ability to transmit and receive audio communications that include a wideband frequency range. For example, it may be desirable for such apparatus to support an audio frequency range that extends down to 50 Hz and/or up to 7 or 8 kHz. It may also be desirable for such apparatus to support other applications, such as high-quality audio or audio/video conferencing, delivery of multimedia services such as music and/or television, etc., that may have audio speech content in ranges outside the traditional PSTN limits.
Extension of the range supported by a speech coder into higher frequencies may improve intelligibility. For example, the information in a speech signal that differentiates fricatives such as ‘s’ and ‘f’ is largely in the high frequencies. Highband extension may also improve other qualities of the decoded speech signal, such as presence. For example, even a voiced vowel may have spectral energy far above the PSTN frequency range.