1. Field of the Invention
Embodiments are directed to methods and means for decoding background noise information in speech signal encoding methods.
2. Background of the Related Art
Since the beginnings of telecommunication, the bandwidth for analog voice transmission in telephone calls has been limited. Voice transmission takes place in a limited frequency range of 300 Hz to 3400 Hz.
Such a limited range of frequencies is also designated in many voice signal encoding methods for present-day digital telecommunications. To this end, the bandwidth of the analog signal is limited prior to any encoding procedure. A codec is used for coding and decoding; because its bandwidth is limited to the range between 300 Hz and 3400 Hz as described, it is referred to in the following text as a narrowband speech codec. The term codec is understood to mean both the encoding rule for digitally encoding audio signals and the decoding rule for decoding data with the goal of reconstructing the audio signal.
One example of a narrowband speech codec is known as the ITU-T Standard G.729. The encoding rule described therein provides for transmission of a narrowband speech signal at a bit rate of 8 kbit/s.
Moreover, so-called wideband speech codecs are known, which provide encoding in an expanded frequency range for the purpose of improving the auditory impression. Such an expanded frequency range lies, for example, between a frequency of 50 Hz and 7000 Hz. One example of a wideband speech codec is known as the ITU-T Standard G.729.EV.
Customarily, encoding methods for wideband speech codecs are configured so as to be scalable. Scalability is here taken to mean that the transmitted encoded data contain various delimited blocks, which contain the narrowband component, the wideband component, and/or the full bandwidth of the encoded speech signal. Such a scalable configuration, on the one hand, allows downward compatibility on the part of the recipient and, on the other hand, in the case of limited data transmission capacities in the transmission channel, makes it easy for the sender and recipient to adjust the bit rate and the size of transmitted data frames.
To reduce the data transmission rate by means of a codec, customarily the data to be transmitted are compressed. Compression is achieved, for example, by encoding methods in which parameters for an excitation signal and filter parameters are specified for encoding the speech data. The filter parameters as well as the parameter that specifies the excitation signal are then transmitted to the receiver. There, with the aid of the codec, a synthetic speech signal is synthesized, which resembles the original speech signal as closely as possible in terms of a subjective auditory impression. With the aid of this method, which is also referred to as the “analysis by synthesis” method, the samples that are established and digitized are not transmitted themselves, but rather the parameters that were ascertained, which render a synthesis of the speech signal possible on the receiver's side.
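The decoder side of this parametric approach can be illustrated with a minimal sketch. It assumes that the transmitted filter parameters describe an all-pole synthesis filter, which is common in speech codecs but is not stated in the text above; the function name `synthesize` and the coefficient convention are hypothetical.

```python
def synthesize(excitation, filter_coeffs):
    """Rebuild a signal from an excitation and all-pole filter parameters.

    Assumed convention: s[n] = e[n] - sum_k a_k * s[n - k], k = 1..order.
    """
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(filter_coeffs, start=1):
            if n - k >= 0:
                # Feed back previously synthesized samples.
                acc -= a * output[n - k]
        output.append(acc)
    return output


# A single excitation pulse through a first-order filter decays geometrically.
print(synthesize([1.0, 0.0, 0.0, 0.0], [-0.5]))
```

Only the excitation description and the filter coefficients would need to be transmitted in such a scheme, not the samples themselves.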
A method for discontinuous transmission, which is also known in the field as DTX, affords an additional way to reduce the data transmission rate. The fundamental goal of DTX is to reduce the data transmission rate when there is a pause in speaking.
To this end, the sender employs speech pause recognition (Voice Activity Detection, VAD), which recognizes a speech pause if the signal falls below a certain level.
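A level-based speech pause decision of this kind might be sketched as follows. The threshold of -40 dB, the sample scaling to the range -1 to 1, and the function names are assumptions chosen for illustration; practical detectors typically use additional features.

```python
import math


def frame_level_db(samples):
    """Mean-square level of one frame in dB (samples assumed in [-1, 1])."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    energy = sum(s * s for s in samples) / len(samples)
    # A small offset avoids log10(0) for an all-zero frame.
    return 10.0 * math.log10(energy + 1e-12)


def is_speech_pause(samples, threshold_db=-40.0):
    """Classify a frame as a speech pause if its level is below the threshold."""
    return frame_level_db(samples) < threshold_db


quiet_frame = [0.001] * 160   # roughly -60 dB
loud_frame = [0.5] * 160      # roughly -6 dB
print(is_speech_pause(quiet_frame), is_speech_pause(loud_frame))
```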
Customarily, the receiver does not expect complete silence during a speech pause. On the contrary, complete silence would lead to annoyance on the receiver's part or even to the suspicion that the connection had been interrupted. For this reason, methods are employed to produce a so-called comfort noise.
A comfort noise is a noise synthesized to fill phases of silence on the receiver's side. The comfort noise serves to foster a subjective impression of a connection that continues to exist without requiring the data transmission rate that is used for the purpose of transmitting speech signals. In other words, less energy is expended for the sender to encode the noise than to encode the speech data. To synthesize—i.e., decode—the comfort noise in a manner still perceived by the receiver as realistic, data are transmitted at a far lower bit rate. The data transmitted in the process are also referred to within the field as SID (Silence Insertion Descriptor).
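As a rough illustration, comfort noise can be synthesized from very little information. The sketch below assumes that the SID data carry only a noise level in dB and that white Gaussian noise is an acceptable stand-in; real codecs also shape the spectrum. The function name and parameters are hypothetical.

```python
import math
import random


def synthesize_comfort_noise(level_db, n_samples, seed=0):
    """Generate white noise scaled to a target RMS level given in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    rms = math.sqrt(sum(x * x for x in noise) / n_samples)
    target_rms = 10.0 ** (level_db / 20.0)
    gain = target_rms / rms
    return [gain * x for x in noise]


# One 160-sample frame of noise at an assumed level of -50 dB.
frame = synthesize_comfort_noise(-50.0, 160)
```

A single level value per update is far less data than a full speech frame, which is the point of the SID approach described above.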
In the current state of the art, problems exist with the method for discontinuous transmission using wideband speech codecs, such as ITU-T G.729.1, G.722.2 or 3GPP AMR-WB, for example. The speech codecs referred to as scalable wideband typically support different data transmission rates in a wideband range of 50 to 7000 Hz.
Possible bit rates for encoding speech information are, for instance, 8, 12, 14, 16, . . . , 32 kbit/s, which are used in Standard G.729.1, for example. The bit rates of 8 and 12 kbit/s are applied in narrowband signals (50 Hz to 4 kHz). Bit rates of more than 12 kbit/s are applied to the upper spectrum of 4 to 7 kHz.
A change between the aforementioned bit rates is possible during a transmission. A sudden change between a wideband and a narrowband bit rate is known to have a disturbing effect on a human listener. Such a transition takes place, for instance, as a result of a bitstream truncation, which can be caused by a transfer network between the sender and receiver, for example when additional connections are established or when the transfer network is congested. This truncation leads to a change in the bit rate and ultimately to a transition from wideband to narrowband transfer of the speech signal.
If the discontinuous transmission (DTX) method is used in the encoding method, the data transmission rate for transmitting the respective data frame can be reduced. The DTX method is used precisely when a frame is characterized as a speech pause. It reduces the data transmission rate for two reasons. First, the encoder does not have to send every inactive frame to the decoder. Second, a sent SID frame or inactive frame uses far fewer bits than a speech data frame.
Such a method requires voice activity detection (VAD) on the encoder side. The voice activity detector informs the encoder whether the frame of current samples to be encoded contains a speech signal or a speech pause with background noise. This characterization affects the actions of the encoder, which ascertains the perceptual characteristics of an inactive speech frame. Such perceptual characteristics include, for instance, the energy as well as spectral and temporal characteristics.
The encoder sends a specially identified frame, an SID (Silence Insertion Descriptor) frame, to the decoder. The decoder synthesizes a comfort noise based on the information contained in the SID frame. From the SID frame, the decoder can also determine whether the noise information it contains is narrowband or wideband.
A change in the bit rate (bit rate switching) between narrowband and wideband information is a typical scenario for every scalable wideband speech codec. Handling a bit rate switch during a normal speech phase, i.e., in the absence of speech pauses, is amply described in the literature, but handling one during entry into a DTX phase is not yet known. Therefore, an urgent need exists for a method for bit rate switching during a DTX phase and/or during entry into a DTX phase that responds optimally to a switch between a narrowband and a wideband bit rate before or during the transition into the DTX phase.
During a speech pause, a truncation of the bit rate is unlikely, because the bitstream of an SID frame already needs fewer bits than an active speech data frame in “normal” codec operation, i.e., codec operation during a phase consisting exclusively of speech.
This leads to a possible scenario in which the bit rate is changed during an active speaking phase but remains in a wideband mode in speech pauses, i.e., during the DTX phase. This can be very disturbing to the human listener on the decoder side, because in this case the active speech frames are decoded in narrowband while the background noise in the speech pauses is rendered in wideband.
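The effect of a truncation on the bandwidth mode can be sketched as follows. The sketch uses only the bit rates named explicitly above (the intermediate rates elided in the text are deliberately left out) and the stated rule that rates up to 12 kbit/s are narrowband. The function names and the notion of a single channel capacity figure are assumptions.

```python
# Bit rates in kbit/s named explicitly in the text; elided rates are omitted.
RATES_KBPS = [8, 12, 14, 16, 32]


def bandwidth_mode(rate_kbps):
    """Rates up to 12 kbit/s carry narrowband, higher rates add wideband."""
    return "narrowband" if rate_kbps <= 12 else "wideband"


def truncate(rate_kbps, capacity_kbps):
    """Return the highest listed rate that fits both the sent rate and the channel."""
    limit = min(rate_kbps, capacity_kbps)
    candidates = [r for r in RATES_KBPS if r <= limit]
    if not candidates:
        raise ValueError("channel capacity is below the lowest bit rate")
    return max(candidates)


# A 32 kbit/s wideband frame squeezed through a 13 kbit/s channel
# arrives as a 12 kbit/s narrowband frame.
received = truncate(32, 13)
print(received, bandwidth_mode(received))
```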
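The two savings can be combined in a back-of-the-envelope calculation. All numbers in the sketch below are assumptions for illustration only: a 20 ms frame, a 40-bit SID frame, and one SID frame sent per eight inactive frames; none of these values is taken from the text.

```python
def average_bit_rate_kbps(active_bits, sid_bits, activity, sid_interval,
                          frame_ms=20.0):
    """Average bit rate with DTX.

    active_bits:  bits per active speech frame
    sid_bits:     bits per SID frame
    activity:     fraction of frames that contain speech (0..1)
    sid_interval: one SID frame is sent per this many inactive frames
    """
    inactive = 1.0 - activity
    bits_per_frame = activity * active_bits + inactive * sid_bits / sid_interval
    # Bits per millisecond equal kbit/s.
    return bits_per_frame / frame_ms


# 640-bit speech frames (32 kbit/s at 20 ms) with 50 % speech activity.
print(average_bit_rate_kbps(640, 40, 0.5, 8))
```

Under these assumed values the average rate drops to roughly half of the continuous rate, since the inactive half of the call costs almost nothing.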
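How a decoder might tell the two kinds of SID frame apart can be sketched with a hypothetical data structure. The layout assumed here, a mandatory narrowband parameter set plus an optional wideband extension, is an illustration only and is not the format of any particular standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SidFrame:
    """Hypothetical SID frame: narrowband part plus optional wideband part."""
    narrowband_params: Tuple[float, ...]
    wideband_params: Optional[Tuple[float, ...]] = None


def sid_bandwidth(frame):
    """Report which kind of noise information the SID frame carries."""
    return "wideband" if frame.wideband_params is not None else "narrowband"


nb_sid = SidFrame(narrowband_params=(-50.0,))
wb_sid = SidFrame(narrowband_params=(-50.0,), wideband_params=(-55.0,))
print(sid_bandwidth(nb_sid), sid_bandwidth(wb_sid))
```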
This is more likely to occur, for instance, in situations in which the speech data frame sent on the encoder end is truncated by the transmission network, while on the side of the transmission network, there is still sufficient capacity remaining for transmission of the wideband SID frame.
As yet, no method for switching the bit rate of the SID frame during a speech pause is known. The existing method for bitstream switching applies solely to normal codec operation during an active speaking phase.