Many audio signals may be classified as having either more speech-like characteristics or more generic audio characteristics typical of music, tones, background noise, reverberant speech, etc. Codecs based on source-filter models that are suitable for processing speech signals do not process generic audio signals as effectively. Such codecs include Linear Predictive Coding (LPC) codecs like Code Excited Linear Prediction (CELP) coders. Speech coders tend to process speech signals well at low bit rates. Conversely, generic audio processing systems such as frequency domain transform codecs do not process speech signals very well. It is well known to provide a classifier or discriminator to determine, on a frame-by-frame basis, whether an audio signal is more or less speech-like and to direct the signal to either a speech codec or a generic audio codec based on the classification. An audio signal processor capable of processing different signal types is sometimes referred to as a hybrid core codec.
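The frame-by-frame routing described above may be sketched as follows. This is only an illustration under stated assumptions: the per-frame decisions are supplied as ready-made booleans, the classifier itself is not modeled, and the names `route_frames` and `speech_to_generic_transitions` are hypothetical.

```python
from typing import List


def route_frames(is_speech_like: List[bool]) -> List[str]:
    """Map each per-frame classifier decision to the codec that handles the frame."""
    return ["speech" if decision else "generic" for decision in is_speech_like]


def speech_to_generic_transitions(routes: List[str]) -> List[int]:
    """Return the indices of generic audio frames that directly follow a speech frame."""
    return [
        i
        for i in range(1, len(routes))
        if routes[i - 1] == "speech" and routes[i] == "generic"
    ]


# Illustrative decisions only: two speech-like frames followed by two generic frames.
decisions = [True, True, False, False]
routes = route_frames(decisions)
transitions = speech_to_generic_transitions(routes)
print(routes)
print(transitions)
```

The second helper merely locates the frames at which a hybrid core codec switches from the speech codec to the generic audio codec.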
However, transitioning between the processing of speech frames and generic audio frames using speech and generic audio codecs, respectively, is known to produce discontinuities in the form of audio gaps in the processed output signal. Such audio gaps are often perceptible at a user interface and are generally undesirable. Prior art FIG. 1 illustrates an audio gap produced between a processed speech frame and a processed generic audio frame in a sequence of output frames. FIG. 1 also illustrates, at 102, a sequence of input frames that may be classified as speech frames (m−2) and (m−1) followed by generic audio frames (m) and (m+1). The sample index n corresponds to the samples obtained at time n within the series of frames. For the purposes of this graph, a sample index of n=0 corresponds to the relative time at which the last sample of frame (m) is obtained. Here, frame (m) may be processed after 320 new samples have been accumulated, which are combined with 160 previously accumulated samples, for a total of 480 samples. In this example, the sampling frequency is 16 kHz and the corresponding frame size is 20 milliseconds, although many sampling rates and frame sizes are possible. The speech frames may be processed using Linear Predictive Coding (LPC) speech coding, wherein the LPC analysis windows are illustrated at 104. A processed speech frame (m−1) is illustrated at 106 and is preceded by a coded speech frame (m−2), which is not illustrated, corresponding to the input frame (m−2). FIG. 1 also illustrates, at 108, overlapping coded generic audio frames. The generic audio analysis/synthesis windows correspond to the amplitude envelope of the processed generic audio frame. The sequences of processed frames 106 and 108 are offset in time relative to the sequence of input frames 102 due to algorithmic processing delay, also referred to herein as look-ahead delay and overlap-add delay for the speech and generic audio frames, respectively.
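The sample counts in this example follow directly from the sampling frequency and the frame size. The short sketch below merely reproduces that arithmetic; the constant and function names are illustrative and are not taken from any codec implementation.

```python
SAMPLE_RATE_HZ = 16_000   # sampling frequency stated in the example
FRAME_MS = 20             # frame size stated in the example
PREVIOUS_SAMPLES = 160    # previously accumulated samples stated in the example


def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of new samples accumulated during one frame."""
    return sample_rate_hz * frame_ms // 1000


new_samples = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
total_samples = new_samples + PREVIOUS_SAMPLES
print(new_samples, total_samples)  # 320 480
```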
The overlapping portions of the coded generic audio frames (m) and (m+1) at 108 in FIG. 1 provide an additive effect on the corresponding sequential processed generic audio frames (m) and (m+1) at 110. However, the leading tail of the coded generic audio frame (m) at 108 does not overlap with a trailing tail of an adjacent generic audio frame since the preceding frame is a coded speech frame. Thus the leading portion of the corresponding processed generic audio frame (m) at 108 has reduced amplitude. The result of combining the sequence of coded speech and generic audio frames is an audio gap between the processed speech frame and the processed generic audio frame in the sequence of processed output frames, as shown in the composite output frames at 110.
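The origin of the gap may be illustrated numerically. The sketch below is a simplification and not the windowing of FIG. 1: it assumes a sine window applied at both analysis and synthesis (so that each frame's amplitude envelope is a squared sine), a 50 percent overlap, and a hypothetical hop size `HOP`; transform aliasing is ignored. It sums the envelopes of consecutive generic audio frames, where the first frame has no preceding generic audio frame to overlap with.

```python
import math

HOP = 8          # hypothetical number of new samples per frame
WIN = 2 * HOP    # window length under the assumed 50 percent overlap


def frame_envelope(n: int) -> float:
    """Squared sine window: the assumed analysis window times the synthesis window."""
    return math.sin(math.pi * (n + 0.5) / WIN) ** 2


def overlap_add_envelope(num_frames: int) -> list:
    """Sum the envelopes of num_frames generic audio frames spaced HOP samples apart.

    Frame 0 stands for the first generic audio frame after a speech frame,
    so nothing is added on top of its leading half.
    """
    total = [0.0] * (HOP * (num_frames + 1))
    for frame in range(num_frames):
        start = frame * HOP
        for n in range(WIN):
            total[start + n] += frame_envelope(n)
    return total


summed = overlap_add_envelope(3)
leading = summed[:HOP]          # leading half of the first generic audio frame
interior = summed[HOP:3 * HOP]  # region covered by two overlapping frames
print([round(v, 3) for v in leading])
print([round(v, 3) for v in interior])
```

Under these assumptions the interior envelope sums to unity, whereas the leading half of the first generic audio frame only ramps up from near zero, which corresponds to the reduced amplitude described above.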
U.S. Publication No. 2006/0173675 entitled “Switching Between Coding Schemes” (Nokia) discloses a hybrid coder that accommodates both speech and music by selecting, on a frame-by-frame basis, between an adaptive multi-rate wideband (AMR-WB) codec and a codec utilizing a modified discrete cosine transform (MDCT), for example, an MPEG 3 codec or an AAC codec, whichever is most appropriate. Nokia ameliorates the adverse effect of discontinuities that occur as a result of un-canceled aliasing error arising when switching from the AMR-WB codec to the MDCT-based codec by using a special MDCT analysis/synthesis window with a near perfect reconstruction property, which is characterized by minimization of aliasing error. The special MDCT analysis/synthesis window disclosed by Nokia comprises three constituent overlapping sinusoidal-based windows, H0(n), H1(n) and H2(n), that are applied to the first input music frame following a speech frame to provide an improved processed music frame. This method, however, may be subject to signal discontinuities that may arise from under-modeling of the associated spectral regions defined by H0(n), H1(n) and H2(n). That is, the limited number of bits that may be available needs to be distributed across the three regions, while still being required to produce a nearly perfect waveform match between the end of the previous speech frame and the beginning of region H0(n).
The various aspects, features and advantages of the invention will become more fully apparent to those having ordinary skill in the art upon careful consideration of the following Detailed Description thereof with the accompanying drawings described below. The drawings may have been simplified for clarity and are not necessarily drawn to scale.