The present invention relates to a method of and apparatus for processing at least one coded binary audio stream organized into frames. Such streams are obtained, on the one hand, by frequency-type coding algorithms that use psychoacoustic characteristics of the human ear to reduce throughput and, on the other hand, by quantization of the signals thus coded. The invention is particularly applicable when no bit allocation data implemented during the quantization is explicitly present in the audio streams considered.
One of the main problems to be resolved in processing coded audio streams is reducing the computing cost of such processing. Generally, such processing is implemented in the time domain, so it is necessary to convert the audio streams from the frequency domain to the time domain and then, after processing the time-domain streams, to convert them back from the time domain to the frequency domain. These conversions introduce algorithmic delays and greatly increase computing costs, which might be onerous.
In particular, in the case of teleconferencing, attempts have been made to reduce the overall communication delay and thus increase quality in terms of interactivity. The problems mentioned above are even more serious in the case of teleconferencing because of the high number of accesses that a multipoint control unit might provide.
For teleconferencing, audio streams can be coded using various kinds of standardized coding algorithms. Thus, the H.320 standard, specific to transmission on narrow band ISDN, specifies several coding algorithms (G.711, G.722, G.728). Likewise, standard H.323 also specifies several coding algorithms (G.723.1, G.729 and MPEG-1).
Moreover, in high-quality teleconferencing, standard G.722 specifies a coding algorithm that operates on a 7 kHz bandwidth, subdividing the spectrum into two subbands. ADPCM type coding is then performed for the signal in each band.
To address the problem of the complexity introduced by the banks of quadrature mirror filters at the multipoint control unit level, Appendix I of Standard G.722 specifies a direct recombination method based on subband signals. This method consists of performing ADPCM decoding of two samples from the subbands of each input frame of the multipoint control unit, summing all the input channels involved and finally performing ADPCM coding before building the output frame.
One solution suggested to reduce complexity is to restrict the number of decoders at the multipoint control unit level and thus combine the coded audio streams on only a part of the streams received. There are several strategies for determining the input channels to consider. For example, combination is done on the N′ signals with the strongest gains, where N′ is predefined and fixed, and where the gain is read directly from input code words. Another example is doing the combining only on the active streams, although the number of inputs considered is then variable.
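The first strategy can be illustrated by the following minimal sketch. It assumes that one gain value per input channel has already been read from the input code words; the function name `select_strongest` and the parameter `n_prime` are hypothetical and serve only for illustration.

```python
def select_strongest(gains, n_prime):
    """Return the indices of the n_prime channels with the largest gains.

    gains:   one gain per input channel, assumed to have been read
             directly from the input code words (hypothetical input).
    n_prime: predefined, fixed number of channels to combine.
    """
    # Rank channel indices by decreasing gain.
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    # Keep the n_prime strongest, reported in channel order.
    return sorted(ranked[:n_prime])


selected = select_strongest([0.2, 0.9, 0.5, 0.1], n_prime=2)
```

Only the selected channels would then be decoded and combined, which bounds the number of decoders regardless of the number of connected terminals.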
It is to be noted that these approaches do not solve the delay reduction problem.
The purpose of this invention is to provide a new and improved method of and apparatus for processing at least one coded binary audio stream making it possible to solve the problems mentioned above.
Such a process can be used to transpose an audio stream coded at a first throughput into another stream at a second throughput. It can also be used to combine several coded audio streams, for example, in an audio teleconferencing system.
A possible application for the process of this invention involves teleconferencing, mainly, in the case of a centralized communication architecture based on a multipoint control unit (MCU) which plays, among other things, the role of an audio bridge that combines (or mixes) audio streams then routes them to the terminals involved.
It will be noted, however, that the method and apparatus of this invention can be applied to a teleconferencing system whose architecture is of the mesh type, i.e., when terminals are point-to-point linked.
Other applications might be envisaged, particularly in other multimedia contexts. This is the case, for example, with accessing database servers containing audio objects to construct virtual scenes.
Sound assembly and editing, which involves manipulating one or more compressed binary streams to produce a new one, is another area in which this invention can be applied.
Another application for this invention is transposing a stream of audio signals coded at a first throughput into another stream at a second throughput. Such an application is interesting when there is transmission through different heterogeneous networks where the throughput must be adapted to the bandwidth provided by the transmission environment used. This is the case for networks where service quality is not guaranteed (or not reliable) or where allocation of the bandwidth depends on traffic conditions. A typical example is the passage from an Intranet environment (Ethernet LAN at 10 Mbits/s, for example) where the bandwidth limitation is less severe, to a more saturated network (Internet). The new H.323 teleconferencing standard allowing interoperability among terminals on different kinds of networks (LAN for which QoS is not guaranteed, NISDN, BISDN, GSTN, . . . ) is another application area. Another interesting case is when audio servers are accessed (audio on demand, for example). Audio data are often stored in coded form but with a sufficiently low compression rate to maintain high quality, since transmission over a network might need another reduction in throughput.
The invention thus concerns a method of and apparatus for processing at least one coded binary audio stream organized as frames formed from digital audio signals which were coded by first converting them from the time domain to the frequency domain in order to calculate transform coefficients then quantizing and coding these transform coefficients based on a set of quantizers determined from a set of selection parameters that are used to select said quantizers, said selection parameters being also present in the frames.
Said method comprises: 1) a step of recovering the transform coefficients, which comprises a decoding step and a dequantizing step for decoding and then dequantizing the frames based on a set of quantizers as determined from said selection parameters included in said frames of at least said coded binary audio stream, 2) a step of processing the transform coefficients thus recovered in the frequency domain and, 3) a step of supplying the processed frames to a subsequent utilization step.
According to a first implementation mode, the subsequent utilization step, called recoding step, partially recodes the frames thus processed in a step involving requantization and then recoding of the thus-processed transform coefficients.
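The sequence of recovering, processing and recoding can be sketched as follows. This is only an illustration under simplifying assumptions: each band is taken to use a uniform scalar quantizer whose step size plays the role of the selection parameter, and a frame is represented as a dictionary with hypothetical keys `steps` and `indices`. The actual quantizers and frame layout are not specified here.

```python
def recover_coefficients(frame):
    """Step 1: dequantize each band with the quantizer selected by its parameter."""
    return [[idx * step for idx in band]
            for band, step in zip(frame["indices"], frame["steps"])]


def process(coeffs, gain):
    """Step 2: example frequency-domain processing (here a simple gain)."""
    return [[gain * c for c in band] for band in coeffs]


def recode(coeffs, steps):
    """Recoding step: requantize the processed coefficients with new step sizes."""
    return {
        "steps": list(steps),
        "indices": [[round(c / s) for c in band]
                    for band, s in zip(coeffs, steps)],
    }


# Hypothetical input frame: two bands, two coefficients per band.
frame_in = {"steps": [0.5, 2.0], "indices": [[2, -4], [1, 3]]}

coeffs = recover_coefficients(frame_in)      # no conversion to the time domain
processed = process(coeffs, gain=0.5)
frame_out = recode(processed, steps=[0.5, 1.0])
```

Because the coefficients are never converted back to the time domain, the inverse and forward transforms, and the delay they would add, are avoided in this mode.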
According to another characteristic of the invention, the processing step 2) involves summing the transform coefficients produced by the recovering step 1) from the different audio streams and said recoding step involves requantizing, and then recoding the summed transform coefficients.
The process described can be performed in processing stages of a multi-terminal teleconferencing system. In such a case, the processing step 2) involves summing the transform coefficients produced by the recovering step 1) from the different audio streams, and said recoding step involves, for a given terminal, subtracting the transform coefficients of said terminal from the summed transform coefficients, and requantizing and then recoding the resulting transform coefficients.
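The summation and the per-terminal subtraction can be sketched as follows, assuming that the transform coefficients of each terminal have already been recovered as a flat list of equal length. The function name `mix_for_terminals` is hypothetical.

```python
def mix_for_terminals(coeffs_per_terminal):
    """Return, for each terminal, the sum of all the other terminals' coefficients.

    coeffs_per_terminal: one list of recovered transform coefficients per
    terminal, all of the same length (hypothetical representation).
    """
    n = len(coeffs_per_terminal[0])
    # Sum computed once over all terminals.
    total = [sum(t[k] for t in coeffs_per_terminal) for k in range(n)]
    # Each terminal receives the total minus its own contribution.
    return [[total[k] - own[k] for k in range(n)] for own in coeffs_per_terminal]


inputs = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
outputs = mix_for_terminals(inputs)
```

Computing the total once and subtracting each terminal's own coefficients requires a number of additions proportional to the number of terminals, instead of forming a separate sum for every output.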
According to another implementation mode of the invention, the subsequent utilization step is a frequency domain to time domain conversion step for recovering the audio signal. Such a conversion process is performed, for example, in a multi-terminal audioconferencing system. The processing step involves summing the transform coefficients produced by the partial decoding of the frame streams coming from said terminals.
According to another characteristic of the invention, the values of the selection parameters of a set of quantizers are subjected to the processing step.
When the selection parameters of the set of quantizers contained in the audio frames of the stream or of each stream represent energy values of audio signals in predetermined frequency bands (the set of these values is called the spectral envelope), said processing step includes, for example, summing the transform coefficients respectively produced by the recovering step of the different frame streams and supplying to the recoding step the result of said summation. The total energy in each frequency band is then determined by summing the energies of the frames and providing, at the recoding stage, the result of the summation.
When implemented in a multi-terminal audioconferencing system, the processing step involves (1) summing the transform coefficients produced by the partial decoding of each of the frame streams respectively coming from the terminals, (2) supplying to the recoding step associated with a terminal the result of the summing, (3) subtracting from this sum the transform coefficients produced by the partial decoding of the frame stream coming from said terminal, (4) determining the total energy in each frequency band by summing the energies of the frames coming from the terminals, and (5) supplying to the recoding step associated with a terminal the result of the summation from which the energy indication derived from the frame coming from said terminal is subtracted.
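The handling of the spectral envelope can be sketched in the same way, assuming one energy value per frequency band and per terminal. The function name `envelope_for_terminals` is hypothetical, and the clamping of the result to zero is an added safeguard against rounding, not something stated above.

```python
def envelope_for_terminals(energies_per_terminal):
    """Return, for each terminal, the per-band total energy minus its own energy.

    energies_per_terminal: one list of band energies per terminal
    (hypothetical representation of the spectral envelope).
    """
    bands = len(energies_per_terminal[0])
    # Total energy in each band over all terminals.
    total = [sum(t[b] for t in energies_per_terminal) for b in range(bands)]
    # Assumption: clamp at zero so rounding never yields a negative energy.
    return [[max(total[b] - own[b], 0.0) for b in range(bands)]
            for own in energies_per_terminal]


envelopes = envelope_for_terminals([[4.0, 1.0], [2.0, 3.0]])
```

The resulting energies would then serve as the selection parameters of the quantizers used by the recoding step associated with each terminal.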
According to another characteristic of the invention, in which the audio frames of the stream or of each stream contain information about the voicing of the corresponding audio signal, the processing step then determines voicing information for the audio signal resulting from the processing step. If all the frames of all the streams have the same voicing state, the processing step takes this voicing state as the voicing state of the audio signal resulting from the processing step. If the frames of the streams do not all have the same voicing state, the processing step determines the total energy of the set of audio signals of the voiced frames and the total energy of the set of audio signals of the unvoiced frames, and takes the voicing state of the set with the greatest energy as the voicing state of the audio signal resulting from the processing step.
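This decision rule can be sketched as follows, assuming that each input frame provides a voicing flag and an energy value. The function name `mixed_voicing` is hypothetical, and the case of exactly equal energies, which is not addressed above, is arbitrarily resolved here as unvoiced.

```python
def mixed_voicing(frames):
    """Return True if the mixed signal is to be declared voiced.

    frames: list of (voiced, energy) pairs, one per input frame
    (hypothetical representation).
    """
    states = {voiced for voiced, _ in frames}
    if len(states) == 1:
        # All frames agree: keep the common voicing state.
        return states.pop()
    voiced_energy = sum(e for voiced, e in frames if voiced)
    unvoiced_energy = sum(e for voiced, e in frames if not voiced)
    # Assumption: a tie is resolved as unvoiced.
    return voiced_energy > unvoiced_energy


state = mixed_voicing([(True, 5.0), (False, 3.0), (False, 1.0)])
```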
When the audio frames of the stream or of each stream contain information about the tone of the corresponding audio signal, the processing determines if all the frames are of the same kind. In such a case, information about the tone of the audio signal resulting from the processing is indicated by the state of the signals of the frames.
According to another characteristic of the invention, the frame with the greatest energy in a given band is searched for among all the frames to be processed. If, in said band, the coefficients of the input frames other than the one with the greatest energy are masked by the masking threshold of that frame, the coefficients of the output frame in said band are made equal to the coefficients of that frame. The energy of the output frame in the band is, for example, made equal to the greatest energy among the input frames in said band.
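A minimal sketch of this masking shortcut for a single band is given below. How the masking threshold is computed is not specified here, so it is taken as an input; comparing the squared value of each coefficient with the threshold is likewise an assumption, as is the fallback to summation when the masking condition is not met. The function name `mix_band` is hypothetical.

```python
def mix_band(bands, thresholds):
    """Mix one frequency band of several input frames.

    bands:      list of (energy, coeffs) pairs, one per input frame.
    thresholds: masking threshold (as a power) of each frame in this band;
                hypothetical input, its computation is not shown.
    """
    # Frame with the greatest energy in this band.
    dominant = max(range(len(bands)), key=lambda i: bands[i][0])
    limit = thresholds[dominant]
    # Assumption: a coefficient is masked when its power is below the threshold.
    masked = all(c * c < limit
                 for i, (_, coeffs) in enumerate(bands) if i != dominant
                 for c in coeffs)
    if masked:
        # Copy the dominant frame's energy and coefficients unchanged.
        energy, coeffs = bands[dominant]
        return energy, list(coeffs)
    # Otherwise fall back to summing coefficients and energies.
    n = len(bands[0][1])
    summed = [sum(coeffs[k] for _, coeffs in bands) for k in range(n)]
    return sum(e for e, _ in bands), summed


result = mix_band([(10.0, [3.0, 1.0]), (0.02, [0.1, -0.1])], [0.05, 0.001])
```

When the shortcut applies, the band of the dominant frame can be passed through without summation.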
According to another characteristic of the invention, when the requantization step is a vector quantization step using embedded dictionaries, the codeword of an output band is chosen equal to the codeword of the corresponding input band, if the dictionary related to the corresponding input band is included in the dictionary selected for the output band. In the opposite case, i.e., when the dictionary selected for the output band is included in the dictionary related to the input band, the codeword for an output band is still chosen equal to the codeword of the corresponding input band, if the quantized vector for the output band belongs also to the dictionary related to the input band, else the quantized vector related to the corresponding input band is dequantized and the dequantized vector is requantized by using the dictionary selected for the output band.
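The codeword reuse rule can be illustrated with the following sketch. It assumes a toy family of embedded dictionaries in which dictionary j consists of the first `SIZES[j]` vectors of a single ordered codebook, so that a codeword (an index) designates the same vector in every dictionary that contains it. This shared indexing, the codebook contents and all names are hypothetical.

```python
# Toy ordered codebook; dictionary j is CODEBOOK[:SIZES[j]], so each
# dictionary is included in the next one (embedded dictionaries).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
SIZES = [2, 4, 6]


def requantize_codeword(codeword, in_level, out_level):
    """Map an input-band codeword to an output-band codeword."""
    out_size = SIZES[out_level]
    # Input dictionary included in the output dictionary, or the quantized
    # vector also belongs to the smaller output dictionary: reuse the codeword.
    if in_level <= out_level or codeword < out_size:
        return codeword
    # Otherwise dequantize, then requantize by nearest-neighbour search
    # restricted to the output dictionary.
    vector = CODEBOOK[codeword]
    return min(range(out_size),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(CODEBOOK[i], vector)))


reused = requantize_codeword(3, in_level=1, out_level=2)
searched = requantize_codeword(4, in_level=2, out_level=1)
```

In the first two cases no dequantization or search is needed; the nearest-neighbour search is performed only when the input vector lies outside the output dictionary.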
For example, the requantization step is a vector quantization with embedded dictionaries, the dictionaries being composed of a union of permutation codes. Then, if the corresponding input dictionary for the band is included in the selected output dictionary, or in the opposite case where the output dictionary is included in the input dictionary but the quantized vector, an element of the input dictionary, is also an element of the output dictionary, the code word for the output band is set equal to the code word for the input band. Otherwise, reverse quantization followed by requantization in the output dictionary is performed. The requantization procedure is advantageously sped up in that the closest neighbor of the leader of a vector of the input dictionary is a leader of the output dictionary.
The above-mentioned characteristics of the invention, and others, will become clearer upon reading the following description of preferred embodiments of the invention as related to the attached drawings.