The present invention relates generally to systems that employ the transmission of compressed digital audio and, more particularly, to systems that identify and select the loudest speaker from among several incoming bit streams. The invention is particularly suitable, for example, for use in connection with multimedia teleconferencing systems in which speech signals emanating from each of multiple speakers are compressed by linear predictive coding.
In modern telecommunications systems, audio and video information is frequently transmitted from one location to another in the form of compressed digital data representative of analog signals. Compressed digital data may be carried in binary groups referred to as packets, where each packet typically includes bits representing control information, bits comprising the data being transmitted and bits used for error detection and correction. In order to ensure that the receiving end of the system properly interprets the data provided by the transmitting end, the data must generally comply with established industry standards.
In multimedia conferencing systems, audio and video information may simultaneously be transmitted according to standard protocols under which a portion of the transmission signal represents audio information, and a portion of the signal represents video information. To generate the audio or voice portion of the transmission signal from analog speech, an analog speech signal is typically sampled and subjected to a voice coder, or "vocoder," which converts the sampled signal into a compressed digital audio signal. Often, such vocoders take the form of code excited linear predictive, or "CELP," models, which are complex algorithms that typically use linear prediction and pitch prediction to model speech signals. Compressed signals generated by CELP vocoders include information that accurately models the vocal tract that created the underlying speech signal. In this way, once a CELP-coded signal is decompressed, a human ear may more fully and easily appreciate the associated speech signal.
While CELP vocoders range in degree of efficiency, one of the most efficient is that defined by the G.723.1 standard, as published by the International Telecommunication Union, the entirety of which is incorporated herein by reference. Generally speaking, G.723.1 works by partitioning a 16-bit PCM representation of an original analog speech signal into consecutive segments of 30 milliseconds in length and then encoding each of these segments as frames of 240 samples. Each G.723.1 frame consists of either 20 or 24 bytes, depending on the selected transmission rate. By design, G.723.1 may operate at a transmission rate of either 5.3 kilobits per second or 6.3 kilobits per second. A transmission rate of 5.3 kilobits per second would permit 20 bytes to represent each 30 millisecond segment, whereas a transmission rate of 6.3 kilobits per second would permit 24 bytes to represent each 30 millisecond segment.
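The frame sizes quoted above follow from simple arithmetic on the bit rate and the frame duration. The following minimal sketch reproduces that arithmetic; the function names are hypothetical and chosen only for illustration, and the 8 kHz sampling rate it prints is not stated in the text but is implied by 240 samples spanning 30 milliseconds.

```python
FRAME_MS = 30            # frame duration stated in the text
SAMPLES_PER_FRAME = 240  # samples per frame stated in the text


def frame_bytes(bit_rate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Whole bytes needed to carry one frame at the given bit rate."""
    bits = bit_rate_bps * frame_ms // 1000  # bits available per frame
    return (bits + 7) // 8                  # round up to whole bytes


def implied_sample_rate_hz(samples: int = SAMPLES_PER_FRAME,
                           frame_ms: int = FRAME_MS) -> int:
    """Sampling rate implied by the frame length and sample count."""
    return samples * 1000 // frame_ms


print(frame_bytes(5300))         # 20
print(frame_bytes(6300))         # 24
print(implied_sample_rate_hz())  # 8000
```

At 5.3 kilobits per second a 30 millisecond frame carries 159 bits, and at 6.3 kilobits per second it carries 189 bits, which round up to the 20 and 24 bytes given above.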
Each G.723.1 frame is further divided into four sub-frames of 60 samples each. For every sub-frame, a 10th order linear prediction coder (LPC) filter is computed using the input signal. The LPC coefficients are used to create line spectrum pairs (LSP), also referred to as LSP vectors, which describe how the originating vocal tract is configured and which therefore define important aspects of the underlying speech signal. In a G.723.1 bit stream, each frame is dependent on the preceding frame, because the preceding frame contains information used to predict LSP vectors and pitch information for the current frame.
For every two G.723.1 sub-frames (i.e., every 120 samples), an open loop pitch period (OLP) is computed using the weighted speech signal. This estimated pitch period is used in combination with other factors to establish a signal for transmission to the G.723.1 decoder. Additionally, G.723.1 approximates the non-periodic component of the excitation associated with the underlying signal. For the high bit rate (6.3 kilobits per second), multi-pulse maximum likelihood quantization (MP-MLQ) excitation is used, and for the low bit rate (5.3 kilobits per second), an algebraic codebook excitation (ACELP) is used.
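The framing described above (four sub-frames of 60 samples, with an open loop pitch estimate for every 120 samples) can be pictured as two ways of slicing the same 240-sample frame. The sketch below shows only that partitioning; it does not compute LPC coefficients or pitch, and the helper name `split_blocks` is hypothetical.

```python
from typing import List, Sequence

SAMPLES_PER_FRAME = 240  # samples per frame, as stated in the text
SUBFRAME_LEN = 60        # samples per sub-frame (LPC analysis unit)
OLP_BLOCK_LEN = 120      # two sub-frames (open loop pitch unit)


def split_blocks(frame: Sequence[int], block_len: int) -> List[List[int]]:
    """Split one 240-sample frame into consecutive blocks of block_len."""
    if len(frame) != SAMPLES_PER_FRAME:
        raise ValueError("expected a frame of 240 samples")
    return [list(frame[i:i + block_len])
            for i in range(0, len(frame), block_len)]


frame = list(range(SAMPLES_PER_FRAME))          # placeholder sample values
subframes = split_blocks(frame, SUBFRAME_LEN)    # 4 blocks of 60 samples
olp_blocks = split_blocks(frame, OLP_BLOCK_LEN)  # 2 blocks of 120 samples
print(len(subframes), len(olp_blocks))           # 4 2
```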
Like other voice coders, G.723.1 has many uses. As an example, G.723.1 is used as the audio-coder portion of two of the more common multimedia packet protocols, H.323 and H.324. The H.323 protocol defines packet standards for multimedia communications over local area networks (LANs). The H.324 protocol defines packet standards for teleconference communications over analog POTS (plain old telephone service) lines. H.323 and H.324 are frequently used to compress audio and video information transmitted in multimedia video conferencing systems. However, these packet protocols may equally be used in other contexts, such as Internet-based telephony. For audio-only applications, the video portion of the coding may be excluded, while maintaining the work of the audio coder such as G.723.1.
Generally speaking, teleconferencing involves multiple speakers and therefore requires a mechanism to distribute to each speaker one or more signals arising from the other speakers. For this purpose, an audio bridge is typically provided. In its most trivial form, an audio bridge may receive signals from each speaker and forward those signals to each of the other speakers. For instance, given speakers A, B and C each generating G.723.1 bit streams, the audio bridge may send the streams from A and B to C, the streams from A and C to B, and the streams from B and C to A. While this system may work well in the presence of few conference participants, it will be appreciated that the system would require increased bandwidth as the number of participants increases.
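The bandwidth growth of this trivial forwarding bridge can be made concrete by counting streams. In the sketch below, which assumes every participant sends exactly one stream and receives the streams of all others, the function name `forwarded_streams` is hypothetical.

```python
from typing import Tuple


def forwarded_streams(participants: int) -> Tuple[int, int]:
    """Return (streams received by each participant, total streams sent)."""
    if participants < 2:
        return (0, 0)  # nothing to forward with fewer than two speakers
    per_participant = participants - 1
    return (per_participant, participants * per_participant)


print(forwarded_streams(3))   # (2, 6)
print(forwarded_streams(10))  # (9, 90)
```

With three speakers each participant receives two streams, as in the A, B and C example; with ten speakers each participant receives nine, and the bridge sends ninety in total.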
In a more advanced form, an audio bridge may decode each of the incoming G.723.1 bit streams and then, based on the underlying PCM signals, re-encode an output G.723.1 bit stream to distribute to each of the conference participants. For example, the audio bridge may decode all of the incoming bit streams and mix together the underlying PCM signals, such as with a standard audio mixer. The audio bridge may then re-encode the composite signal and send the re-encoded signal to all of the participants. As will be appreciated, however, this task may become computationally expensive, especially as the number of conference participants increases. Therefore, as the number of likely participants increases, this option becomes less desirable.
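The mixing step may be illustrated as follows. The text does not specify how a standard audio mixer combines signals, so this sketch assumes one common approach, sample-by-sample summation clipped to the 16-bit range; it also assumes the frames have already been decoded to PCM, and the name `mix_pcm` is hypothetical.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit PCM sample


def mix_pcm(frames: Sequence[Sequence[int]]) -> List[int]:
    """Sum equal-length PCM frames sample by sample, clipping to 16 bits."""
    if not frames:
        return []
    length = len(frames[0])
    if any(len(f) != length for f in frames):
        raise ValueError("all frames must have the same length")
    # zip(*frames) yields, for each sample index, one sample per speaker.
    return [max(INT16_MIN, min(INT16_MAX, sum(samples)))
            for samples in zip(*frames)]


print(mix_pcm([[100, -200], [50, 50]]))  # [150, -150]
print(mix_pcm([[30000], [10000]]))       # [32767]
```

The per-sample work grows with the number of decoded streams, in addition to the cost of decoding every stream and re-encoding the result.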
As an alternative, the audio bridges in existing teleconferencing systems customarily select only the loudest incoming signal, or group of loudest incoming signals, to send to each of the conference participants. As an example, an audio bridge may decode all of the incoming bit streams and then measure the amplitudes of the PCM signals. Based on this measurement, the bridge may select, say, the top three loudest signals, mix those signals together and re-encode the composite signal into an outgoing G.723.1 bit stream for distribution to all of the participants.
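Selecting a group of loudest signals might be sketched as below. The text says only that amplitudes are measured, so the use of peak absolute amplitude as the ranking measure is an assumption, as are the speaker identifiers and the name `loudest_speakers`.

```python
import heapq
from typing import Dict, List, Sequence


def loudest_speakers(decoded: Dict[str, Sequence[int]],
                     count: int = 3) -> List[str]:
    """Return up to `count` speaker ids ranked by peak absolute amplitude."""
    def peak(speaker_id: str) -> int:
        # Peak amplitude of the speaker's decoded frame; 0 if it is empty.
        return max((abs(s) for s in decoded[speaker_id]), default=0)

    return heapq.nlargest(count, decoded, key=peak)


decoded = {
    "A": [10, -500],
    "B": [3000, 1],
    "C": [0, 0],
    "D": [-700, 20],
}
print(loudest_speakers(decoded))  # ['B', 'D', 'A']
```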
Alternatively, as is most customary, the system may be configured to send only the speech signal of the loudest party to each of the participants. Distributing only the loudest speech signal beneficially maintains symmetric bandwidth and increases intelligibility. More specifically, by distributing only the loudest speech signal, the transmission lines carry signals of about equal bandwidth both to and from the participants. Additionally, each participant will generally hear only the loudest of the speech signals and will therefore be able to more readily ascertain what is being conveyed.
To perform this function, a typical audio bridge decodes each G.723.1 stream of data received from each speaker. The audio bridge then analyzes the underlying PCM signal in order to determine an energy level of the signal. By next comparing the estimated energy levels of the respective analog signals, the bridge may select the loudest speaker. The bridge then re-encodes the selected loudest speech signal using G.723.1 and sends the encoded signal to all of the participants. As different speakers in the conference become the loudest speaker, the audio bridge simply switches to select a different underlying PCM signal to encode as the current G.723.1 output stream.
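The selection and switching behavior of such a bridge can be sketched per frame period as follows. The text does not say how the energy level is determined, so mean squared sample value is assumed here; the decoding and re-encoding steps are omitted, and the names `frame_energy` and `select_loudest` are hypothetical.

```python
from typing import Dict, Optional, Sequence


def frame_energy(samples: Sequence[int]) -> float:
    """Mean of the squared sample values; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def select_loudest(decoded: Dict[str, Sequence[int]]) -> Optional[str]:
    """Return the id of the speaker whose decoded frame has most energy."""
    if not decoded:
        return None
    return max(decoded, key=lambda sid: frame_energy(decoded[sid]))


# Two successive frame periods: the selection switches from B to A.
period_1 = {"A": [100, -100], "B": [1000, -1000], "C": [10, 10]}
period_2 = {"A": [2000, -2000], "B": [50, -50], "C": [10, 10]}
print(select_loudest(period_1))  # B
print(select_loudest(period_2))  # A
```

Note that every incoming stream must be decoded before any comparison can be made, and the selected signal must then be encoded again.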
Unfortunately, G.723.1 is a relatively complex and costly compression algorithm. Multiple operations are required to decode each frame of G.723.1 data into the underlying 30 milliseconds of audio. Further, as with any lossy compression algorithm, every useful compression/decompression cycle will always result in some loss of signal quality. This is particularly the case with respect to compressed speech signals, because complete speech signals carry complex information regarding voice patterns. Therefore, each time an existing audio bridge decodes (or decompresses) a G.723.1 bit stream and re-encodes (or re-compresses) an outgoing G.723.1 bit stream, some loss of signal quality is likely to result.
In addition to G.723.1, other useful CELP coders are known to those skilled in the art. These CELP coders presently include the G.728 and G.729 protocols, although numerous other vocoders may be known or may be developed in the future. G.728 and G.729 are likely to suffer from the same deficiencies as described above with respect to G.723.1. In particular, like G.723.1, these protocols also involve computationally expensive compression algorithms and may result in degraded audio quality upon successive encode-decode cycles.
In view of these deficiencies in the existing art, there is a growing need for an improved system of selecting the loudest of several encoded audio signals represented by G.723.1 or other similar encoded bit streams.