The present invention relates to suppressing (or removing) delayed audio feedback effects (also known as echo) from a live broadcast or other transmission.
In many commonly occurring live broadcast scenarios, one or more audio signals originate from sources that are located geographically distant from a broadcast studio, and are combined with audio from other distant or in-studio sources to form the broadcast program. For example, television news programs often include segments where a field reporter provides live coverage from the local scene of a newsworthy event. Sometimes, the field reporter's commentary is interrupted or interspersed with questions from an in-studio anchor. As another example, many television and radio talk shows feature live debates between a host and various experts who are electronically conferenced from multiple separate and geographically remote studios. The term "broadcast" as used herein refers to both through-air or wireless transmissions and to transmissions distributed over cable and other on-wire communication networks.
In these scenarios, it is desirable and even necessary that the local "performer" monitor the actual program being broadcast from the remote studio to receive his or her "cue" to begin speaking, and to hear other parties speak during the program. However, a time delay is introduced as the audio signal of the performer's voice is transmitted to the remote studio (e.g., on a land line, radio, microwave or satellite path) for mixing into the broadcast program, and a further time delay is incurred before the broadcast program transmission arrives back at the performer's site. This time delay is due to the time for the signal to travel along the communications path, as well as delays introduced by various electronics equipment in the path (in particular, frame synchronizers, digital compressors, and other equipment). This delay produces an echo effect that can be very disconcerting and disruptive to performers (similar to the effect experienced by a singer in a large stadium), such that the performers may find it difficult (if not impossible) to speak while monitoring the broadcast program and are forced to remove or shut off their earphone to continue their live performance.
Echo also is a problem in other applications, such as distance learning and telephone conferencing. In some distance learning applications, for example, students may attend a lecture transmitted to multiple locales. Often, the audio of the lecture is a mixture of not only the lecturer's microphone, but also of microphones at each of the locales. This allows the students at each locale to freely pose questions, and also to hear questions posed by the students at the other locales. When the various microphone inputs are mixed at a central location (typically the lecturer's site), the students will hear a delayed echo of their own voice in the lecture's audio while posing their questions, due to the transmission and other equipment delays. Telephone conferencing among multiple locations experiences a similar echo problem.
One prior solution to the echo problem in live broadcasts is to transmit a "mix-minus" signal from the remote studio to each local site for monitoring by the performer. At the studio, the audio signal of the performer's voice is mixed in with other audio inputs from other sources to form the broadcast program. An inverse of the performer's audio signal also is mixed with the broadcast program at the studio to form the mix-minus signal, which effectively cancels the performer's voice so that the mix-minus signal contains only the contributions of all the other audio inputs except that of the performer. The performer can then monitor the mix-minus signal without experiencing disconcerting echo effects. An example of a broadcast system using such mix-minus signals is disclosed in Davis, "Mix-Minus Monitor System," U.S. Pat. No. 5,454,041 (1995).
A drawback to the mix-minus approach is that an additional separate transmission path for each local performer is needed to transmit their respective mix-minus signal from the studio to their respective local site. The added communications links to the local sites can add significantly to the costs of producing the live broadcast program. Further, the additional signals and communications links to the local sites add considerable complexity to setting up and running production of a live broadcast program, and can increase the chance for technical errors during the program's production.
Various echo suppression techniques also are known and commonly used in telephone communications, particularly with conference or speaker telephones. In a typical telephone conversation, it is expected that the audio content in each direction is different. Also, it is expected that if any transmit-to-receive leakage (i.e., the "echo") exists, then the level of the echo will be substantially less than the level of the original audio (typically about 15 dB of attenuation) and the echo will have a minimal delay (less than 250 milliseconds). Echo suppression devices that have been used for such telephones generally have relied on these conditions being present. These conditions, however, do not hold true in the above-described live broadcast, distance learning and conferencing situations. Specifically, the level of the performer's voice in the broadcast program typically is equal to or even greater than that from the performer's microphone, and the delay often is greater than 250 milliseconds (due, for example, to the use of frame synchronizers and digital compression). The signal sent from the performer's site to the remote studio and the program broadcast from the studio often have similar content, particularly when the performer is speaking.
A further drawback to typical echo suppression techniques is the speed at which the local microphone input can be correlated to echo in the return audio signal, particularly when the delay is unknown over a longer time interval (e.g., greater than 250 milliseconds).
The present invention provides the capability to suppress or remove echo of a local source audio signal from a remote return audio signal received at the local site, such as in the above-described live broadcast, distance learning and conferencing scenarios. The source audio signal and the return audio signal are digitally processed (e.g., in a digital signal processor running a correlation routine) to detect a time delay and level difference of any echo of the source audio signal contained in the return audio signal. An inverse of the source audio signal adjusted according to the time delay and level difference is then mixed with the return audio signal to suppress or cancel the echo from the return audio signal. The resulting echo-suppressed audio signal can then be mixed with the source audio signal (not delayed) and played on a monitoring device (e.g., a set of headphones or speakers) for comfortable listening, such as by a performer during a live broadcast.
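The detect-then-cancel idea described above can be illustrated with a minimal sketch. It assumes a brute-force time-domain cross-correlation and a single echo path; the function names `estimate_echo` and `suppress_echo`, the test signals, and the search range are hypothetical choices made for illustration and are not taken from this disclosure.

```python
import math
import random

def estimate_echo(source, ret, max_lag):
    """Return (lag, gain) of the best match of `source` inside `ret`.

    The lag is the shift with the largest-magnitude cross-correlation;
    the gain is that correlation normalised by the source energy.
    """
    energy = sum(s * s for s in source)
    best_lag, best_corr = 0, 0.0
    for lag in range(max_lag + 1):
        n = min(len(source), len(ret) - lag)
        corr = sum(source[i] * ret[i + lag] for i in range(n))
        if abs(corr) > abs(best_corr):
            best_lag, best_corr = lag, corr
    gain = best_corr / energy if energy else 0.0
    return best_lag, gain

def suppress_echo(source, ret, lag, gain):
    """Mix a delayed, scaled inverse of `source` into `ret`."""
    out = list(ret)
    for i, s in enumerate(source):
        j = i + lag
        if j < len(out):
            out[j] -= gain * s
    return out

# Synthetic example (assumed values): the return signal carries other
# program audio plus an echo of the source at 37 samples and gain 0.8.
random.seed(0)
source = [random.uniform(-1.0, 1.0) for _ in range(400)]
program = [0.1 * math.sin(0.05 * n) for n in range(400 + 64)]
true_lag, true_gain = 37, 0.8
ret = list(program)
for i, s in enumerate(source):
    ret[i + true_lag] += true_gain * s

lag, gain = estimate_echo(source, ret, max_lag=64)
cleaned = suppress_echo(source, ret, lag, gain)
```

In this sketch `cleaned` approximates the other program audio alone; a monitor feed would then add the undelayed `source` to it, as described above.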
According to one aspect of the invention, the echo suppression has a multiple state operating sequence that controls the audio signal sent to the monitoring device (e.g., headphones). The multiple state operating sequence accounts for a complex set of conditions, including that the source audio signal is not always "on the air" and that the correlation to detect time delay and level difference requires a finite amount of time to process. For example, a broadcast program audio signal may be sent to the monitoring device in an initial operating state when a live performer is "off the air." When the performer goes "on the air," the source audio signal without the return audio signal is played to the monitoring device during one or more states in which the correlation to the echo is sought. Then, the echo-suppressed audio signal is played to the monitoring device in states after the correlation to the echo is achieved. Finally, the echo-suppressed audio signal also preferably is played for an interval approximately equal to the time delay after the source audio signal again goes off the air.
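One way to picture such an operating sequence is as a small state machine. The sketch below is illustrative only: the four state names, the tick-based timing, and the fallback from the seeking state to the off-air state when activity stops before correlation is achieved are all assumptions, not details given in this summary.

```python
from enum import Enum, auto

class State(Enum):
    OFF_AIR = auto()      # performer silent: monitor the return (program) signal
    SEEKING = auto()      # performer active, echo not yet correlated
    SUPPRESSING = auto()  # echo correlated: monitor the echo-suppressed signal
    TAIL = auto()         # performer stopped: keep suppressing for about one delay

# Which signal accompanies the monitor feed in each state (labels are hypothetical).
MONITOR_FEED = {
    State.OFF_AIR: "return",
    State.SEEKING: "source only",
    State.SUPPRESSING: "echo-suppressed",
    State.TAIL: "echo-suppressed",
}

class MonitorSequencer:
    def __init__(self):
        self.state = State.OFF_AIR
        self.tail_left = 0

    def step(self, source_active, correlated, delay_ticks):
        """Advance one tick and return the feed label for the new state."""
        if self.state is State.OFF_AIR:
            if source_active:
                self.state = State.SEEKING
        elif self.state is State.SEEKING:
            if not source_active:
                self.state = State.OFF_AIR   # assumed fallback
            elif correlated:
                self.state = State.SUPPRESSING
        elif self.state is State.SUPPRESSING:
            if not source_active:
                self.state = State.TAIL
                self.tail_left = delay_ticks
        elif self.state is State.TAIL:
            if source_active:
                self.state = State.SUPPRESSING
            else:
                self.tail_left -= 1
                if self.tail_left <= 0:
                    self.state = State.OFF_AIR
        return MONITOR_FEED[self.state]

# Example trace with an assumed echo delay of two ticks.
seq = MonitorSequencer()
events = [(False, False), (True, False), (True, True),
          (False, True), (False, True), (False, True)]
feeds = [seq.step(active, corr, delay_ticks=2) for active, corr in events]
```

In the trace, the echo-suppressed feed persists for two ticks after the source goes inactive before the sequencer returns to the plain return signal.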
According to another aspect of the invention, the source audio signal is always sent to the monitoring device (i.e., the performer always hears the signal originating from their microphone) so as to avoid intervals where the performer is unable to hear his or her own voice. Depending on the operating state, two other signals may be added to the source audio signal at different times, which include the return audio signal and the echo suppressed audio signal (e.g., the broadcast program and the broadcast program with echo removed). Preferably, the volume of the added signal is ramped up when adding the signal during a state switch for smoother audio transitions.
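As a rough illustration of the ramped addition, the sketch below keeps the source signal at full level and fades the added signal in linearly. The linear ramp shape, the function name `ramped_mix`, and the ramp length are assumptions chosen for the example; the summary does not specify them.

```python
def ramped_mix(source, added, ramp_len):
    """Mix `added` into `source`, ramping the added signal's gain from
    near zero up to unity over the first `ramp_len` samples."""
    out = []
    for i, (s, a) in enumerate(zip(source, added)):
        gain = min(1.0, (i + 1) / ramp_len) if ramp_len > 0 else 1.0
        out.append(s + gain * a)
    return out

# With a silent source, the output shows the ramp applied to the added signal.
mixed = ramped_mix([0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0], ramp_len=4)
```

Here `mixed` rises in four equal steps to unity and then holds, while any source content would pass through unchanged throughout.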
According to a further aspect of the invention, a voice activated switch (VOX) or like measure of activity on the source audio signal initiates transitions between at least some of the operating states. For example, the VOX initiates a transition from an initial state where the return audio signal is sent to the monitoring device to one or more states where only the source audio signal is sent to the monitoring device and during which correlation to the echo takes place.
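A measure of activity of this kind could, for instance, be a frame-level detector. The following sketch is only one possible form: the RMS threshold, the frame length, and the hold ("hang") period that keeps the switch on across short pauses are assumptions, and `vox_gate` is a hypothetical name.

```python
def vox_gate(samples, frame_len, threshold, hang_frames):
    """Return one boolean per frame: True while the source is judged active.

    A frame is active when its RMS level reaches `threshold`; the gate
    then stays on for `hang_frames` further quiet frames.
    """
    active, hang, out = False, 0, []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms >= threshold:
            active, hang = True, hang_frames
        elif hang > 0:
            hang -= 1
        else:
            active = False
        out.append(active)
    return out

# One loud frame followed by silence, with a one-frame hold.
signal = [0.0] * 4 + [0.5] * 4 + [0.0] * 12
gate = vox_gate(signal, frame_len=4, threshold=0.1, hang_frames=1)
```

The resulting per-frame flags could drive the state transitions described above, with the hold period preventing a switch on every brief pause in speech.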
According to yet another aspect of the invention, the source audio signal is first correlated to the echo in the return audio signal at a reduced sample rate to define an approximate window (i.e., time interval) for a more exact correlation. This allows the amount of memory and processing needed for the correlation to be reduced, while allowing the correlation to be performed over a much larger time period.
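The two-stage search can be sketched as follows, under stated assumptions: the rate reduction is done by block averaging (a crude low-pass filter and decimator), the fine window spans one decimation factor on either side of the coarse estimate, and the helper names are hypothetical. A practical system would likely use a better anti-aliasing filter.

```python
import random

def block_average(x, factor):
    """Reduce the sample rate by averaging non-overlapping blocks."""
    return [sum(x[i:i + factor]) / factor
            for i in range(0, len(x) - factor + 1, factor)]

def best_lag(source, ret, lags):
    """Lag in `lags` with the largest-magnitude cross-correlation."""
    def corr(lag):
        n = min(len(source), len(ret) - lag)
        return sum(source[i] * ret[i + lag] for i in range(n))
    return max(lags, key=lambda lag: abs(corr(lag)))

def coarse_to_fine_lag(source, ret, max_lag, factor):
    """Find the echo delay with a reduced-rate pass, then a full-rate pass."""
    # Stage 1: search the whole range at the reduced sample rate.
    coarse = best_lag(block_average(source, factor),
                      block_average(ret, factor),
                      range(max_lag // factor + 1))
    # Stage 2: search a narrow full-rate window around the coarse estimate.
    centre = coarse * factor
    lo = max(0, centre - factor)
    hi = min(max_lag, centre + factor)
    return best_lag(source, ret, range(lo, hi + 1))

# Synthetic example (assumed values): echo at 150 samples, gain 0.9.
random.seed(1)
source = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ret = [0.0] * (2000 + 256)
for i, s in enumerate(source):
    ret[i + 150] += 0.9 * s

found = coarse_to_fine_lag(source, ret, max_lag=256, factor=8)
```

With a factor of 8, the coarse pass evaluates 33 lags on signals one-eighth the length, and the fine pass evaluates at most 17 full-rate lags, instead of 257 full-rate lags for an exhaustive search.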
In another aspect of the invention, values from a prior echo correlation are retained for use in a subsequent correlation so as to provide faster response in cases where the delay of the echo is likely to remain the same.
In yet another aspect of the invention, a correlation in a narrow window (based on echo delay from a prior correlation) is run simultaneously with a correlation at a reduced sample rate for a wider window. This allows the invention to detect echo over the wider window, while also providing the fast response in cases where the delay remains the same as the prior correlation.
The above features of the invention allow echo suppression in the specific conditions present in the above live broadcast, distance learning and conferencing scenarios. By defining a multiple state operating sequence under VOX control, the echo suppression according to these aspects of the invention prevents the performer from hearing echo while the correlation is being processed and as the source audio signal goes on and off the air. Further, the dual correlation (at both reduced and full sample rates) allows the echo suppression to be done more quickly for echo at longer time delays.