Reducing or Eliminating Echo in Audio and Video Conferences
Audio conferences and video conferences have become very popular in recent years. Starting as tools for business use, they have become available to private users and are widely employed for personal and social use.
One source of disturbance to the audio channel of a conference call (either an audio call or a video call) is the echoing of audio signals that are transmitted from the microphone of a first party in the call to the speaker of a second party, then to the microphone of the second party and finally back to the speaker of the first party. The problem and its prior art solutions are explained in the Wikipedia article entitled ‘Echo Suppression and Cancellation’, which begins as follows:
    Echo suppression and echo cancellation are methods in telephony to improve voice quality by preventing echo from being created or removing it after it is already present. In addition to improving subjective quality, this process increases the capacity achieved through silence suppression by preventing echo from traveling across a network.
    These methods are commonly called acoustic echo suppression (AES) and acoustic echo cancellation (AEC), and more rarely line echo cancellation (LEC). In some cases, these terms are more precise as there are various types and causes of echo with unique characteristics, including acoustic echo (sounds from a loudspeaker being reflected and recorded by a microphone, which can vary substantially over time) and line echo (electrical impulses caused by, e.g., coupling between the sending and receiving wires, impedance mismatches, electrical reflections, etc., which varies much less than acoustic echo). In practice, however, the same techniques are used to treat all types of echo, so an acoustic echo canceller can cancel line echo as well as acoustic echo. “AEC” in particular is commonly used to refer to echo cancelers in general, regardless of whether they were intended for acoustic echo, line echo, or both.
    Echo suppressors were developed in the 1950s in response to the first use of satellites for telecommunications, but they have since been largely supplanted by better performing echo cancellers.
    Although echo suppressors and echo cancellers have similar goals (preventing a speaking individual from hearing an echo of their own voice), the methods they use are different:
    Echo suppressors work by detecting a voice signal going in one direction on a circuit, and then inserting a great deal of loss in the other direction. Usually the echo suppressor at the far-end of the circuit adds this loss when it detects voice coming from the near-end of the circuit. This added loss prevents the speaker from hearing his own voice.
    Echo cancellation involves first recognizing the originally transmitted signal that re-appears, with some delay, in the transmitted or received signal. Once the echo is recognized, it can be removed by subtracting it from the transmitted or received signal. This technique is generally implemented digitally using a digital signal processor or software, although it can be implemented in analog circuits as well.
    ITU standards G.168 and P.340 describe requirements and tests for echo cancellers in digital and PSTN applications, respectively.
Acoustic echo cancellers are well known in the art. A block diagram of a typical prior art acoustic echo canceller is shown in FIG. 1 below, which is taken from David Qi, ‘Acoustic Echo Cancellation: Algorithms and Implementation on the TMS320C8x’, Digital Signal Processing Solutions, SPRA063, May 1996 (hereinafter ‘the Qi publication’), which is incorporated herein by reference in its entirety.
In FIG. 1 (PRIOR ART), x(n) is the desired signal to be output to the far end (the near-end user's voice) and r(n) is the undesired signal whose elimination is desired (i.e. the disturbance entering the near-end microphone as a result of the near-end speaker playing the far-end signal y(n)). AEC uses the incoming far-end signal as a reference input to an adaptive filter, which produces an estimate of the echo; this estimate is subtracted from the outgoing near-end signal. This way the far-end user does not hear his own voice coming back as an echo. In the example of FIG. 1, the AEC includes a Normalized Least Mean Squares (NLMS) adaptive filter.
This implies that undesirable signals for which there is no available reference cannot be filtered out and will reach the far-end user. Any signal that is generated at the near end but (unlike the near-end user's voice) should not be heard on the other side of the call is an undesirable signal. An example of an undesirable signal is a radio playing in the background of the near-end user, which should not be heard on the far-end side. However, as explained above, the current AEC solutions are not effective in preventing such signals from being sent out.
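The reference-based filtering described above, and the limitation it implies, can be illustrated with a minimal sketch. The sketch is not taken from the Qi publication; it is a generic NLMS loop written under stated assumptions. The echo path, the step size mu, the filter length and the white-noise stand-ins for the far-end signal and for the background radio are all hypothetical values chosen for illustration, and the near-end voice x(n) is omitted so that only the echo and the background signal reach the microphone.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.2, eps=1e-8):
    """Subtract an adaptive estimate of the far-end echo from the mic signal.

    far_end: reference samples y(n) played by the near-end speaker.
    mic:     samples captured by the near-end microphone.
    Returns the outgoing samples after the echo estimate is subtracted.
    """
    weights = [0.0] * taps   # running estimate of the speaker-to-microphone path
    history = [0.0] * taps   # most recent far-end samples, newest first
    outgoing = []
    for y_n, mic_n in zip(far_end, mic):
        history = [y_n] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_n - echo_estimate          # sample sent to the far end
        energy = sum(h * h for h in history) + eps
        step = mu * error / energy             # NLMS: step normalized by input energy
        weights = [w + step * h for w, h in zip(weights, history)]
        outgoing.append(error)
    return outgoing


def power(samples):
    """Mean squared value of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def simulate(n=4000, seed=0):
    """Compare residual power with and without an unreferenced background signal."""
    rng = random.Random(seed)
    echo_path = [0.0, 0.6, 0.3, 0.1]   # hypothetical room response (assumption)
    y = [rng.uniform(-1, 1) for _ in range(n)]            # far-end signal y(n)
    r = [sum(echo_path[k] * y[i - k] for k in range(len(echo_path)) if i - k >= 0)
         for i in range(n)]                                # echo r(n) at the microphone
    radio = [0.5 * rng.uniform(-1, 1) for _ in range(n)]  # background, no reference available
    tail = slice(n - 500, n)                               # measure after adaptation settles
    out_echo_only = nlms_echo_cancel(y, r)
    out_with_radio = nlms_echo_cancel(y, [a + b for a, b in zip(r, radio)])
    return {
        "echo_power": power(r[tail]),
        "residual_echo_only": power(out_echo_only[tail]),
        "radio_power": power(radio[tail]),
        "output_with_radio": power(out_with_radio[tail]),
    }


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.6f}")
```

In this sketch the echo is removed because the filter is given y(n) as a reference, whereas the simulated radio passes through to the outgoing signal because no reference for it is supplied to the filter.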
The problem of undesired background signals is typically dealt with by trying to avoid generating them in the first place. A person in the same room as the user who speaks loudly is quickly hushed, a radio or music player in the room is turned off or turned down to a low volume level, and if the source of the undesired signal cannot be turned off it may be physically moved away from the microphone to reduce the disturbance it causes.
Conventional techniques for AES or AEC are disclosed in some or all of the following 72 US patent documents, all of which are incorporated herein by reference: U.S. Pat. No. 5,305,307, U.S. Pat. No. 5,661,813, U.S. Pat. No. 5,706,344, U.S. Pat. No. 5,761,318, U.S. Pat. No. 5,796,819, U.S. Pat. No. 5,933,495, U.S. Pat. No. 5,937,060, U.S. Pat. No. 6,246,760, U.S. Pat. No. 6,473,409, U.S. Pat. No. 6,553,122, U.S. Pat. No. 6,556,682, U.S. Pat. No. 6,597,787, U.S. Pat. No. 6,694,020, U.S. Pat. No. 6,925,176, U.S. Pat. No. 6,928,161, U.S. Pat. No. 6,961,422, U.S. Pat. No. 6,968,064, U.S. Pat. No. 7,003,099, U.S. Pat. No. 7,006,624, U.S. Pat. No. 7,035,398, U.S. Pat. No. 7,039,197, U.S. Pat. No. 7,046,794, U.S. Pat. No. 7,054,437, U.S. Pat. No. 7,062,040, U.S. Pat. No. 7,099,458, U.S. Pat. No. 7,117,145, U.S. Pat. No. 7,142,665, U.S. Pat. No. 7,171,003, U.S. Pat. No. 7,426,270, U.S. Pat. No. 7,433,463, U.S. Pat. No. 7,464,029, U.S. Pat. No. 7,545,926, U.S. Pat. No. 7,698,133, U.S. Pat. No. 7,747,001, U.S. Pat. No. 7,773,743, U.S. Pat. No. 7,831,035, U.S. Pat. No. 7,890,321, U.S. Pat. No. 8,064,966, U.S. Pat. No. 8,068,884, U.S. Pat. No. 8,077,641, U.S. Pat. No. 8,085,949, U.S. Pat. No. 8,111,833, U.S. Pat. No. 8,150,027, U.S. Pat. No. 8,175,871, U.S. Pat. No. 8,189,767, U.S. Pat. No. 8,194,850, U.S. Pat. No. 8,204,210, U.S. Pat. No. 8,265,289, U.S. Pat. No. 8,275,142, U.S. Pat. No. 8,306,214, U.S. Pat. No. 8,320,554, U.S. Pat. No. 8,325,910, U.S. Pat. No. 8,325,934, U.S. Pat. No. 8,345,860, U.S. Pat. No. 8,380,253, U.S. Pat. No. 8,401,203, U.S. Pat. No. 8,498,407, U.S. Pat. No. 8,600,038, U.S. Pat. No. 8,605,890, U.S. Pat. No. 8,644,494, U.S. Pat. No. 8,712,068, U.S. Pat. No. 8,831,210, U.S. Pat. No. 8,838,184, U.S. Pat. No. 8,934,622, U.S. Pat. No. 8,934,945, U.S. Pat. No. 9,008,302, U.S. Pat. No. 9,036,815, U.S. Pat. No. 9,053,697, U.S. Pat. No. 9,071,900, U.S. Pat. No. 9,088,336, U.S. Pat. No. 9,100,090, and U.S. Pat. No. 9,225,843.
Set Top Boxes
A set-top box (STB) is an information appliance device that generally contains an input module for receiving data corresponding to video content from the external world and an output module for providing a TV signal to an external television set or to other display devices. The TV signal corresponds to the video content and is in a form that can be displayed on the television screen or on the other display devices. STBs may receive the data corresponding to the video content from cable networks, satellite networks, over-the-air television broadcasts or from digital networks such as the Internet. An STB may have additional functionalities that are not necessarily directed to displaying of video content. Examples are Internet browsing, conferencing, general-purpose computing, etc.
FIG. 2A (prior art) describes a system comprising a conventional STB 100 in local communication with external TV set 200. For the present disclosure, when two devices are said to be in communication with each other, this term is intended to broadly cover the case of one-way communication (one example of which is illustrated in FIG. 2A) or bidirectional communication.
In one example, STB 100 may be an IPTV (Internet Protocol television) STB or an OTT (over-the-top) TV STB; in these examples, packets of the digital content data are received into STB 100 from the cloud via input port 110. In another example, a satellite receiver (NOT SHOWN) may receive the digital content data. In yet another example, a TV receiver (NOT SHOWN) may receive the digital content data as broadcast over-the-air. In all of these cases, digital content data 130 is said to be received into STB 100 via device port(s) 110, a term which is defined broadly herein. In yet another example, data is locally uploaded to STB 100 via device port(s) 110, which may be or include a USB port. The incoming data may be handled and/or pre-processed by STB processing module 120 which, for example, may handle I/O-related operations.
For the present disclosure, any reference to a single ‘port’ may also refer to multiple ports and any reference to multiple ports may also refer to a single port. Thus, output port 112 via which the TV signal is output may refer to a single port for both video and audio or to a first port for video and a second (and separate) port for audio.
For the present disclosure, a ‘TV signal’ comprises two components (i) a TV video signal and (ii) a TV audio signal corresponding to an audio track of the TV signal.
For the present disclosure, when two ports are illustrated as separate ports they may in fact be separate or may be parts of a single port. Thus, in one example, even though input port 110 and output port 112 are illustrated separately, they may in fact be part of a common port.
For the present disclosure, unless stated otherwise, a ‘port’ may be wired or wireless. Thus, STB 100 may be in wired and/or wireless communication with external TV set 200 via respective ports 112, 210.
Optionally, STB 100 includes non-volatile storage 118. In contrast to STB processing module 120 which may handle on-the-fly processing of incoming digital data, non-volatile storage 118 may be used for longer-term storage and may include, for example, flash memory or magnetic media or optical storage.
The digital content data may be encoded in any known format—example standards of formats for encoding and/or packaging digital content data 130 include but are not limited to H.264, H.265 (HEVC), AVI, all parts of MPEG4, MPEG2, any ISO-based media file format, Ogg, ASF, QuickTime, RealMedia, Matroska, and DivX.
Digital data 130 is made available to TV signal generator 140 which generates a TV signal from the digital data 130. Digital data 130 corresponds to content of the TV signal—e.g. a first portion of digital data 130 corresponds to the TV video signal and a second portion of digital data 130 corresponds to the TV audio signal.
In the particular example of FIG. 2A, TV signal generator 140 comprises: (i) TV video signal generator 145 for generating the TV video signal from digital data 130; and (ii) TV audio signal generator 150 for generating the TV audio signal from digital data 130.
Optionally, ‘generation’ of a TV video signal and/or of a TV audio signal and/or of a TV signal includes decoding the digital data 130 corresponding to content of the TV signal. Towards this end, TV video signal generator 145 may include a video CODEC for decoding digital data 130 or a portion thereof, for example a hardware video CODEC. In different examples, the video CODEC may comply with any known standard, including but not limited to H.264, H.265, WMV (Windows Media Video), On2 (e.g. VPx CODECs), or any other standard.
Furthermore, digital data 130 is made available to TV audio signal generator 150 which generates a TV audio signal (i.e. audio stream) from the digital data 130. Optionally, this includes decoding audio of the digital data. TV audio signal generator 150 may include an audio CODEC for decoding digital data 130 or a portion thereof—for example, a hardware audio CODEC.
As shown in FIG. 2A, the TV signal is output to external TV set 200 via any appropriate analog and/or digital media port (e.g. plug or socket). Although FIG. 2A illustrates output of a single TV signal, this is not a limitation—alternatively, a first portion of the TV signal (e.g. TV video signal) may be exported to external TV set 200 via a first media port (i.e. considered part of port 112) and a second portion of the TV signal (e.g. TV audio signal) may be exported to external TV set 200 via a second media port (i.e. considered part of port 112).
Exemplary media ports include but are not limited to HDMI (High-Definition Multimedia Interface) sockets, DVI (Digital Visual Interface) connectors, DisplayPort connectors, S-video plugs, VGA (Video Graphics Array) ports, audio sockets, and USB connectors. External television (TV) set 200 comprises a screen 210 for presenting the TV video signal and a speaker 220 for playing the TV audio signal.
FIG. 2B is a flow chart describing operation of the system of FIG. 2A.
The method of FIG. 2B comprises: a. obtaining S101, by the STB 100, digital data 130 corresponding to content of a TV signal (e.g. via input port 110); b. based on the digital data 130, generating S105, by the STB 100 (e.g. by TV signal generator 140), the TV signal, i.e. the TV video signal (e.g. generated by TV video signal generator 145) and the TV audio signal (e.g. generated by TV audio signal generator 150); and c. outputting S109, from the STB 100 to the local external TV set 200, the TV signal to cause the local external TV set 200 to play the STB-generated TV signal (i.e. by presenting the TV video signal on screen 210 and causing speaker 220 to play the TV audio signal).
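The three steps of FIG. 2B can be pictured with a short sketch. It is only an illustration under stated assumptions: the class and function names (SetTopBox, ExternalTVSet, TVSignal, decode) are hypothetical, the digital data is modeled as a dictionary with a video portion and an audio portion, and the decode placeholder stands in for whatever video or audio CODEC an actual STB would use.

```python
from dataclasses import dataclass, field


@dataclass
class TVSignal:
    video: bytes   # TV video signal
    audio: bytes   # TV audio signal


@dataclass
class ExternalTVSet:
    screen: list = field(default_factory=list)    # stands in for the screen
    speaker: list = field(default_factory=list)   # stands in for speaker 220

    def play(self, signal: TVSignal) -> None:
        """Present the video on the screen and play the audio on the speaker."""
        self.screen.append(signal.video)
        self.speaker.append(signal.audio)


def decode(payload: bytes) -> bytes:
    """Placeholder for a real CODEC (assumption): returns the payload unchanged."""
    return payload


class SetTopBox:
    def obtain(self, incoming: dict) -> dict:
        """S101: obtain digital data corresponding to content of a TV signal."""
        return {"video": incoming["video"], "audio": incoming["audio"]}

    def generate(self, digital_data: dict) -> TVSignal:
        """S105: generate the TV video signal and the TV audio signal."""
        return TVSignal(video=decode(digital_data["video"]),
                        audio=decode(digital_data["audio"]))

    def output(self, signal: TVSignal, tv: ExternalTVSet) -> None:
        """S109: output the TV signal so that the external TV set plays it."""
        tv.play(signal)


stb, tv = SetTopBox(), ExternalTVSet()
data = stb.obtain({"video": b"video-frames", "audio": b"audio-samples"})
stb.output(stb.generate(data), tv)
```

After the last line runs, the video portion has been appended to tv.screen and the audio portion to tv.speaker, mirroring the split between the TV video signal and the TV audio signal.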
Video Conferencing Concurrent with Television
Recently a new use of video conferencing has become common, in which the conferencing (or “chatting” as it is sometimes called) takes place in parallel with joint watching of TV content by the participants of the conference. This is part of the trend of “social TV”: friends sit in front of their respective TV screens at their respective homes and video-chat with each other while watching TV content on the same screen on which the video chatting windows are presented. The watched content may be common to all participants in the session, but this is not necessarily so; there are use cases in which each user watches a different channel and the users update each other on what they see. The participants may exchange comments about what they see on the screen, laugh or cheer at each other (for example when a goal is scored in a football match against or for one's favorite football team), or question each other about information related to a viewed program. The social TV experience is typically supported by an STB that provides the TV content to the TV screen and can modify the signal going out to the TV screen such that the viewer sees additional information on top of the regular TV content provided by the TV operator. The additional information may be video windows showing images captured by local or remote cameras, text for user notifications, sound alerts, or any other type of information.
In such viewing scenarios a user wants to listen to both the audio track of the watched TV program and the audio track of the conference at the same time. However, it is also desired that the TV audio track not be transmitted to the far end. If the TV audio does reach the far end, the remote user will hear it twice with a small delay (if both users are watching the same program) or will hear two different audio tracks garbled together (if the two users are watching different programs). The prior art solution of turning off or turning down the volume of the TV is not applicable in such a case, because the near-end user needs the volume to be high enough to hear the TV comfortably.