As traffic over Internet Protocol (IP) networks continues its rapid growth, and as the variety of multimedia conferencing equipment grows, more and more people use multimedia conferencing as a communication tool. Today, multimedia conferencing can be carried out using two types of communication methods: the legacy multimedia conferencing method and the newer media relay conferencing method. In this disclosure, the terms multimedia conference, video conference, and audio conference may be used interchangeably, and the term video conference can be used as a representative term for all of them.
The legacy multipoint conference between three or more participants requires a Multipoint Control Unit (MCU). An MCU is a conference controlling entity that is typically located in a node of a network or in a terminal, and that receives several channels from endpoints. According to some criteria, the MCU processes audio and visual signals and distributes them to a set of connected channels. Examples of MCUs include the MGC-100 and the RMX 2000, which are available from Polycom, Inc. (RMX-2000 is a registered trademark of Polycom, Inc.) A terminal, which may be referred to as a legacy endpoint (LEP), is an entity on the network capable of providing real-time, two-way audio and/or audiovisual communication with another LEP or with the MCU. A more thorough definition of an LEP and an MCU can be found in the International Telecommunication Union (“ITU”) standards, such as but not limited to the H.320, H.324, and H.323 standards, which can be found at the ITU website: www.itu.int.
A common MCU, referred to also as a legacy MCU, may include a plurality of audio and video decoders, encoders, and media combiners (audio mixers and/or video image builders). The MCU may use a large amount of processing power to handle audio and video communication between a variable number of participants (LEPs). The communication can be based on a variety of communication protocols and compression standards and may involve different types of LEPs. The MCU may need to combine a plurality of input audio or video streams into at least one single output stream of audio or video, respectively, that is compatible with the properties of at least one conferee's LEP to which the output stream is being sent. The compressed audio streams received from the endpoints are decoded and can be analyzed to determine which audio streams will be selected for mixing into the single audio stream of the conference. In the present disclosure, the terms decode and decompress can be used interchangeably.
A conference may have one or more video output streams wherein each output stream is associated with a layout. A layout defines the appearance of a conference on a display of one or more conferees that receive the stream. A layout may be divided into one or more segments where each segment may be associated with a video input stream that is sent by a conferee (endpoint). Each output stream may be constructed of several input streams, resulting in a continuous presence (CP) conference. In a CP conference, a user at a remote terminal can observe, simultaneously, several other participants in the conference. Each participant may be displayed in a segment of the layout, where each segment may be the same size or a different size. The choice of the participants displayed and associated with the segments of the layout may vary among different conferees that participate in the same session.
The growing trend of using video conferencing raises the need for low cost MCUs that will enable one to conduct a plurality of conferencing sessions having composed CP video images. This need leads to the new technique of Media Relay Conferencing (MRC).
In MRC, a Media Relay MCU (MRM) receives one or more input streams from each participating Media Relay Endpoint (MRE). The MRM relays to each participating endpoint a set of multiple media output streams received from other endpoints in the conference. Each receiving endpoint uses the multiple streams to generate the video CP image, according to a layout, as well as mixed audio of the conference. The CP video image and the mixed audio are played to the MRE's user. An MRE can be a terminal of a conferee in the session which has the ability to receive relayed media from an MRM and deliver compressed media according to instructions from an MRM. A reader who wishes to learn more about an example of an MRC, MRM, or MRE is invited to read the related U.S. Pat. No. 8,228,363 and U.S. Patent Pub. No. 2012-023611 that are incorporated herein by reference. In the current disclosure, the term endpoint may refer to an MRE or an LEP.
In some MRC systems, a transmitting MRE sends its video image in two or more streams; each stream can be associated with a different quality level. The qualities may differ in frame rate, resolution, and/or signal-to-noise ratio (SNR), etc. In a similar way, each transmitting MRE may send its audio in two or more streams that may differ from each other in compression bit rate, for example. Such a system can use the plurality of streams to provide different segment sizes in the layouts, different resolutions used by each receiving endpoint, etc. Further, the plurality of streams can be used for overcoming packet loss.
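By way of a non-limiting illustration, the following Python sketch shows one way in which a stream could be chosen for a layout segment when a transmitting MRE offers several quality levels. The names `VideoStream` and `pick_stream_for_segment`, and the selection policy itself (the smallest stream that still covers the segment, otherwise the largest available one), are hypothetical assumptions made for this example and are not taken from the referenced patents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoStream:
    """One quality level offered by a transmitting MRE (hypothetical model)."""
    stream_id: str
    width: int
    height: int
    frame_rate: float


def pick_stream_for_segment(streams, segment_width, segment_height):
    """Pick the stream whose resolution best fits a layout segment.

    Assumed policy: prefer the lowest-resolution stream that still covers
    the segment, so no bandwidth is spent on pixels that would be scaled
    away; if no stream is large enough, fall back to the largest one.
    """
    if not streams:
        raise ValueError("at least one stream is required")
    covering = [
        s for s in streams
        if s.width >= segment_width and s.height >= segment_height
    ]
    if covering:
        return min(covering, key=lambda s: s.width * s.height)
    return max(streams, key=lambda s: s.width * s.height)


# Example: a transmitting MRE offering a CIF stream and a 720p stream.
offered = [
    VideoStream("cif", 352, 288, 15.0),
    VideoStream("hd720", 1280, 720, 30.0),
]
small_segment_choice = pick_stream_for_segment(offered, 320, 180)
large_segment_choice = pick_stream_for_segment(offered, 640, 360)
```

In this sketch a small segment is served by the CIF stream while a larger segment is served by the 720p stream; other policies, for example ones that also weigh frame rate or available bandwidth, are equally possible.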
Today, MRC is becoming more and more popular. Many video conferencing systems deliver multiple quality levels in parallel within one or more streams. For video, for example, the quality can be expressed in a number of domains, such as the temporal domain (frames per second, for example), the spatial domain (HD versus CIF, for example), and/or the quality domain (sharpness, for example). Video compression standards that can be used for multi-quality streams include H.264 AVC, H.264 Annex G (SVC), MPEG-4, etc. More information on compression standards such as H.264 can be found at the ITU website www.itu.int, or at www.mpeg.org.
A reader who wishes to learn more about MRMs and MREs is invited to read U.S. Pat. No. 8,228,363, and U.S. patent application Ser. No. 13/487,703, which are incorporated herein by reference.
In order to achieve a good user experience, there is a need to synchronize the played video and audio. The Real-time Transport Protocol (RTP), which is commonly used for audio and video, comprises an audio-video synchronization mechanism. An example of RTP including an audio-video synchronization mechanism is described in RFC 3550, the contents of which are incorporated by reference. The mechanism uses timestamps in the RTP header of media packets, and RTCP sender reports (SR) and receiver reports (RR). The SR may include reception report blocks, which are equivalent to the reception reports that may have been included within an RR. The present disclosure uses the term RR also for the cases in which the reception reports are included within the SR, and uses the term SR only for the sender report section within the SR. More information on RTP can be found at the Internet Engineering Task Force (IETF) website www.ietf.org.
In order to synchronize the audio streams and the video streams, the transmitting MRE or LEP inserts timestamps into the headers of the audio and video Real-time Transport Protocol (RTP) packets it sends. The timestamps reflect the capture time of the audio (audio timestamp, TSa) by the microphone and/or of the video (video timestamp, TSv) by the camera, respectively. For each type of stream (audio or video), the timestamps start at a random value and progress based on different clock rates for the audio and video codecs, 8 kHz for audio and 90 kHz for video, for example.
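By way of illustration only, the timestamp progression described above can be sketched in Python as follows. The clock rates are the example values given above, and the 32-bit width of the RTP timestamp field follows RFC 3550; the function names `initial_timestamp`, `rtp_timestamp`, and `elapsed_ticks` are hypothetical and serve only this example.

```python
import random

AUDIO_CLOCK_HZ = 8_000    # example audio clock rate from the text
VIDEO_CLOCK_HZ = 90_000   # example video clock rate from the text
RTP_TS_MODULUS = 2 ** 32  # the RTP timestamp field is 32 bits wide (RFC 3550)


def initial_timestamp(rng=random):
    """Random starting value, chosen independently for each stream."""
    return rng.randrange(RTP_TS_MODULUS)


def rtp_timestamp(initial, elapsed_seconds, clock_hz):
    """Timestamp of a sample captured elapsed_seconds after stream start.

    The counter advances at the codec clock rate and wraps at 32 bits.
    """
    ticks = round(elapsed_seconds * clock_hz)
    return (initial + ticks) % RTP_TS_MODULUS


def elapsed_ticks(earlier, later):
    """Ticks from earlier to later, tolerating one wraparound."""
    return (later - earlier) % RTP_TS_MODULUS


# The same 20 ms of capture time advances the two counters differently.
audio_ts = rtp_timestamp(1000, 0.02, AUDIO_CLOCK_HZ)  # 1000 + 160 ticks
video_ts = rtp_timestamp(1000, 0.02, VIDEO_CLOCK_HZ)  # 1000 + 1800 ticks
```

Because each stream starts at its own random value and advances at its own rate, the audio and video timestamps cannot be compared directly; a common reference is needed to relate them.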
Periodically, the transmitting endpoint, MRE or LEP, sends an RTP control protocol (RTCP) sender report (SR) for each output stream (audio or video). The sender report can include a reference to an associated wall clock at the time the message is sent. The wall clock time (absolute date and time) can be presented using the time format of the Network Time Protocol (NTP), for example. In addition, the RTCP sender report for each stream also includes the associated timestamp (TSa or TSv, respectively) at the time that the sender report was sent, reflecting the timestamp that would have been placed in an audio or video RTP packet, respectively, if it were transmitted at the time the RTCP message is generated. The time interval between two consecutive RTCP sender reports can be a few seconds, 5 seconds for example.
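As a minimal sketch, assuming the 64-bit NTP timestamp format (32 bits of seconds since 1 January 1900 followed by a 32-bit binary fraction), the pairing of wall clock time and RTP timestamp carried by a sender report could be modeled in Python as shown below. The `SenderReportInfo` class and the helper names are hypothetical; only the two time fields are modeled, and the remaining fields of an actual RTCP SR packet are omitted.

```python
from dataclasses import dataclass

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2_208_988_800
RTP_TS_MODULUS = 2 ** 32


def unix_to_ntp(unix_seconds):
    """Split a Unix time into (seconds, fraction) NTP fields of 32 bits each."""
    ntp = unix_seconds + NTP_UNIX_EPOCH_DELTA
    seconds = int(ntp)
    fraction = int((ntp - seconds) * 2 ** 32)
    return seconds & 0xFFFFFFFF, fraction & 0xFFFFFFFF


def ntp_to_unix(seconds, fraction):
    """Inverse of unix_to_ntp (valid within the current NTP era)."""
    return seconds - NTP_UNIX_EPOCH_DELTA + fraction / 2 ** 32


@dataclass(frozen=True)
class SenderReportInfo:
    """Only the timing fields of a sender report (hypothetical model)."""
    ssrc: int
    ntp_seconds: int
    ntp_fraction: int
    rtp_timestamp: int


def make_sender_report(ssrc, unix_now, stream_start_unix, initial_ts, clock_hz):
    """Pair the wall clock with the RTP timestamp a packet would carry now."""
    ticks = round((unix_now - stream_start_unix) * clock_hz)
    ntp_seconds, ntp_fraction = unix_to_ntp(unix_now)
    return SenderReportInfo(
        ssrc=ssrc,
        ntp_seconds=ntp_seconds,
        ntp_fraction=ntp_fraction,
        rtp_timestamp=(initial_ts + ticks) % RTP_TS_MODULUS,
    )


# An audio stream that started at Unix time 0.5 s with timestamp 1000,
# reported 10 s later at an 8 kHz clock rate.
report = make_sender_report(
    ssrc=1, unix_now=10.5, stream_start_unix=0.5, initial_ts=1000, clock_hz=8_000
)
```

A real endpoint would read the current time from its system clock; fixed values are used here only so that the arithmetic is easy to follow.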
This mechanism enables the receiving endpoint to correlate the wall clock of the receiving endpoint with the wall clock of the transmitting endpoint. This correlation can be adjusted each time an RTCP sender report is received. The receiving endpoint can use the wall clock and the timestamp in the respective sender reports to synchronize the received audio and video streams, by adjusting the received audio play time to that of the received video, or vice versa. RTP and RTCP are well known in the art and are described in numerous RFCs. A reader who wishes to learn more about RTP and RTCP is invited to read RFCs 3550, 4585, 4586, and many others that can be found at the Internet Engineering Task Force (IETF) website www.ietf.org, the content of which is incorporated herein by reference.
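The correlation described above can be sketched in Python as follows, under the assumption that the audio and video sender reports of one transmitting endpoint refer to the same wall clock. Each sender report is represented here as a hypothetical tuple of (wall clock seconds, RTP timestamp, clock rate in Hz); the function names are likewise illustrative and are not part of RTP or of any particular library.

```python
RTP_TS_MODULUS = 2 ** 32


def signed_ticks(ts, reference):
    """Tick difference ts - reference, interpreted across 32-bit wraparound."""
    diff = (ts - reference) % RTP_TS_MODULUS
    return diff - RTP_TS_MODULUS if diff >= RTP_TS_MODULUS // 2 else diff


def capture_wall_time(ts, sender_report):
    """Map a packet's RTP timestamp to the sender's wall clock.

    sender_report is (wall_seconds, rtp_timestamp, clock_hz), taken from the
    most recent RTCP sender report of the same stream.
    """
    sr_wall, sr_ts, clock_hz = sender_report
    return sr_wall + signed_ticks(ts, sr_ts) / clock_hz


def av_skew_seconds(audio_ts, audio_sr, video_ts, video_sr):
    """Capture-time difference; positive means the audio was captured later."""
    return (capture_wall_time(audio_ts, audio_sr)
            - capture_wall_time(video_ts, video_sr))


# Both reports were generated at sender wall clock 1000.0 s.
audio_sr = (1000.0, 50_000, 8_000)
video_sr = (1000.0, 900_000, 90_000)

# Audio packet captured 1.0 s after the report, video frame 0.5 s after it.
skew = av_skew_seconds(58_000, audio_sr, 945_000, video_sr)
```

In this example the audio packet was captured 0.5 seconds after the video frame, so a receiving endpoint would play that audio 0.5 seconds after rendering the frame in order to keep the two streams aligned.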
In legacy CP transcoding video conferencing, a legacy MCU acts as a receiving entity while obtaining the compressed audio and video streams from the plurality of transmitting legacy endpoints. In addition, a legacy MCU acts as a transmitting entity while transmitting the compressed mixed audio and the compressed composed video streams of the conference CP video image toward the plurality of receiving legacy endpoints. In the uplink direction, the RTP timestamps and the RTCP reports provided by the endpoints to the MCU enable the MCU to synchronize audio and video RTP streams received from multiple sources. In the downlink direction, the MCU generates a video layout and a matching synchronized audio mix. The MCU sends the audio mix and the video layout to the receiving endpoints, each in a single RTP stream. Each packet in a stream carries its audio timestamp or video timestamp, respectively, and the streams are accompanied by RTCP reports. In some embodiments of MRC, however, synchronizing audio and video is more complex, because an MRM just relays the media streams while the receiving MRE (RMRE) mixes the audio and composes the CP video images. The relayed streams are generated by a plurality of transmitting MREs (TMREs), each having its own wall clock and timestamp domains. The mixed audio and the composed CP video image are rendered to a conferee that uses the RMRE.
An example of synchronizing the different streams in MRC is disclosed in the related U.S. Pat. No. 8,228,363 and U.S. patent application Ser. No. 13/487,703. Alternatively, each one of the entities, the MREs as well as the MRM, can synchronize its clock by using a Network Time Protocol (NTP) server. Other embodiments of an MRM may just relay received RTCP messages from TMREs toward RMREs. The above disclosed methods for synchronizing audio and video in an MRC session consume computing resources at the MRM and/or bandwidth resources between the MRM and the RMREs.
In other embodiments of MRC, due to the processing capabilities of the receiving endpoint, lack of support for an audio relay codec, or bandwidth limitations, a single audio stream that includes a mix of multiple audio streams from the most active speakers may be sent to a receiving endpoint, while the video streams of selected MREs are sent separately to the receiving MRE, which composes the streams into a CP video image. In such a situation, the received video streams cannot be synchronized to the received audio mix.