Computer networks (e.g., the Internet) have now supplanted traditional distribution systems (e.g., mail or telephone) for the delivery of media and information. Recent advances in multimedia and telecommunications technology have involved the integration of video and audio communication and conferencing capabilities with Internet Protocol (“IP”) communication systems such as IP PBX, instant messaging, web conferencing, etc. In order to effectively integrate video communication into such systems, the systems must generally support both point-to-point and multipoint communications. Multipoint servers (also referred to as conference bridges, multipoint conferencing units, or “MCUs”) employed in such applications must mix media streams from multiple participants in a multiparty conference and distribute them to all conference participants. Preferably, the MCUs should also provide options including: (1) continuous presence (e.g., so that multiple participants can be seen at the same time); (2) view or layout personalization (e.g., so that each participant can choose his or her own view of the other participants, with some of the other participants viewed in large format and some in small format); (3) error localization (e.g., when an error in transmission occurs, the error is resolved between that participant and the server); (4) random entry (e.g., a new participant's entrance into the conference has no or minimal impact on other participants); and (5) rate matching (e.g., so that each participant may be connected via a different network connection with different bandwidth and may receive data from the conference bridge at its own rate).
Current MCU solutions, which are referred to as “transcoding” MCUs, achieve these advantageous functions by decoding all video streams in the MCU, creating a personal layout for each participant, and re-encoding a participant-specific data stream for transmission to each participant, taking into account, e.g., that participant's available bandwidth. However, this solution adds significant delay to the transmission of the video stream, degrades the quality of the video data, and is costly to develop and deploy (such systems usually require complex, dedicated digital signal processors).
An alternative MCU solution is based on the so-called “switching” MCU. In this solution, only the video and/or audio signals of a single selected participant (i.e., an “active speaker”) are transmitted from the MCU to one or all of the other participants. The active speaker/participant may be selected by applying quantitative measures of voice activity to the audio signals of all participants. While the selection of the active speaker is typically performed at the MCU, the calculation of voice activity indicator(s) may also be performed at the endpoints (prior to transmission). Switching MCUs involve less DSP processing and are less complex than transcoding MCUs, but they correspondingly have less functionality (e.g., no error localization, no rate matching, limited random entry functionality).
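The active-speaker selection described above can be illustrated with a minimal sketch. The text does not say which voice activity measure is used, so the mean squared amplitude of one audio frame serves here as an assumed stand-in. The function names and participant ids are hypothetical.

```python
from typing import Dict, Sequence


def voice_activity(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame.

    Assumption: frame energy is used as the activity measure; the
    source text does not specify which measure a switching MCU applies.
    """
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def select_active_speaker(frames: Dict[str, Sequence[float]]) -> str:
    """Return the id of the participant whose current frame is most active."""
    if not frames:
        raise ValueError("no participants")
    return max(frames, key=lambda pid: voice_activity(frames[pid]))


# Hypothetical audio frames, one per participant.
frames = {
    "alice": [0.01, -0.02, 0.01, 0.00],
    "bob": [0.40, -0.35, 0.30, -0.45],
    "carol": [0.05, 0.04, -0.06, 0.05],
}
active = select_active_speaker(frames)
```

Because `voice_activity` needs only a participant's own samples, it could equally run at each endpoint, with the server merely comparing the reported indicators.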
Further, attempts have been made to implement methods specific to one video standard to combine the video streams in the compressed domain. A method based on the ITU-T H.261 standard calls for endpoints to transmit H.261 QCIF images to a conference bridge which then combines 4 of the QCIF images to create one CIF image. Newer video codecs such as ITU-T H.263 and H.264 enable the combination or “compositing” of coded pictures into a bigger picture by considering each of the constituent sub-pictures to be a separate slice of the bigger picture. These and other like methods tend to be very specific to the video compression standards and do not support personal layout (i.e., all participants are forced to watch a given participant in the same resolution), error resilience, or rate matching. They also create new challenges for the MCU designer in terms of proper synchronization between video and audio, and jitter buffer management. Other solutions are based on sending all data streams to all participants; these solutions do not support rate matching or selection of resolution by the endpoints.
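The H.261-based arrangement mentioned above, in which four QCIF images are combined into one CIF image, can be sketched geometrically. The sketch assumes the standard picture sizes (QCIF 176x144, CIF 352x288) and the H.261 group-of-blocks (GOB) numbering, in which a QCIF picture carries GOBs 1, 3, and 5 and a CIF picture carries GOBs 1 through 12 in two columns. The function names and the quadrant indexing are hypothetical.

```python
QCIF_W, QCIF_H = 176, 144   # QCIF luma dimensions in pixels
CIF_W, CIF_H = 352, 288     # CIF luma dimensions in pixels
QCIF_GOBS = (1, 3, 5)       # GOB numbers present in an H.261 QCIF picture


def quadrant_offset(index: int) -> tuple:
    """Top-left pixel (x, y) of quadrant `index` (0..3, row-major) in the CIF frame."""
    if index not in range(4):
        raise ValueError("quadrant index must be 0..3")
    row, col = divmod(index, 2)
    return col * QCIF_W, row * QCIF_H


def remap_gob(index: int, qcif_gob: int) -> int:
    """CIF GOB number that a QCIF GOB takes when placed in quadrant `index`.

    CIF GOBs run 1, 3, ..., 11 down the left column and 2, 4, ..., 12 down
    the right, so moving right adds 1 and moving down half a frame adds 6.
    """
    if qcif_gob not in QCIF_GOBS:
        raise ValueError("QCIF GOB number must be 1, 3, or 5")
    if index not in range(4):
        raise ValueError("quadrant index must be 0..3")
    row, col = divmod(index, 2)
    return qcif_gob + col + 6 * row


# GOB numbers occupied by each of the four participants in the composite.
layout = {i: [remap_gob(i, g) for g in QCIF_GOBS] for i in range(4)}
```

Every participant lands in a fixed quarter of the same composite picture, which is why such schemes cannot offer a personal layout or per-participant resolution.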
Currently available video communication solutions are also not resilient to packet loss and perform unpredictably except in expensive and dedicated network configurations. Network error conditions that may not pose a problem for most other applications can result in unacceptable quality in videoconferencing.
New digital video and audio “scalable” coding techniques, directed to general improvements in coding efficiency, also have a number of new structural characteristics. Specifically, an important new characteristic is scalability. In scalable coding, an original or source signal is represented using two or more hierarchically structured bitstreams. The hierarchical structure implies that decoding of a given bitstream depends on the availability of some or all other bitstreams that are lower in the hierarchy. Each bitstream, together with the bitstreams it depends on, offers a representation of the original signal at a particular temporal, quality (e.g., in terms of signal-to-noise ratio, or SNR), or spatial resolution (for video).
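The hierarchical dependency between bitstreams can be made concrete with a small sketch that, given a target bitstream, lists every bitstream a decoder would need in a valid decoding order. The layer names and the dependency table are hypothetical and chosen only for illustration; they do not correspond to any particular standard.

```python
from typing import Dict, List

# Hypothetical hierarchy: each bitstream lists the lower bitstreams it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "base": [],
    "temporal_1": ["base"],
    "spatial_1": ["base"],
    "spatial_1_temporal_1": ["spatial_1", "temporal_1"],
}


def required_layers(target: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return `target` and all bitstreams it depends on, dependencies first."""
    ordered: List[str] = []
    seen = set()

    def visit(layer: str) -> None:
        if layer in seen:
            return
        seen.add(layer)
        for dep in depends_on[layer]:
            visit(dep)
        ordered.append(layer)  # appended only after all of its dependencies

    visit(target)
    return ordered


needed = required_layers("spatial_1_temporal_1", DEPENDS_ON)
# needed == ["base", "spatial_1", "temporal_1", "spatial_1_temporal_1"]
```

A receiver that obtains only `"base"` can still decode a lower-resolution representation, whereas the higher bitstreams are unusable without the ones beneath them.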
The term ‘scalable’ does not refer to magnitude or scale in terms of numbers, but rather to the ability of the encoding technique to offer a set of different bitstreams corresponding to efficient representations of the original or source signal at different resolutions or qualities in general. The forthcoming ITU-T H.264 Annex F specification (referred to as Scalable Video Coding, SVC) is an example of a video coding standard that offers video coding scalability in all of temporal, spatial, and quality resolutions, and is an extension of the H.264 standard (also known as Advanced Video Coding, or AVC). Another much older example is ISO MPEG-2 (also published as ITU-T H.262), which also offered all three types of scalability. ITU-T G.729.1 (also known as G.729EV) is an example of a standard offering scalable audio coding.
Scalability in coding was designed as a solution for video and audio distribution problems in streaming and broadcasting, with a view to allowing a given system to operate with varying access networks (e.g., clients connected with different bandwidths), network conditions (e.g., bandwidth fluctuation), or client devices (e.g., a personal computer that uses a large monitor vs. a handheld device with a much smaller screen).
Consideration is now being given to improved multimedia conferencing applications. In particular, attention is directed toward improving conference server architectures by using scalable video and audio coding techniques. Desirable conference server architectures and data coding techniques will support personal layout, continuous presence, rate matching, error resilience and random entry, as well as low delay.