There are several applications in which multiple video views can be presented on one or more video displays. One example is multipoint videoconferencing systems, where one or more video streams arrive at a receiver and must be presented on a common display. High-end videoconferencing systems may in fact employ two or more displays for that purpose. As the number of participants grows, it becomes impossible to fit all the video windows in a given display area. At the same time, if the display is that of a computer, it may be shared by other applications, and thus the user may restrict the videoconferencing application window to a subset of the computer's screen. Another example is a video surveillance application, where feeds from multiple cameras may arrive at a control station and again have to be displayed on one or more physical display devices (computer or TV monitors). Yet another application is multi-program television, where a single device displays multiple programs at the same time. Moreover, with video programming increasingly being available on the Internet, it is easy to create players that provide functionality similar to the traditional picture-in-picture mode of analog or digital TVs, but with a larger set of views.
Multiple views on a given screen are typically organized in a rectangular grid pattern. For example, with four feeds of the same size, one can partition the screen area into a rectangular array of 2-by-2 smaller views or windows, and display each feed in its own window. Typically, the smaller views contain scaled-down versions of the original feeds, so that they fit within the allocated screen area. In conversational applications such as videoconferencing, it is also common to display the active speaker in a larger view, e.g., occupying one of the corners of the screen, with other participants shown in smaller views surrounding the main one at its sides.
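By way of illustration only, the rectangular grid partitioning described above can be sketched as follows. The function name `grid_layout`, the near-square choice of columns, and the pixel-coordinate convention are assumptions made for this sketch and are not taken from any particular system.

```python
import math

def grid_layout(num_feeds, screen_w, screen_h):
    """Return one (x, y, width, height) window per feed on a near-square grid.

    Hypothetical helper: the grid shape (ceil(sqrt(n)) columns) is an assumption.
    """
    if num_feeds <= 0:
        return []
    cols = math.ceil(math.sqrt(num_feeds))   # number of columns in the grid
    rows = math.ceil(num_feeds / cols)       # rows needed to hold all feeds
    cell_w = screen_w // cols                # each feed is scaled to fit one cell
    cell_h = screen_h // rows
    windows = []
    for i in range(num_feeds):
        r, c = divmod(i, cols)               # row-major placement
        windows.append((c * cell_w, r * cell_h, cell_w, cell_h))
    return windows
```

With four feeds on a 1280-by-720 screen, this sketch yields the 2-by-2 arrangement mentioned above, with each window measuring 640 by 360 pixels.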
In traditional videoconferencing systems that use a transcoding Multipoint Control Unit (MCU), the composition of the individual feeds happens at the MCU itself. The MCU receives the incoming feeds from transmitting participants, decodes them, and composes them into a new frame after appropriate downscaling. It then encodes the composited signal and transmits it to the intended recipient(s). If the MCU supports personalized layout, then the composition and encoding are performed individually for each recipient. A given participant selects the desired layout, and informs the MCU in order for it to produce the desired composition. The composition options are pre-configured at the MCU, and any changes to the available patterns require its redesign or reprogramming.
In a general setting of a video player receiving and displaying multiple video sources, possibly also originating from different locations, it is the responsibility of the player to scale down the individual video pictures and compose them into the displayed picture. This gives the player complete flexibility to organize the layout in any way it chooses, but it also results in a total bit rate requirement that is the sum of the bit rates of the individual sources. In contrast, in a videoconferencing setting with a transcoding MCU, the bit rate of the received composited signal is that of a single video source. It is noted, however, that the need for the MCU to decode and re-encode the video streams adds considerable latency and also requires substantial computational power.
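The bit rate tradeoff described above can be made concrete with a short sketch. The function name and the bit rate figures used in the accompanying example are hypothetical and serve only to illustrate the arithmetic.

```python
def receiver_bit_rate_kbps(source_rates_kbps, composite_rate_kbps=None):
    """Bit rate arriving at the receiver, in kbit/s.

    Hypothetical helper: with player-side composition the receiver gets every
    source stream, so the rates add up; with a transcoding MCU it gets one
    composited stream whose rate is that of a single video source.
    """
    if composite_rate_kbps is not None:
        return composite_rate_kbps          # transcoding MCU: single stream
    return sum(source_rates_kbps)           # player-side composition: sum of sources
```

For example, assuming four sources at 512 kbit/s each, player-side composition requires 2048 kbit/s at the receiver, whereas a transcoding MCU delivering a 512 kbit/s composite requires only 512 kbit/s, at the cost of the added latency and computation noted above.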
A fundamental limitation in resolving the tradeoff between flexibility, complexity, and bit rate overhead in systems featuring multiple video views is the fact that such systems typically operate using traditional single-layer video codecs, such as H.264 AVC, VC-1, MPEG-4, MPEG-2, and VP6/VP7. An alternative coding technique is layered or scalable coding. Scalable coding is used to generate two or more “scaled” bitstreams collectively representing a given medium in a bandwidth-efficient manner at a corresponding number of fidelity points. Scalability can be provided in a number of different dimensions. For example, a video signal may be scalably coded in different layers at CIF and QCIF resolutions, and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolutions and frame rates may be obtainable from the coded bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer) or they can be multiplexed together in one or more bitstreams. For convenience in description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream.
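As a simple illustration of the example above, the following sketch enumerates the combinations of spatial resolution and frame rate that such a scalable bitstream might offer. The names `RESOLUTIONS`, `FRAME_RATES`, and `operating_points` are hypothetical, and whether every combination is actually obtainable depends on the codec's structure.

```python
from itertools import product

# Standard picture sizes (width, height) for the two resolutions named above.
RESOLUTIONS = {"QCIF": (176, 144), "CIF": (352, 288)}
FRAME_RATES = (7.5, 15.0, 30.0)   # frames per second

def operating_points():
    """List every (resolution name, picture size, frame rate) combination.

    Hypothetical helper: assumes all combinations are decodable, which a
    real codec may or may not permit.
    """
    return [(name, RESOLUTIONS[name], fps)
            for name, fps in product(RESOLUTIONS, FRAME_RATES)]
```

Under the stated assumption, two resolutions and three frame rates give six candidate fidelity points.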
Video codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently developed H.264 Scalable Video Coding (H.264 SVC) extension (Annex G of ITU-T Recommendation H.264, November 2007, incorporated herein by reference in its entirety). Scalable audio codecs include ITU-T G.729.1 and Speex (see www.speex.org).
Scalable video coding (SVC) techniques specifically designed for video communication are also described in commonly assigned international patent application No. PCT/US06/028365 “System and Method for Scalable and Low-Delay Videoconferencing Using Scalable Video Coding.” It is noted that even codecs that are not specifically designed to be scalable can exhibit scalability characteristics in the temporal dimension (e.g., MPEG-2 or H.264 AVC).
Scalable codecs typically have a pyramidal bitstream structure. Using H.264 SVC as an example, a first fidelity point is obtained by encoding the source using standard H.264 techniques (Advanced Video Coding—AVC). An additional fidelity point can be obtained by encoding the resulting coding error (the difference between the original signal and the decoded version of the first fidelity point) and transmitting it in its own bitstream. This pyramidal construction is quite common (e.g., it was used in MPEG-2 and MPEG-4). The first (lowest) fidelity level bitstream is referred to as the base layer, and the bitstreams providing the additional fidelity points are referred to as enhancement layers. The fidelity enhancement can be in any fidelity dimension. For example, for video it can be temporal (frame rate), quality (Signal-to-Noise ratio or SNR), spatial (picture size), or 3-D (e.g., with a stereoscopic enhancement layer). For audio, it can be temporal (samples per second), quality (SNR), or additional channels.
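The pyramidal construction described above can be illustrated with a deliberately simplified sketch in which a coarse quantizer stands in for the base layer encoder and a finer quantizer encodes the resulting coding error. This is not an H.264 SVC implementation; the function names and the use of uniform quantization are assumptions made purely for illustration.

```python
def quantize(samples, step):
    """Uniform quantizer: snap each sample to the nearest multiple of `step`."""
    return [step * round(s / step) for s in samples]

def encode_two_layers(samples, base_step, enh_step):
    """Hypothetical two-layer encoder.

    The base layer is a coarse version of the signal; the enhancement layer
    encodes the coding error (original minus decoded base layer).
    """
    base = quantize(samples, base_step)
    residual = [s - b for s, b in zip(samples, base)]
    enhancement = quantize(residual, enh_step)
    return base, enhancement

def decode(base, enhancement=None):
    """Decode the base layer alone, or base plus enhancement if available."""
    if enhancement is None:
        return list(base)
    return [b + e for b, e in zip(base, enhancement)]
```

In this sketch, a decoder that receives only the base layer obtains the first fidelity point, while a decoder that also receives the enhancement layer adds the coded residual back and obtains a closer approximation of the original samples.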
Another example of a scalable or layered representation is multiple description coding. Here the construction is not pyramidal: each layer is independently decodable and provides a representation at a basic fidelity; if more than one layer is available to the decoder, however, then it is possible to provide a decoded representation of the original signal at a higher level of fidelity. One example is transmitting the odd and even pictures of a video signal as two separate bitstreams. Each bitstream alone offers a first level of fidelity, whereas any information received from other bitstreams can be used to enhance this first level of fidelity. In this sense, any of the streams may act as a base layer. If all streams are received, then a complete representation of the original signal at the maximum level of quality afforded by the particular representation is obtained.
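The odd/even picture example above can be sketched as follows. The function names are hypothetical, and pictures are represented as arbitrary list elements; a real system would encode each description as its own bitstream.

```python
def split_descriptions(frames):
    """Split a picture sequence into two descriptions: even-indexed and odd-indexed pictures."""
    return frames[0::2], frames[1::2]

def reconstruct(even=None, odd=None):
    """Hypothetical decoder for the two descriptions.

    With both descriptions, the pictures are interleaved back into the full
    sequence. With only one, that description is returned as-is, i.e. a
    representation at half the original frame rate.
    """
    if even is not None and odd is not None:
        merged = []
        for i in range(len(even) + len(odd)):
            source = even if i % 2 == 0 else odd
            merged.append(source[i // 2])
        return merged
    available = even if even is not None else odd
    return list(available) if available is not None else []
```

Either description alone acts as a base layer in this sketch, and receiving both restores the complete sequence.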
Yet another example of a layered representation is simulcasting. In this case, two or more independent representations of the original signal are encoded and transmitted in their own streams. This is often used, for example, to transmit Standard Definition TV material and High Definition TV material. It is noted that simulcasting is a special case of pyramidal scalable coding where no inter-layer prediction is used. In the following, all such layered coding techniques are referred to as scalable coding, unless explicitly specified otherwise.
Scalable coding offers significant advantages for packet-based video and audio communication, including reduced delay, reduced complexity, and improved system scalability.
International Patent Application No. PCT/US06/028365 discloses techniques where the Scalable Video Communication Server (“SVCS”) (or Scalable Audio Communication Server (“SACS”), in the case of a scalable audio signal) may utilize the scalable aspects of the audio signal to ensure smooth transitions between speakers by transmitting the full resolution signal for the active speaker and base layer only for a number of other participants (prioritized by, for example, computed volume).
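The layer selection described above can be sketched in simplified form as follows. This is not the method of the referenced application; the function name `select_layers`, the `"full"` and `"base"` labels, and the representation of computed volume as a number per participant are assumptions made for illustration.

```python
def select_layers(volumes, active_speaker, max_base_only):
    """Decide which layers to forward for each participant.

    Hypothetical helper. `volumes` maps participant id -> computed volume.
    The active speaker is forwarded at full resolution; up to `max_base_only`
    other participants, prioritized by volume, are forwarded as base layer
    only. Participants not in the result are not forwarded at all.
    """
    selection = {active_speaker: "full"}
    others = sorted(
        (p for p in volumes if p != active_speaker),
        key=lambda p: volumes[p],
        reverse=True,                      # loudest participants first
    )
    for participant in others[:max_base_only]:
        selection[participant] = "base"
    return selection
```

Because the loudest non-active participants already have their base layers flowing in this sketch, a change of active speaker only requires adding enhancement layers to a stream that is already being received, which is one way the smooth transitions mentioned above could be supported.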
For example, SVCS units hosted on standard PC-based hardware platforms can support 100 users or more. The ability to effectively host sessions with a large number of users poses challenges for view layout management as, for example, with more than 10-15 users it becomes difficult to effectively combine all users on a single display. The disclosed subject matter presents systems and methods for effectively managing view layout in such systems.