1. Technical Field
The present invention relates to compressed digital video delivery systems such as cable TV (CATV), satellite TV, Internet protocol TV (IPTV), and Internet-based video distribution systems. In particular, it relates to the use of a low-delay, layered codec and the corresponding low-delay transport typically used in videoconferencing systems. The disclosed digital video delivery system allows a group of watchers to watch one or more selected items of video content in such a way that the video is displayed synchronously regardless of location and network bandwidth.
2. Background Art
Subject matter related to the present application can be found in U.S. patent application Ser. No. 12/015,956, filed and entitled “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING,” Ser. No. 11/608,776, filed and entitled “SYSTEMS AND METHODS FOR ERROR RESILIENCE AND RANDOM ACCESS IN VIDEO COMMUNICATION SYSTEMS,” Ser. No. 11/682,263, filed and entitled “SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE, RANDOM ACCESS AND RATE CONTROL IN SCALABLE VIDEO COMMUNICATIONS,” 61/172,355, filed and entitled “SYSTEM AND METHOD FOR INSTANT MULTI-CHANNEL VIDEO CONTENT BROWSING IN DIGITAL VIDEO DISTRIBUTION SYSTEMS,” Ser. No. 11/865,478, filed and entitled “SYSTEM AND METHOD FOR MULTIPOINT CONFERENCING WITH SCALABLE VIDEO CODING SERVERS AND MULTICAST,” Ser. No. 11/615,643, filed and entitled “SYSTEM AND METHOD FOR VIDEOCONFERENCING USING SCALABLE VIDEO CODING AND COMPOSITING SCALABLE VIDEO SERVERS,” and co-pending provisional U.S. Patent Application Ser. No. 61/060,072, filed and entitled “SYSTEM AND METHOD FOR IMPROVED VIEW LAYOUT MANAGEMENT IN SCALABLE VIDEO AND AUDIO COMMUNICATION SYSTEMS,” as well as U.S. Pat. No. 7,593,032, filed and entitled “SYSTEM AND METHOD FOR A CONFERENCE SERVER ARCHITECTURE FOR LOW DELAY AND DISTRIBUTED CONFERENCING APPLICATIONS.” All of the aforementioned related applications and patents are hereby incorporated by reference herein in their entireties.
There are many applications where a group of people would like to participate or collaborate while watching live or recorded video content. A few of these are as follows:
Sports events: Sports fans visit large stadiums or sports bars not only to watch a game, but also to share the excitement with their friends, cheer together when their team scores, and share viewpoints during the game.
Education: Many schools have conference rooms from where the school can multicast a lecture to students. Some hospitals have capabilities to show every step of a surgery live to an audience. The students or the doctors may want to watch the lecture or surgery together so that they can share their viewpoints while watching the content remotely.
Gaming: Many TV game shows provide means for interaction with the audience through concepts such as “lifeline” or “helpline,” or simply asking the audience to vote on a specific question or scene. There are also gaming applications where the TV station may want to show remote players or the remote players may want to see one another and chat about the game while playing it.
Corporate Announcements: There may be company meetings, corporate announcements, customer presentations, etc., where a group of participants may want to share viewpoints while watching the corporate announcement.
News and Journalism: News events from all around the world often turn into the “talk of the day.” Many of the news events are of public interest. People would like to discuss, debate, and respond within groups while watching the news.
One can generate many other examples—e.g., fashion shows, family events, etc.—where a group collaborates over specific video content in real-time. Novel techniques which employ a low-delay and layered codec and its associated low-delay transport are described in co-pending U.S. patent application Ser. Nos. 12/015,956, 11/608,776, and 11/682,263, as well as U.S. Pat. No. 7,593,032.
With digital video coding/decoding techniques (codecs such as MPEG-2, H.263, or H.264) and packet network delivery, varying transport delays are introduced at each receiver, preventing synchronous playback in a multicasting or broadcasting system based on these technologies. These delays are caused by: (a) network delays due to varying route lengths between source and receiver, and (b) delays resulting from buffering by the decoder at the receiving end, which is necessary to alleviate the effects of: (i) delay jitter caused by varying queuing delays in transport network routers; (ii) packet losses in the network; and/or (iii) bandwidth changes in the transport network (such as variable link bandwidths experienced in wireless networks).
IPTV and other packet network based video distribution systems suffer from both network delays and buffering delays. In the evolving IPTV environment, particularly where video is delivered over a best-effort network such as the public Internet, where the network conditions are totally unpredictable, these delays can be significant (for example, up to a few tens of seconds). Depending on the location of each receiver relative to the video source, the delay variation component due to network conditions can be significant, and each receiver can receive the same video frame at a different time.
The source video synchronized conferencing system of the present invention has two overlaid architectures, each with different requirements:
(1) Synchronous Video Distribution: A video source sends specific video content to a group of users (one-way) such that each user can watch exactly the same video at the same time. This system requires “delay equalization,” although there is no strict delay limitation.
(2) Multipoint Video Conferencing: A group of users can interact with each other (two-way) using a multipoint video conferencing system. This system requires strict “delay control,” since interactions must take place in real-time, requiring strict delay bounds.
While it is possible to overlay a traditional streaming based video distribution system with a typical conferencing system to approximate the system disclosed in this invention, this type of overlay cannot control delay to achieve the required synchronized watching.
Network delay equalization to achieve synchronicity can be done by employing different methods:
(1) Maximum Delay Based Equalization: This method employs an out of band control layer, which measures the delay between the video source and each receiver in the group, and adjusts each receiver's display time according to the maximum delay. This method requires the measurement of all delays and a means for determining and distributing the value of the maximum delay to all participants throughout the session, because: (a) changing network conditions may result in changing delays as the video is being delivered, and (b) there may be new users with varying delays added to the group.
(2) Longest Route Delay Based Equalization: With this technique, the video source sends the same video to each receiver, but along network routes that give essentially the same amount of delay (if there are multiple routes available for each receiver). For example, when the video source is in New York, and there are two users in New York and two users in California, the computation of route lengths results in serving the users in New York using a longer route, for example, through Atlanta and back to New York, to attain the same geographical distance between the video source in New York and users in both New York and California. This method may not be practical where no such equalizing routes are available. Even where such routes are available, the system uses the network inefficiently by selecting long routes for receivers that are closer to the video source, and it is very difficult, if not impossible, to deal with path delay variations.
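The bookkeeping behind maximum delay based equalization can be illustrated with a minimal sketch. It assumes that the one-way delay from the video source to each receiver has already been measured by the out of band control layer; the function name `equalization_offsets`, the receiver labels, and the delay values are hypothetical and chosen only for illustration.

```python
def equalization_offsets(delays_ms):
    """Return the extra display delay each receiver should add.

    delays_ms maps a receiver id to its measured source-to-receiver
    delay in milliseconds (assumed to come from the control layer).
    """
    if not delays_ms:
        return {}
    max_delay = max(delays_ms.values())
    # Every receiver waits long enough to match the slowest one.
    return {rid: max_delay - d for rid, d in delays_ms.items()}


# Hypothetical measurements for a four-member group.
measured = {"ny-1": 12, "ny-2": 15, "ca-1": 48, "ca-2": 55}
offsets = equalization_offsets(measured)

# A new member with a larger delay changes the maximum, so every
# offset has to be recomputed and redistributed to the group.
measured["eu-1"] = 90
offsets_after_join = equalization_offsets(measured)
```

The second call shows why the maximum must be tracked throughout the session: a single new or degraded receiver shifts the display time of every other participant.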
Although the above described methods or similar techniques can be used to equalize the network delay in a streaming based video distribution system, receiver side buffering delay can be even more significant. The decoder of a streaming system relies on buffering at the receiver as a mechanism for error resilience. Network-triggered error conditions can occur due to congestion, even when transport delays are equalized or non-existent. Buffering at the receiver due to retransmission of lost packets causes insurmountable delay variations, as described in co-pending U.S. patent application Ser. Nos. 11/608,776 and 11/682,263. Although the largest receiver buffer size can be communicated to all receivers (similar to maximum network delay based equalization) so that each receiver delays its display until the receiver with the largest buffer can display the video, none of these systems can be used for live interaction among video watchers.
In order to eliminate the buffering delays at the receiver, the present invention uses a video conferencing system for the aforementioned video distribution system instead of a streaming system. However, given that transport delays are usually the biggest component of delay, a generic video teleconferencing codec does not alleviate the delay problems altogether. Therefore, the present invention uses the low-delay layered codec and its corresponding low-delay transport system, described in co-pending U.S. patent application Ser. Nos. 12/015,956, 11/608,776, and 11/682,263, as well as U.S. Pat. No. 7,593,032, which generates multiple layers of video and protects only the vital base layer. These techniques eliminate the need for any buffering at the receiver at the cost of slight performance degradation in the event of packet loss or excessive packet delay. In addition, the layered codec instantly generates synchronization frames without any need for future frames. The same system is employed for the multipoint video conferencing as well.
Traditional video codecs, such as H.261, H.263 (used in videoconferencing) or MPEG-1 and MPEG-2 Main Profile (used in Video CDs and DVDs, respectively), are designed to provide a single bitstream at a given bitrate. Although some video codecs are designed without rate control, thus resulting in a variable bit rate stream (e.g., MPEG-2), video codecs used for communication purposes establish a target operating bitrate depending on the specific infrastructure. These designs assume that the network is able to provide a constant bitrate due to a practically error-free channel between the video source and the receiver. The H-series codecs, designed specifically for person-to-person communication applications, offer some additional features to increase robustness in the presence of channel errors, but are still only tolerant to a very small percentage of packet losses (for example, 2-3%).
A limitation of single layer coding arises where a lower spatial resolution is required, such as a smaller frame size. The full resolution signal must be sent and decoded at the receiving end, with downscaling performed at the receiver or at a network device, thus wasting bandwidth and computational resources. However, support for lower resolutions is essential in the overlay video conferencing application, as one goal is to fit as many users and mini browsing windows (MBWs), which are naturally of lower resolution than the main video program, as possible into a specific screen area.
A layered codec, alternatively known as layered coding or scalable codecs/coding, is a video compression technique that has been developed explicitly for heterogeneous environments. In such codecs, two or more layers are generated for a given source video signal: a base layer and at least one enhancement layer. The base layer offers a basic representation of the source signal at a reduced quality, which can be achieved, for example, by reducing the Signal-to-Noise Ratio (SNR) through coarse quantization, using a reduced spatial and/or temporal resolution, or a combination of these techniques. The base layer can be transmitted using a reliable channel, i.e., a channel with guaranteed or enhanced Quality of Service (QoS). Each enhancement layer increases the quality by increasing the SNR, spatial resolution, or temporal resolution, and can be transmitted with reduced or no QoS. In effect, a user is guaranteed to receive a signal with at least the minimum level of quality of the base layer signal.
Further objectives of using layered coding in synchronized viewing are to offer a personalized view or layout on each video display (i.e., each receiver may display different numbers and sizes of MBWs) and to provide rate matching (i.e., each receiver may use an IP network connection with a different bandwidth and may need to receive a different data rate).
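Rate matching with a layered bitstream can be sketched as follows. This is a minimal illustration, not the method of the referenced applications: it assumes each enhancement layer depends on all layers before it, and the function name `select_layers` and the bitrates are hypothetical.

```python
def select_layers(layer_rates_kbps, available_kbps):
    """Return how many layers (base layer first) fit the bandwidth.

    layer_rates_kbps lists the bitrate of each layer in order; each
    enhancement layer is assumed to depend on the ones before it.
    """
    if not layer_rates_kbps:
        return 0
    # The base layer is always selected: it is the minimum usable quality.
    chosen = 1
    total = layer_rates_kbps[0]
    for rate in layer_rates_kbps[1:]:
        if total + rate > available_kbps:
            break  # this layer and all later ones are dropped
        total += rate
        chosen += 1
    return chosen


# Hypothetical stream: 200 kbps base plus two enhancement layers.
rates = [200, 300, 700]
layers_dsl = select_layers(rates, 600)     # base + first enhancement
layers_fiber = select_layers(rates, 2000)  # all three layers
layers_poor = select_layers(rates, 150)    # base layer only
```

Because the same encoded layers serve every receiver, the three receivers above get different data rates without any re-encoding.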
In a layered video coding architecture, the source video (for example, a football game playing on a TV channel) and the receivers in the group transmit a layered bitstream (base layer plus one or more enhancement layers) using a corresponding number of physical or virtual channels on the network, such as the public Internet. The base layer channel is assumed to offer higher QoS, whereas the enhancement stream channels offer lower or even no QoS. This architecture ensures that the base layer always arrives at the decoder with almost no loss.
Losses in the enhancement streams will result in a graceful degradation of picture quality. The encoder accordingly selects the correct amount and type of information that is required based on user preference information, such as number or size of MBWs, or properties of the receiver, such as available bandwidth, and forwards only that information to the user's receiver. Little or no signal processing is required of the layered encoder in this respect; the layered encoder simply reads the packet headers of the incoming data and selectively forwards the appropriate packets to each user. The various incoming packets are aggregated to two or more channels (for each MBW), and base layer packets are transmitted over the high reliability channel.
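The header-based selective forwarding described above can be sketched in a few lines. The packet fields (`stream_id`, `layer`) and the function `forward_for_receiver` are hypothetical stand-ins; the actual packet header format is not specified here. The point of the sketch is that only header fields are inspected and the payload is never decoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    stream_id: str   # which video the packet belongs to (assumed field)
    layer: int       # 0 = base layer, 1 and up = enhancement layers
    payload: bytes   # opaque coded data, never inspected here


def forward_for_receiver(packets, max_layer_by_stream):
    """Yield only the packets one receiver should get.

    max_layer_by_stream maps a stream id to the highest layer that
    receiver wants; streams missing from the map are not forwarded.
    """
    for p in packets:
        limit = max_layer_by_stream.get(p.stream_id)
        if limit is not None and p.layer <= limit:
            yield p


# Hypothetical input: a main program with three layers and one
# participant video with two layers.
incoming = [Packet("game", l, b"") for l in range(3)]
incoming += [Packet("friend", l, b"") for l in range(2)]

# This receiver shows the game full size and the friend in an MBW,
# so it takes every game layer but only the friend's base layer.
prefs = {"game": 2, "friend": 0}
sent = list(forward_for_receiver(incoming, prefs))
```

Changing a receiver's layout only changes the entries in its preference map; the incoming packets themselves are untouched.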
If a user elects to enlarge one MBW to the main screen (to view the video in large size), the main video program can be swapped to an MBW. As a result, only the base layer of the video content is sent and displayed at that MBW.
The use of the layered codec can eliminate the need to decode and re-encode the video on the encoder side or at network devices (e.g., multipoint control units) to generate different spatial/temporal patterns for each user, and therefore introduces no algorithmic delay. Most significantly, the computational requirements on the encoder are reduced greatly.
The use of a conferencing system can imply use of a Scalable Video Conferencing Switch (SVCS) to achieve the effects of multipoint conferencing and the utility of sending only the base layer or the base layer and one or more enhancement layers based on user MBW preferences and network capabilities.
IPTV video distribution to a large number of receivers using streaming technology is well understood in the prior art. Although SVCS-based video conferencing can be used to distribute the source video to receivers, it is worthwhile to mention the typical video distribution techniques for streaming video. There are two key approaches: (1) Application Layer Multicasting, as described in Suman Banerjee, Bobby Bhattacharjee and Christopher Kommareddy, “Scalable application layer multicast,” ACM SIGCOMM Computer Communication Review, Volume 32, Issue 4 (October 2002), is performed above the IP layer; and (2) IP layer multicasting is performed by the IP network.
Application Layer Multicasting can be implemented using Content Distribution Networks (CDN) where the content of the video source is replicated and cached at a downstream server closer to clusters of receivers to minimize the amount of network traffic. Other types of systems can use receivers to propagate the video as in peer-to-peer (P2P) implementations. Many variants of CDNs and associated services are commercially available in the market.
IP Multicast is another well-known technique for many-to-many communications over an IP infrastructure, as described in “IP Multicast Applications: Challenges & Solutions,” RFC 3170, IETF, http://www.ietf.org/rfc/rfc3170.txt and co-pending U.S. patent application Ser. No. 11/865,478. IP Multicast efficiently uses IP network infrastructure by requiring the source to send a packet only once, even if the packet needs to be delivered to a large number of receivers. The nodes in the network replicate the packet for delivery to multiple receivers only where necessary. Key concepts in IP Multicast include an IP Multicast group address, a multicast distribution tree, and receiver driven tree creation.
An IP Multicast group address is used by video sources and receivers to send and receive content. A source uses the group address as the IP destination address in its data packets. A receiver uses the group address to inform the network that it is interested in receiving packets sent to that group address. For example, if video content is associated with group 239.1.1.1, the source will send data packets destined for 239.1.1.1. Receivers for that content will inform the network that they are interested in receiving data packets sent to the group 239.1.1.1. The receiver “joins” 239.1.1.1.
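On a typical host, the “join” is issued through the standard sockets interface. The following minimal sketch uses Python's `socket` module with the `IP_ADD_MEMBERSHIP` option; the helper names `membership_request` and `join_group` and the UDP port are assumptions made for illustration, and only the group address 239.1.1.1 comes from the example above.

```python
import ipaddress
import socket


def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address, then local interface.

    An interface of 0.0.0.0 lets the operating system pick one.
    """
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return socket.inet_aton(group) + socket.inet_aton(interface)


def join_group(group, port):
    """Open a UDP socket and join the multicast group on the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # This call is what tells the network the host wants the group's traffic.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


# The request a receiver would issue for the group in the example.
mreq = membership_request("239.1.1.1")
```

A receiver would then call `join_group("239.1.1.1", port)` with whatever port the content uses and read packets from the returned socket; the source needs no membership and simply sends to the group address.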
Once the receivers join a particular IP Multicast group, a multicast distribution tree is constructed for that group. The protocol most widely used for this is Protocol Independent Multicast (PIM). PIM sets up multicast distribution trees such that a data packet from a sender to a multicast group reaches all receivers that have “joined” the group. There are many different flavors of PIM: Sparse Mode (SM), Dense Mode (DM), Source Specific Mode (SSM) and Bidirectional Mode (Bidir).
The distribution of video content in a massively scalable video conferencing session where there is only one video source (or few video sources) and a very large number of receivers (who do not send any video) can utilize a single SVCS, a distributed SVCS, or a plurality of cascaded SVCSs, as described in co-pending U.S. patent application Ser. No. 11/615,643 and U.S. Pat. No. 7,593,032. Unless otherwise noted, henceforth, the term “SVCS” refers to any of single, distributed, or cascaded SVCS.