Subject matter related to the present application can be found in the co-pending U.S. Provisional application Ser. No. 12/897,365, filed Oct. 4, 2010, and entitled “Automatic Temporal Layer Bit Allocation;” Ser. No. 12/015,956, filed Jan. 17, 2008 and entitled “System and Method for Scalable and Low-Delay Videoconferencing Using Scalable Video Coding;” Ser. No. 11/682,263, filed Mar. 5, 2007 and entitled “System and Method for Providing Error Resilience, Random Access and Rate Control in Scalable Video Communications;” and Ser. No. 11/608,776, filed Dec. 8, 2006 and entitled “Systems and Methods for Error Resilience and Random Access in Video Communication Systems”; as well as U.S. Pat. No. 7,593,032, filed Jan. 17, 2008 and entitled “System and Method for a Conference Server Architecture for Low Delay and Distributed Conferencing Applications.” All of the aforementioned related applications and patents are hereby incorporated by reference herein in their entireties.
It is important for an interactive video communications system (e.g., a video conferencing system) to optimize the one-way delay that is perceived by the user, to ensure a natural user experience. Several factors contribute to the total one-way delay, including, for example, video capture, pre-processing, encoding, packetization, network transmission, de-packetization, decoding, post-processing, and rendering times. The present invention relates to delay introduced by the network transmission time.
In the following description, the terms “link,” “link capacity,” “link bandwidth,” and so on are frequently used. These terms should be interpreted broadly. Specifically, the term “link” refers to any physical or virtual network connection between a sending unit and a receiving unit, based on any suitable physical infrastructure and protocol hierarchy. To give two examples, both an ISDN B channel and the virtual “connection” between two units over the Internet, using an IP/UDP/RTP protocol hierarchy, qualify as a “link.” If a narrower interpretation of the term is intended, suitable attributes will accompany the term “link.”
Further, “link capacity” or “link bandwidth” refers to the mid-term average number of bits per unit time that can be transported over a link according to the above definition. “Mid-term” refers to a time interval of, for example, a few seconds to a few minutes, short enough to ensure that fluctuations in the available bit rate in the case of TCP-fairly behaving connections are covered by this definition. In most protocol hierarchies, the link capacity in the aforementioned sense is limited by four factors: (1) the capacity of the physical link(s) involved in the end-to-end connection, (2) the possibly static allocation of bit rates on this physical link (e.g., the allocation between audio and video bits in H.320-type video conferencing systems), (3) the typically dynamic allocation of bits between different applications' traffic (e.g., video conferencing and web access), and (4) bandwidth throttling mechanisms that are part of the protocol infrastructure (e.g., TCP fairness related bandwidth management). Depending on the system architecture, factors (2), (3) and (4) can be indistinguishable from each other.
What is common to all four categories is that modern protocol hierarchies allow for the estimation of the limiting factors, consequently enabling an application to determine the number of bits it can use over a reasonable time interval (for example, a few seconds) without triggering excessive failures such as very high packet loss rates. This number of bits, divided by the time interval's duration, is herein referred to as “link capacity.” In a more complex example, a video conferencing session running on a PC over the Internet to another similar setup involves all four factors: the limitations of the physical links and Internet routers, network traffic that is non-video conference related but conveyed over the same physical links and routers, fixed allocations of bits for the audio data (which must be conveyed with a higher priority than video bits), and both TCP fairness and multi-application use of, and between, the video transmission and other (e.g., web browsing) traffic.
The term “bandwidth-limited link” refers to any physical or virtual connection that is constrained by a link capacity in the aforementioned sense.
The time required to convey a coded picture of a video sequence, as produced by an encoder, over a bandwidth-limited link in the above sense varies with the size of the coded picture. Keeping the coded picture size as small as possible enables the lowest delay possible. However, in video coding, the more bits that are spent on a given picture, the better the resulting quality, which can be as desirable as low delay from a user's viewpoint. Assuming low delay is the overriding priority, in order to present a user the lowest possible delay (first priority) at the best possible quality (second priority), the coded picture size in bits has to equal the link capacity, in bits, over a given time period divided by the number of pictures transmitted over the same time period (that is, the link capacity multiplied by the inverse of the picture rate). This situation is well known among those skilled in the art (see for example [“Rate control in DCT video coding for low-delay communications”, Ribas-Corbera, J, and Lei, Shawmin, IEEE Transactions on Circuits and Systems for Video Technology. Vol. 9, no. 1, pp. 172-185. February 1999]).
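The relationship between link capacity, picture rate, and coded picture size can be illustrated with a minimal sketch. The function names and the example figures (a 480 kbit/s video allocation at 30 pictures per second) are assumptions chosen for illustration and do not appear in the referenced literature.

```python
def picture_budget_bits(link_capacity_bps: float, pictures_per_second: float) -> float:
    """Target coded picture size: link capacity divided by the picture rate."""
    if link_capacity_bps <= 0 or pictures_per_second <= 0:
        raise ValueError("capacity and picture rate must be positive")
    return link_capacity_bps / pictures_per_second


def transmission_time_s(picture_bits: float, link_capacity_bps: float) -> float:
    """Time needed to convey one coded picture over the bandwidth-limited link."""
    return picture_bits / link_capacity_bps


# Hypothetical example values: 480 kbit/s available for video, 30 pictures/s.
budget = picture_budget_bits(480_000, 30)          # 16000.0 bits per picture
on_budget = transmission_time_s(budget, 480_000)   # one picture interval (1/30 s)
oversized = transmission_time_s(4 * budget, 480_000)  # four picture intervals
```

A picture coded at exactly the budget occupies the link for one picture interval; a picture four times that size occupies it for four intervals, delaying every picture queued behind it.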
An example can illustrate the relationship of delay, coded picture size, and link bandwidth. Receivers may not display a picture until the coded picture has been received in its entirety, decoded, and possibly post-processed. Such a receiver is assumed in the following example. Further assumed is a bandwidth-limited link in the aforementioned sense. The percentage of the link's capacity allocated for video can be less than 100% in the case of links shared between more than one user/application/service, or in the case of multimedia communication. Further assumed are a fixed video frame rate and a desire for good picture quality. Under these constraints, the lowest end-to-end delay is achieved when all coded pictures are of the same size. Large fluctuations in coded picture sizes, and resulting fluctuations in the transmission times of the coded pictures, should be avoided. In order to play back video at the original frame rate, a receiver must buffer decoded pictures or coded video bits taking into account a pessimistic (i.e., worst likely case) assumption of the maximum number of bits for a coded picture. A mechanism that ensures no, or only small, fluctuation in coded picture size can permit the receiver to keep the buffers small, and be optimized for low delay.
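The effect of picture size fluctuation on receiver buffering can be sketched as follows. This is a simplified model under stated assumptions: pictures are captured at fixed intervals, are sent in order over a link of constant capacity, and propagation, decoding, and post-processing times are ignored. The function name and the example sizes are hypothetical.

```python
def required_playout_delay_s(picture_sizes_bits, link_capacity_bps, frame_rate):
    """Smallest fixed delay between capture and display that lets every
    picture arrive in full before it is due, in this simplified model."""
    interval = 1.0 / frame_rate
    link_free_at = 0.0   # time at which the link finishes the previous picture
    worst_delay = 0.0
    for n, size in enumerate(picture_sizes_bits):
        capture_time = n * interval
        start = max(capture_time, link_free_at)   # wait for the link if busy
        link_free_at = start + size / link_capacity_bps
        worst_delay = max(worst_delay, link_free_at - capture_time)
    return worst_delay


# Both sequences carry 128000 bits in eight pictures (same average bit rate).
uniform = [16_000] * 8
bursty = [72_000] + [8_000] * 7

uniform_delay = required_playout_delay_s(uniform, 480_000, 30)  # about 1/30 s
bursty_delay = required_playout_delay_s(bursty, 480_000, 30)    # about 0.15 s
```

With identical average bit rates, the sequence containing one large picture forces the receiver to buffer for roughly 150 ms instead of roughly 33 ms, which is the pessimistic worst-case assumption described above.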
FIG. 1 shows an exemplary video conferencing system. The system 100 can, for example, include a camera 110 that collects light and converts it to a digital video signal (i.e., a sequence of digital video pictures), and a video encoder 120 that encodes the video pictures and places the resulting bitstream onto a network 125. A video decoder 130 decodes the bitstream received from network 125 (which can be sent from a similar system), producing a sequence of digital video pictures that are rendered on display 140. The video encoder 120 and/or video decoder 130 can be implemented, at least in part, on programmable hardware, such as a general purpose CPU, a DSP, or a similar device. In order to enable the operation of the programmable hardware, the system may utilize a computer readable medium 145, such as Flash ROM, ROM, CD ROM, DVD ROM, hard drive, or memory stick, containing instructions arranged to enable the programmable hardware to execute mechanisms as discussed below. While the invention is described in the context of a video conferencing system, it should be obvious to a person skilled in the art that other forms of systems involving video transmission can also take advantage of the invention. For example, an entertainment quality video capture/coding/transmission system optimized for live transmissions can benefit from the invention as well. Further, in certain environments, the camera 110 can be replaced by any other source for digital, uncompressed picture data (such as a digital video tape recorder, or the output of a computer based renderer that creates video). Similarly, the display 140 can be replaced by one or more other devices that use uncompressed digital picture data, such as a digital video recorder, data projector, or similar devices. The network 125 can be any form of a digital data network of sufficient bandwidth and service quality to support the encoder 120 and decoder 130.
In this disclosure, the glass-to-glass delay of a video conferencing system in operation is the time interval between the instant at which light from a scene enters the camera 110 and the instant at which a corresponding picture of that scene is presented on the display 140, excluding network transit time. The glass-to-glass delay can be measured by placing an encoding video conferencing system and a decoding video conferencing system (or any other appropriate system, including traffic generators/analyzers and such) in close physical proximity to each other, and connecting them using a network that offers substantially the same traffic characteristics (in terms of bandwidth, MTU size, and similar parameters) as the target network, with a known transmission delay. The glass-to-glass delay is the time difference between the capture and display of a stimulus video signal, minus the known signal transmission delay.
This disclosure also refers to “one way delay,” or “end to end delay.” If used without any qualifiers (such as glass-to-glass), both terms refer to the glass-to-glass delay as defined above plus the transmission delay, i.e., they refer to the delay as observed by the remote user, irrespective of the technical factors that contribute to them.
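The two delay definitions above reduce to simple arithmetic, sketched here. The function names and the millisecond figures are hypothetical and serve only to illustrate the definitions.

```python
def glass_to_glass_delay_ms(capture_to_display_ms: int, known_transmission_ms: int) -> int:
    """Glass-to-glass delay: measured capture-to-display time on the test
    network, minus that network's known transmission delay."""
    return capture_to_display_ms - known_transmission_ms


def one_way_delay_ms(glass_to_glass_ms: int, transmission_ms: int) -> int:
    """One-way (end-to-end) delay: glass-to-glass delay plus the
    transmission delay of the network actually in use."""
    return glass_to_glass_ms + transmission_ms


# Hypothetical lab measurement: 180 ms capture to display, 20 ms known test-network delay.
g2g = glass_to_glass_delay_ms(180, 20)   # 160 ms
# Hypothetical deployment over a target network with 45 ms transmission delay.
total = one_way_delay_ms(g2g, 45)        # 205 ms as observed by the remote user
```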
In video conferencing scenarios involving RTP/RTCP or similar application layer protocol media transmission standards, the network delay can be determined by both encoding and decoding systems through the monitoring of the RTCP receiver reports in conjunction with the RTCP sender reports. The details of this measurement technology can be found in IETF RFC 3550, available at http://www.ietf.org/rfc/rfc3550.txt.
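As a minimal sketch of the measurement described in RFC 3550, the round-trip time can be computed by the sender from three 32-bit values: the arrival time A of a receiver report (expressed as the middle 32 bits of an NTP timestamp), and the LSR (last sender report timestamp) and DLSR (delay since last sender report) fields of that report, all in units of 1/65536 second. The function names are hypothetical. Halving the round-trip time to approximate the one-way network delay is an assumption that holds only for roughly symmetric paths and is not part of RFC 3550.

```python
def rtt_seconds(arrival_ntp_mid32: int, lsr: int, dlsr: int) -> float:
    """Round-trip time per RFC 3550: A - LSR - DLSR, computed modulo 2**32
    because the timestamps wrap, in units of 1/65536 second."""
    rtt_units = (arrival_ntp_mid32 - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


def estimated_one_way_network_delay_s(rtt_s: float) -> float:
    """Assumption for illustration only: a symmetric path, so the one-way
    network delay is approximated as half of the round-trip time."""
    return rtt_s / 2.0


# Values from the worked example in RFC 3550, Section 6.4.1.
rtt = rtt_seconds(0xB7108000, 0xB7052000, 0x00054000)   # 6.125 seconds
one_way = estimated_one_way_network_delay_s(rtt)        # 3.0625 seconds
```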
In modern video coding standards, the content of the picture being encoded (hereafter called the “input picture”) is predicted and the encoder encodes the difference between the input picture and the prediction it creates. One prediction method, inter-picture prediction, builds a prediction for the input picture by referencing the contents of one or more other pictures in the reproduced video sequence. Traditional video conferencing systems can use, for example, the picture occurring immediately before the input picture as the reference from which the input picture prediction is created.
One technique that can advantageously be used in low delay video transmission applications is known as temporal scalability. Temporal scalability refers to techniques in which the prediction structure between pictures is chosen such that it is possible to drop a subset of the coded pictures (belonging to one or more temporal enhancement layers, also known as threads) from decoding without negatively affecting the inter-picture prediction relationships. In the context of the invention, any threaded picture structure can be used, such as those described in co-pending U.S. patent application Ser. No. 12/015,956, including the degenerate threaded picture structure that has only a single thread (which is similar to traditional IPPPP video coding, in which there is a reference only to previous picture(s), and no temporal scalability). As the invention can advantageously be practiced in conjunction with temporal scalability, a brief introduction to one temporal layering scheme is included.
Depicted in FIG. 3 is a four-picture prediction structure. Bold vertical lines are used to depict video pictures (310, 320, 330, etc.), the arrows point from the picture being used as a motion compensated reference to the input picture, and time progresses from left to right. The video picture 310 can be an intra coded picture (i.e., a picture for which a prediction is built using content from spatially adjacent blocks within the same picture as opposed to inter-picture prediction) or can be predicted from an earlier picture not shown in FIG. 3. Picture 320 references picture 310 to create its prediction while picture 330 also uses picture 310 as its reference. Picture 340 uses picture 330 as a reference and picture 350 uses picture 310; the prediction structure repeats in this manner until the entire sequence is encoded. No picture references pictures 320, 340, 360, or 380. As such, these pictures can be removed from the coded video sequence, thereby cutting the video frame rate in half, without disrupting the prediction chain. In contrast, the removal of picture 310 directly breaks the prediction of all pictures using it as a reference, and indirectly breaks the prediction of all remaining pictures. Once pictures 340 and 380 are removed, pictures 330 and 370 are no longer used as references and can also be removed, cutting the frame rate in half again, without breaking the prediction chain. In this way, the prediction pattern imparts a hierarchical structure onto the pictures. The hierarchical structure is emphasized in FIG. 3 through the use of vertical offsets of the picture representations. Pictures 320, 340, 360, and 380 are often said to belong to the highest temporal layer, while pictures 310 and 350 belong to the temporal base layer. The prediction structure shown in FIG. 3 and others having similar hierarchical properties are referred to as a “hierarchical prediction structure” or “threaded prediction structure.” Other hierarchical prediction structures, such as ones with more or fewer temporal layers, longer prediction periods and/or including bi-predicted pictures (B-pictures), can also be advantageous; the choice of prediction structure is determined by the application and is not a subject of this disclosure.
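The layer-dropping property of the structure in FIG. 3 can be checked with a short sketch. The references for pictures 310 through 350 follow the description above; the references for pictures 360, 370, and 380 are an assumption derived from the statement that the structure repeats. The numeric layer labels (0 for the base layer, 2 for the highest layer) and the function names are likewise hypothetical.

```python
PICTURES = [310, 320, 330, 340, 350, 360, 370, 380]

# Reference picture of each picture (None: intra coded, or reference not shown).
# Entries for 360, 370, 380 assume the pattern of 320, 330, 340 repeats.
REFERENCE = {310: None, 320: 310, 330: 310, 340: 330,
             350: 310, 360: 350, 370: 350, 380: 370}

# Temporal layer of each picture: 0 = base layer, 2 = highest layer.
LAYER = {310: 0, 350: 0, 330: 1, 370: 1, 320: 2, 340: 2, 360: 2, 380: 2}


def keep_up_to_layer(max_layer):
    """Pictures that remain after dropping all layers above max_layer."""
    return [p for p in PICTURES if LAYER[p] <= max_layer]


def prediction_chain_intact(kept):
    """True if every kept picture still has its reference picture available."""
    kept_set = set(kept)
    return all(REFERENCE[p] is None or REFERENCE[p] in kept_set for p in kept)


half_rate = keep_up_to_layer(1)      # [310, 330, 350, 370]
quarter_rate = keep_up_to_layer(0)   # [310, 350]
without_310 = [p for p in PICTURES if p != 310]
```

Dropping whole layers from the top leaves every remaining picture decodable, whereas removing picture 310 alone leaves pictures 320, 330, and 350 without their reference.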
In addition to the advantage that the frame rate can be reduced by simply removing certain coded pictures, the hierarchical prediction structure imparts a natural error resilience to the coded bitstream since transmission errors in non-base layer pictures (three out of every four pictures in the prediction structure shown in FIG. 3) are not propagated indefinitely as they would be in a more traditional non-hierarchical prediction structure. Examples designed to exploit the hierarchical prediction structure are described in co-pending U.S. patent application Ser. Nos. 11/608,776 and 11/682,263.
In practical low delay systems, such as video conferencing systems, the user experience can be greatly enhanced by adequately balancing the legitimate desires of both coding efficiency (which can lead to large fluctuations in the size of coded pictures) and low delay (which requires a uniform size of coded pictures). Therefore, it is advantageous to include a mechanism to control coded picture size as a function of the measured one-way delay.