1. Field of the Invention
The present invention relates generally to video conference systems, and more particularly to decreasing end-to-end delay during video conferencing sessions.
2. Description of the Related Art
The well-known National Television System Committee (NTSC) and Phase Alternating Line (PAL) television standards are employed by video cameras and monitors to capture and display video information for consumer applications. Both NTSC and PAL cameras and monitors capture and display video information in an interlaced format. Interlacing refers to a method of capturing two fields of video information per frame. One half of the vertical resolution of a frame (i.e., every other horizontal line) is captured in a first or “top” field. The remaining half of the vertical resolution of the frame is captured in a second or “bottom” field. Each frame of a video picture produced by an NTSC camera or displayed by an NTSC monitor has a 480-line format with each line having 720 pixels, while the PAL format has 576 lines. NTSC video is transmitted at 60 fields (30 frames) per second, and PAL video is transmitted at 50 fields (25 frames) per second. Adaptations of these formats have been adopted for emerging high-definition television as well.
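The field structure described above can be illustrated with a minimal sketch. It assumes that the top field holds the even-numbered lines starting at line 0 and that a frame is represented as a simple list of rows; the function names are hypothetical and are used only for illustration.

```python
def split_fields(frame):
    """Split a frame (a list of rows) into its two interlaced fields.

    Assumption: the top field holds lines 0, 2, 4, ... and the bottom
    field holds lines 1, 3, 5, ...
    """
    top = frame[0::2]
    bottom = frame[1::2]
    return top, bottom


def weave_fields(top, bottom):
    """Interleave a top and a bottom field back into a full frame."""
    frame = []
    for top_line, bottom_line in zip(top, bottom):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


# A 480-line NTSC-sized frame of 720 pixels per line; each pixel stores
# its own line number so the split is easy to inspect.
frame = [[line] * 720 for line in range(480)]
top, bottom = split_fields(frame)
```

Each field carries 240 of the 480 lines, and weaving the two fields restores the original frame.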
Typically, NTSC or PAL cameras and monitors are used in conjunction with video conferencing systems that implement the International Telecommunication Union (ITU) Telecommunication Standardization Sector (ITU-T) H.263 standard (incorporated herein by reference in its entirety, including all annexes, appendices, and subparts thereof), since such devices are much less expensive than equipment that captures video information using progressive (non-interlaced) scan technology. Until recently, however, the H.263 standard did not directly support interlaced video transmission, but supported Common Intermediate Format (CIF), which is a non-interlaced frame consisting of 288 lines of 352 pixels each. The transmission rate for CIF video can be as high as 30 frames per second. Thus, video conference systems had to convert from NTSC (or PAL) into CIF before coding each input video frame. Such a conversion discards some spatial and temporal information, and thus degrades the picture quality. In this context, the “spatial information” is the pixels in both the vertical and horizontal directions that are not included in the CIF frame. Likewise, the discarded “temporal information” represents the fact that the 60 or 50 fields per second of the NTSC or PAL standard, respectively, are down-sampled to at most 30 frames per second in the CIF format.
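The spatial loss can be made concrete with a short calculation. The sketch below uses only the frame dimensions stated above (720 by 480 for NTSC and 352 by 288 for CIF); the names are hypothetical, and the pixel-count ratio is only a rough indicator since it ignores how a particular converter scales or crops the picture.

```python
NTSC_FRAME = (720, 480)  # (pixels per line, lines), as stated in the text
CIF_FRAME = (352, 288)   # (pixels per line, lines), as stated in the text


def pixels(frame_size):
    """Number of pixels in one frame of the given size."""
    width, height = frame_size
    return width * height


def retained_fraction(source, target):
    """Fraction of the source frame's pixel count that the target frame holds."""
    return pixels(target) / pixels(source)


ntsc_to_cif = retained_fraction(NTSC_FRAME, CIF_FRAME)
```

Under these figures, a CIF frame holds roughly 29 percent of the pixels of an NTSC frame.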
In recent years, the cost of the hardware and transmission bandwidth required for coding and transmitting interlaced video pictures has decreased. It is now considered economically practical for a video conferencing system to code interlaced pictures with the full spatial dimensions of NTSC or PAL input sources. The ITU has addressed this change in technology by adding Annex W to the H.263 standard.
Annex W describes how interlaced video signals can be encoded and decoded when transmitted in a single stream (or channel) of video information. The Annex W video encoding (or simply “coding”) scheme utilizes a reference frame from one field to predict a picture of another field. However, a top field in an interlaced video transmission scheme is a poor predictor of a bottom field and vice versa. Thus, using the top field to predict the bottom field can lead to poor picture quality during times of low motion.
This particular form of picture quality degradation is due to the fact that the camera creates a complete picture frame by first scanning for top field information and then scanning for bottom field information. Each field is thus separated spatially (by one line) and temporally (by the refresh period between the end of the top field and the end of the bottom field). This temporal and spatial separation can result in display jitter, which is more noticeable during times of low motion. With this problem in mind, Section W.6.3.11 of Annex W suggests that Annexes N or U of H.263 can be used to predict from more than one previous field. For example, two or three previous fields can be used to form a prediction of the next field. In particular, the field (or fields) to be used for prediction can be chosen (according to Annexes N or U) such that each top field is always predicted from the previous top field (or fields) and each bottom field is always predicted from the previous bottom field (or fields). In this way, the top field can be coded and transmitted in a stream completely separate from the stream containing the bottom field. Using the video information from the same field for prediction thus mitigates the picture quality problem described above.
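The same-parity prediction rule described above can be sketched as follows. This is not the syntax of Annex N or Annex U; it is only an illustration of the selection rule, assuming fields are numbered 0, 1, 2, ... in capture order so that even indices are top fields and odd indices are bottom fields. The function name and parameters are hypothetical.

```python
def same_parity_references(field_index, num_refs=1):
    """Return the indices of up to num_refs previous fields of the same parity.

    Assumption: fields alternate top, bottom, top, ... starting at index 0,
    so the previous field of the same parity is always two indices back.
    """
    refs = []
    candidate = field_index - 2
    while candidate >= 0 and len(refs) < num_refs:
        refs.append(candidate)
        candidate -= 2
    return refs


# Field 5 is a bottom field; its two nearest bottom-field references
# are fields 3 and 1.
refs_for_field_5 = same_parity_references(5, num_refs=2)
```

Because a top field never references a bottom field under this rule, the two parities form independent prediction chains, which is what allows them to be carried in separate streams.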
This field prediction scheme is also more resilient to errors. If one stream of video information is temporarily dropped, the other stream can continue. Since one field remains, there is always some video information to decode and display, albeit at a slower update rate.
Further, more than one processor may be used to encode a video stream more efficiently in a multiple-processor architecture. For example, one processor can code the stream of top fields, and a second processor can code the stream of bottom fields, where each processor is programmed to capture and encode either the top or the bottom field of video information. Each processor may receive both the top and bottom field streams and decode only one. Alternatively, the video conferencing system may be configured such that each processor receives only one of the field streams.
Several shortcomings exist in the above-described systems. First, dropped fields, caused by large amounts of motion or by transmission errors occurring in any one of the video signal transmission streams, can affect the quality of the displayed picture for an extended period of time. In such cases, the picture quality remains poor until the coding process recovers. For instance, if a field of information is lost during transmission for any reason, and a decoder signals an encoder to encode an “Intra” field (the use of Intra fields is described in the H.263 standard), the quality of that half of the picture (i.e., the lost field) will suffer for as long as it takes the encoder to recover from the error and/or encode the Intra field.
Another shortcoming of prior art systems is that the field that the encoder begins encoding with (at start up) is indeterminate. The receiving video conference system does not know a priori whether the first frame to be received will begin with a top field or a bottom field. This is so because, at the transmitting video conference terminal, the video camera starts generating and sending fields of video information before the encoder is ready to receive the information. After the encoder is itself initialized, the encoder begins processing at the beginning of the next field it sees.
This situation can cause additional and unacceptable transmission delay. If the received video stream begins with the same field that the encoder was initialized to expect, there is no problem and no added delay in subsequent encoding. If, however, the encoder receives the opposite field from the one expected, the encoder will wait (i.e., delay) for as much as an entire field capture time (e.g., 16.7 milliseconds) in order to receive and store the expected field. This image delay persists for the entire video conferencing session. Such a systematic delay can lead to unacceptable meeting dynamics and misunderstood conversations.
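The field capture time follows directly from the field rate, as the sketch below shows. It assumes nominal rates of 60 fields per second for NTSC and 50 for PAL, and it models the startup wait in the worst case of a full field period; the function names are hypothetical.

```python
def field_period_ms(fields_per_second):
    """Time to capture one field, in milliseconds."""
    return 1000.0 / fields_per_second


def encoder_startup_wait_ms(expected_field, first_available_field, fields_per_second):
    """Worst-case wait before the encoder sees the field it expects.

    If the first available field is already the expected one, there is no
    wait; otherwise the encoder waits up to one full field period.
    """
    if expected_field == first_available_field:
        return 0.0
    return field_period_ms(fields_per_second)


ntsc_period = field_period_ms(60)  # about 16.7 ms
pal_period = field_period_ms(50)   # 20 ms
```

Because the outcome depends only on which field happens to be available at initialization, the added delay is either zero or up to one field period, and it is fixed for the remainder of the session.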
In a dual-processor implementation, each processor is programmed to capture and encode either the top or the bottom field of video information (i.e., each processor receives both fields of video, but captures and encodes only one of them). Generally, at system start time, the encoder randomly sends either the top or the bottom field of video information first. Specifically, at the time that the video conferencing system is started, either the top or the bottom field of video can be available to either of the two processors. This is because the video camera starts generating and sending fields of video information before the processors are ready to receive video information, and the processors will capture the first field that is available after initialization.
The first field that the decoder receives can be indeterminate for other reasons as well. For instance, bit errors contained in a field can also cause the field to be dropped at the decoder or lost in the network. At startup, an interrupt is generated by the decoder, which has the effect of preparing the decoder to receive either the top or the bottom field of video (more precisely, the routine that services this interrupt determines which field the pointer will be initialized to). In some systems, one interrupt is generated every 16.7 msec (NTSC) or 20 msec (PAL), which is the time it takes to display one field of information. As a result of this interrupt, a display buffer pointer is set to a particular memory location. This location could, for instance, correspond to a first line (i.e., line 0) of the top field of video information. During normal operation, the display buffer pointer is changed by the processor whenever the processor services the interrupt. This interrupt is generated during a vertical blanking period (i.e., the period during which the monitor scanning moves back to the top of the display screen). The receipt and servicing of this interrupt results in the pointer being moved from a starting position (i.e., either top or bottom field location) to a second position (i.e., either bottom or top field location, respectively). Disadvantageously, if the first field that the encoder captures, encodes, and transmits is not the field that the decoder buffer pointer was initialized to, then the decoder must wait one full encoder capture period (e.g., 16.7 msec) for the next field to arrive. This wait adds 16.7 msec of end-to-end video delay to the system. When the total end-to-end video delay ranges from 150 to 200 msec due to bandwidth availability and network delay, removing 16.7 msec is significant.
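The pointer behavior and the resulting delay can be modeled with a minimal sketch. It assumes the pointer simply alternates between a top and a bottom field location on each vertical-blanking interrupt and uses the 16.7 msec NTSC field period from the text; all names are hypothetical and do not correspond to any actual decoder interface.

```python
TOP, BOTTOM = "top", "bottom"


def service_vblank_interrupt(pointer_field):
    """Move the display buffer pointer to the other field's location."""
    return BOTTOM if pointer_field == TOP else TOP


def added_delay_ms(pointer_field, first_received_field, field_period_ms=16.7):
    """Delay added when the first received field does not match the pointer.

    A mismatch costs one full field period; a match costs nothing.
    """
    if pointer_field == first_received_field:
        return 0.0
    return field_period_ms


def share_of_total(added_ms, total_ms):
    """Added delay expressed as a fraction of the total end-to-end delay."""
    return added_ms / total_ms


mismatch_delay = added_delay_ms(TOP, BOTTOM)
share_at_150 = share_of_total(mismatch_delay, 150.0)
share_at_200 = share_of_total(mismatch_delay, 200.0)
```

Under these assumptions, a mismatch accounts for roughly 8 to 11 percent of a 150 to 200 msec end-to-end delay.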
Since the first field that the decoder receives at the start of a video conference session is not determinate, the decoder may have to wait one field capture time (e.g., 16.7 msec) to store the next field in the display buffer, therefore delaying display of the image. This video image delay prevails for the entire video conferencing session.
One of the main problems with end-to-end video delay is that the delay affects video meeting dynamics. For example, a local person may make a statement and then watch a remote meeting participant while waiting for a response. If the response is delayed long enough, the local person cannot be sure whether the remote participant understood the statement. As another example, the local person may be listening to a first remote participant while waiting for an opportunity to break in and ask a question. If a second remote person is also waiting to break in at the same time, the second remote person will in all probability do so before the local person is aware that the first remote participant has stopped talking. In effect, people interrupt one another during a meeting in an “uncontrolled” manner. It is therefore very desirable to keep the end-to-end delay as short as possible, giving the meeting as “natural” a feeling as possible.
Therefore, there is a need for a method that avoids the introduction of additional delay in a video conferencing session.