1. Field of the Invention
The present invention relates generally to videoconference systems, and more particularly to providing video quality improvement.
2. Description of Related Art
The well-known National Television Standards Committee (NTSC) and Phase Alternating Line (PAL) television standards are employed by video cameras and monitors to capture and display video information for consumer applications. Both NTSC and PAL cameras and monitors capture and display video information in an interlaced format. Interlacing refers to a method of capturing two fields of video information per frame. One half of a vertical resolution of a frame (i.e., every other horizontal line) is captured in a first or “top” field. A remaining half of the vertical resolution of the frame is captured in a second or “bottom” field. Each frame of a video picture produced by the NTSC camera or displayed by the NTSC monitor is displayed in a 525-line format with each line having 720 pixels, while the PAL format is displayed in 625 lines. The NTSC video is transmitted at 60 frames per second, and the PAL video is transmitted at 50 frames per second. Adaptations of these formats have been adopted for emerging high-definition television as well.
Typically, the NTSC or PAL cameras and monitors are used in conjunction with video conferencing systems that implement the International Telecommunications Union (ITU) Telecommunications (ITU-T) H.263 standard (incorporated herein by reference in its entirety, including all annexes, appendices, and subparts thereof), since such devices are much less expensive than equipment that captures video information using progressive (non-interlaced) scan technology. Until recently, however, the H.263 standard did not directly support interlaced video transmission. The most common format used under H.263 before recent changes was the Common Intermediate Format (CIF), which is a non-interlaced frame consisting of 288 lines of 352 pixels each. Transmission rate for CIF video can be as high as 30 frames per second. Thus, videoconference systems had to convert from NTSC (or PAL) into CIF before coding each input video frame. Such a conversion discards some spatial and temporal information, and thus degrades the picture quality. In this context the “spatial information” is the pixels in both a vertical and a horizontal directions that are not included in the CIF frame. Likewise, the discarded “temporal information” represents the fact that a 60 or 50 frame per second (fps) transmission of the NTSC or PAL standard is down-sampled to 30 fps in the CIF format.
In recent years, cost of hardware and transmission bandwidth required for coding and transmitting interlaced video pictures has decreased. It is now considered economically practical for a video conferencing system to code interlaced pictures with a full spatial dimension of NTSC or PAL input sources. The ITU has addressed this change in technology by adding Annex W to the H.263 standard.
Annex W describes how interlaced video signals can be encoded and decoded when transmitted in a single stream (or channel) of video information. The Annex W video encoding (or simply “coding”) scheme utilizes a reference frame from one field to predict a picture of another field. However, a top field in an interlaced video transmission scheme is a poor predictor of a bottom field and vice versa. Thus, using the top field to predict the bottom field can lead to poor picture quality during times of low motion.
This particular form of picture quality degradation is due to the fact that the camera creates a complete picture frame by first scanning for top field information and then scanning for bottom field information. Each field is thus separated spatially (by one line) and temporally (by the refresh period between the end of the top field and the end of the bottom field). This temporal and spatial separation can result in display jitter, which is more noticeable during times of low motion. With this problem in mind, Section W.6.3.11 of Annex W suggests that Annexes Nor U of H.263 can be used to predict from more than one previous field. For example, two or three previous fields can be used to form a prediction of the next field. In particular, the field (or fields) to be used for prediction can be chosen (according to Annexes N or U) such that each top field is always predicted from the previous top field (or fields) and each bottom field is always predicted from the previous bottom field (or fields). In this way, the top field can be coded and transmitted in a stream completely separate from the stream containing the bottom field. Using the video information from the same field for prediction thus mitigates the picture quality problem described above.
This field prediction scheme is also more resilient to errors. If one stream of video information is temporarily dropped, the other stream can continue. Since one field remains, there is always some video information to decode and display, albeit at a slower update rate.
Further, more than one processor may be used to more efficiently encode a video stream in a multiple-processor architecture. For example, one processor can code the stream of top fields, and a second processor can code the stream of bottom fields; each processor is programmed to capture and encode either the top or bottom field of video information. Additionally, each processor may receive both streams of top or bottom fields and decode one. Conversely, the video conferencing system may be configured such that each processor only receives one of the field streams.
Several shortcomings exist in the above-described systems. Firstly, dropped fields caused by large amounts of motion or by transmission errors occurring in anyone of the video signal transmission streams can affect the quality of the displayed picture for an extended period of time. In such cases, the picture quality remains poor until the coding process recovers. For instance, if a field of information is lost during transmission for any reason, and a decoder signals an encoder to encode an “Intra” field (the use of Intra fields being described within the H.263 standard), the quality of that half of the picture (i.e., the lost field) will suffer for a period of time that it takes the encoder to recover from the error and/or encode the Intra frame.
Another shortcoming of prior art systems is that the field that the encoder begins encoding with (at start up) is indeterminate. The receiving videoconference system does not know a priori whether the first frame to be received will begin with a top field or a bottom field. This is so because, at the transmitting video conferencing terminal, the video camera starts generating and sending fields of video information before the encoder is ready to receive the information. After the encoder is itself initialized, the encoder begins processing at the beginning of the next field it sees.
This situation can cause additional and unacceptable transmission delay. If the received video stream begins with the same field that the encoder was initialized to expect, there are no problems and no added delay in subsequent encoding. If, however, the encoder receives the opposite field than the one that is expected, the encoder will wait (i.e., delay) for as much as an entire field capture time (e.g., 16.7 milliseconds) in order to receive and store the expected field. This image delay will prevail for the entire video conferencing session. Such a systematic delay can lead to unacceptable meeting dynamics and misunderstood conversations.
Therefore, there is a need for a video conferencing encoding and decoding method for interlaced video that immunizes against quality degradation in video signals and avoids introduction of additional delay.