1. Field of the Invention
This invention relates to video teleconferencing, and in particular, to generating a composite video from video sources having different picture rates.
2. Background Information
A video teleconference, as its name implies, is a conference in which several audio-visual terminals located remotely from each other participate. In one instance, the videoconferencing system allows for the simultaneous exchange of video, audio, and other data between terminals. As FIG. 1 shows, an example of such a system is a plurality of interconnected terminals 11, 12, 15, and 16. For the sake of example, the drawing shows the transmission medium as including an Integrated Services Digital Network (ISDN), and a Transmission Control Protocol/Internet Protocol (TCP/IP) network. In other words, videoconferencing can be performed by way of packet-switched networks as well as circuit-switched networks. A gateway 22 translates between protocols in the example.
A multipoint control unit (MCU) 20 receives signals from the various terminals, processes these signals into a form suitable for video teleconferencing, and retransmits the processed signals to the appropriate terminals. For example, the video signals from the various terminals may be spatially mixed to form a composite video signal that, when it is decoded, may display the various teleconference participants at one terminal. Usually, each terminal has a codec to encode video, audio and/or data signals to send to the MCU for appropriate distribution and to decode such signals from the MCU. Codecs for this purpose are well known in the art and are exemplified, for instance, in the International Telecommunication Union (ITU) Telecommunication Standardization Sector recommendation document H.261 (ITU-T Recommendation H.261).
The Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) is responsible for standardizing the technical aspects of telecommunication on a worldwide basis. Its H-series recommendations concern video teleconferencing. Among other H-series recommendations, H.221 defines frame structure, H.261 defines video coding and decoding, H.231 defines multipoint control units (MCUs), H.320 defines audio-visual terminals, and H.323 defines audio-visual terminals for networks that do not provide a guaranteed quality of service. How the various devices in the video teleconferencing system interact with each other using the various recommendations is now briefly described.
The H.320 terminals employed in the system transmit H.221 frames of multiplexed audio-video and data information. (These frames should not be confused with video frames, which we will hereafter refer to as “pictures” to distinguish them from transmission frames.) Each frame consists of one or more channels, each of which comprises 80 octets of bits, and each of the 8 octet bit positions can be thought of as a separate sub-channel within the frame. In general, certain bits of a given octet will contain video information, certain bits will contain audio information, and certain bits may contain data, as FIG. 2's first row illustrates. Additionally, the eighth bit in certain of a frame's octets (not shown in the drawings) represents control information by which, among other things, frame boundaries can be recognized. The precise bit allocation is determined through a session negotiation process among the involved video teleconferencing terminals.
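The sub-channel view of an H.221 frame can be illustrated with a minimal Python sketch. The function name, the bit-numbering convention (bit position 1 treated as the most significant bit), and the particular allocation of positions to audio and video are assumptions made only for illustration; in practice the allocation is whatever the terminals negotiate.

```python
def extract_subchannels(frame, positions):
    """Collect the bits found at the given octet bit positions (1..8)
    from one 80-octet channel of a frame.

    Assumption for illustration: bit position 1 is the most significant bit.
    """
    if len(frame) != 80:
        raise ValueError("expected 80 octets")
    bits = []
    for octet in frame:
        for pos in positions:
            bits.append((octet >> (8 - pos)) & 1)
    return bits


# Hypothetical allocation: positions 1-2 carry audio, positions 3-7 carry video.
frame = bytes([0b10110101] * 80)
audio = extract_subchannels(frame, [1, 2])
video = extract_subchannels(frame, [3, 4, 5, 6, 7])
```

Each selected bit position contributes 80 bits per frame, so the two audio positions above yield 160 bits and the five video positions yield 400 bits.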
The H.323 terminals employed in the system use the Real-time Transport Protocol (RTP), known to one skilled in the art, and set forth in the Request For Comments (RFC) 1889. RFCs are published by the Internet Engineering Task Force (IETF), a community dedicated to standardizing various aspects of the Internet. An H.323 terminal uses separate RTP sessions to communicate the conference's video and audio portions. Thus, as FIG. 2's first through third rows show, a gateway's operation of translating from H.221 to RTP involves demultiplexing the H.221 data stream into its video, audio, and data constituents so that the gateway can packetize the video, audio, and data separately. In particular, video bits are extracted from a succession of octets and concatenated into a stream that contains only the H.221 transmission's video parts. The stream is encoded in accordance with the H.261 recommendation at the terminal using a codec. Note that the encoding may instead be in accordance with the related H.263 recommendation. However, the discussion here generally focuses on the H.261 recommendation.
FIG. 3 illustrates a typical link layer packet suitable for transmission in accordance with the RTP protocol. If Ethernet is used for the link layer, information is sent in an Ethernet frame that begins and ends with an Ethernet header and trailer, which are used for sending the information to the next stop on the same local network. The frame's contents are an IP datagram, which also includes its own header, specified in RFC 791, for directing the datagram to its ultimate internetwork address. In video conference situations, RTP permits TCP to be used as the transport protocol (i.e., as the protocol for directing the information to the desired application at the destination internet address). However, the User Datagram Protocol (UDP) is preferable to TCP for videoconferencing because TCP's re-transmission of lost video streams is unnecessary under these situations. Thus, FIG. 3 depicts the IP payload as a UDP datagram, which includes a UDP header as specified in RFC 768.
Because packet-switched protocol data units do not in general arrive in order, and because real-time information must be presented in a predetermined time sequence, the UDP payload must include information specifying the sequence in which the information was sent and its real-time relationship to other packets. So the payload begins with an RTP header, specified in RFC 1889, that gives this and other information.
The RTP header format, depicted in FIG. 4, is shown as successive four-byte rows. RFC 1889 describes the various FIG. 4 fields' purposes in detail, so only the timestamp field is mentioned here. When information travels by way of a packet-switched network, different constituent packets make their ways to their common destination independently. That is, different packets can take different routes, so the times required for different packets to arrive at their respective destinations are not in general the same, and packets can arrive out of sequence or in time relationships that otherwise differ from those with which their contained information was generated. RTP therefore provides for a timestamp in each packet to indicate the real-time relationships with which the information is to be played. Typically, gateways and H.323 devices (e.g., terminals and MCUs) use a local clock to provide the RTP-required timestamp as they assemble H.261 packets.
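For orientation, the fixed twelve-byte part of the RTP header can be unpacked as in the following Python sketch. The field layout follows RFC 1889; the function name and the dictionary keys are illustrative choices, and optional CSRC entries and header extensions are deliberately ignored.

```python
import struct


def parse_rtp_header(packet):
    """Unpack the fixed 12-byte RTP header (RFC 1889 layout).

    CSRC identifiers and header extensions, if present, are not parsed.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": (b0 >> 5) & 1,
        "extension": (b0 >> 4) & 1,
        "csrc_count": b0 & 0x0F,
        "marker": b1 >> 7,
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Example header: version 2, payload type 31, sequence 7, timestamp 90000.
example = struct.pack("!BBHII", 0x80, 31, 7, 90000, 0x1234) + b"payload"
header = parse_rtp_header(example)
```

The sequence number lets a receiver restore packet order, while the timestamp carries the real-time relationship discussed above.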
However, the resultant timestamp information would be complicated to use at playback if no notice were taken of the actual contents of the data stream being packetized. For example, a single packet could contain parts of two different video pictures, so parts of different pictures would share the same timestamp, while parts of the same picture could have different timestamps. To avoid this, the packets need to be monitored for picture boundaries.
FIG. 2's fourth through seventh rows depict the structure that the incoming data stream uses to represent successive video pictures in accordance with H.261. The fourth row illustrates a data-stream portion covering a single video picture. It shows that the portion begins with a header, and FIG. 5 illustrates that header's structure.
The header field of importance here is the Picture Start Code (PSC). For H.261 streams, that field value is always 00010H, a sequence that cannot occur elsewhere in the data stream. If the length of a single-picture portion of the data stream exceeds the underlying protocol's maximum-transmission-unit size, the H.323 device breaks the single picture's data into multiple packets. For such packets, the timestamp entered is the same as that assigned to the last PSC-containing packet. In those instances, RFCs such as RFC 2032 entitled “RTP Payload Format for H.261 Video Streams” and RFC 2190 titled “RTP Payload Format for H.263 Video Streams,” both of whose contents are well known to those skilled in this art, specify how the picture's data should be packetized. Packetization can be appreciated by first reviewing the picture data's finer structure.
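The timestamp rule just described can be sketched in Python as follows. This is a simplified illustration, not a packetizer: it assumes the 20-bit PSC value 00010H, assumes that a PSC never straddles two packets, and uses hypothetical names (`find_picture_starts`, `assign_timestamps`) and made-up clock readings.

```python
PSC_BITS = "0000" "0000" "0000" "0001" "0000"  # 00010H as a 20-bit pattern


def find_picture_starts(data):
    """Return the bit offsets at which a Picture Start Code begins."""
    bits = "".join(f"{byte:08b}" for byte in data)
    starts = []
    i = bits.find(PSC_BITS)
    while i != -1:
        starts.append(i)
        i = bits.find(PSC_BITS, i + 1)
    return starts


def assign_timestamps(payloads, clock_readings):
    """Give each payload the clock reading of the last PSC-containing payload.

    Payloads seen before any PSC get None, since no picture start is known yet.
    """
    stamped, current = [], None
    for payload, reading in zip(payloads, clock_readings):
        if find_picture_starts(payload):
            current = reading
        stamped.append((current, payload))
    return stamped


picture_start = b"\x00\x01\x0f"   # PSC followed by four arbitrary bits
continuation = b"\xff\xff"        # contains no PSC
stamped = assign_timestamps(
    [picture_start, continuation, picture_start], [100, 200, 300]
)
```

Here the continuation packet inherits the reading 100 from the packet that carried the picture's PSC, while the third packet starts a new picture and takes the reading 300.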
As FIG. 2's fourth row indicates, the picture data's body portion is divided into “groups of blocks” (GOBs). H.261 specifies a Common Intermediate Format (CIF) in which each GOB represents one-twelfth of the resultant picture area, in a spatial relationship that FIG. 6 illustrates. H.261 also specifies an alternative, more-sparsely sampled quarter-CIF (QCIF) format. When QCIF is employed, each GOB represents one-third of the total picture area, as FIG. 7 illustrates.
FIG. 2's fourth row depicts the GOB fields as being unequal in length. This is because the degree of H.261-specified data compression depends on the source picture's data redundancy, which can differ from region to region.
FIG. 2's fifth row shows that each GOB field has its own header, and FIG. 8 illustrates a GOB header's structure. The GOB header begins with a Group-of-Blocks Start Code (GBSC). That code's value is 0001H, a sequence that cannot occur elsewhere (except in the PSC).
The GOB's Group Number (GN in FIG. 8) follows the GBSC code and specifies the GOB region's position in accordance with the scheme shown in FIG. 6 or FIG. 7. Next is a default quantization value GQUANT, which influences the contained data's interpretation by specifying the magnitude intervals at which the values were quantized. The header may additionally contain further, optional fields. FIG. 2's fifth row shows that a GOB is divided into so-called macroblocks, which correspond to subregions within the GOB regions. FIG. 9 illustrates a single-GOB picture segment's division into subregions represented by respective macroblocks. Although there are thirty-three such subregions in a GOB-represented region, FIG. 2 depicts somewhat fewer macroblocks than that, because macroblocks that are redundant in view of previous macroblocks can be omitted in accordance with H.261. (As those familiar with the H.261 specification will recognize, “previous” may have either a temporal or a spatial meaning; that specification admits of a variety of data-compression techniques.)
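The geometry implied by the group number and by the macroblock position can be made concrete with a small Python sketch. It assumes the usual H.261 luminance dimensions (CIF 352x288, QCIF 176x144, GOBs of 176x48 pixels, macroblocks of 16x16 pixels), a two-column CIF arrangement numbered 1 to 12, QCIF group numbers 1, 3, and 5, and macroblocks numbered 1 to 33 in three rows of eleven. The function names are illustrative.

```python
GOB_WIDTH, GOB_HEIGHT, MB_SIZE = 176, 48, 16


def gob_origin(gn, fmt="CIF"):
    """Top-left luminance pixel of the GOB with group number gn."""
    if fmt == "CIF":
        if not 1 <= gn <= 12:
            raise ValueError("CIF group numbers run from 1 to 12")
        col, row = (gn - 1) % 2, (gn - 1) // 2
    elif fmt == "QCIF":
        if gn not in (1, 3, 5):
            raise ValueError("QCIF uses group numbers 1, 3 and 5")
        col, row = 0, (gn - 1) // 2
    else:
        raise ValueError("unknown picture format")
    return col * GOB_WIDTH, row * GOB_HEIGHT


def macroblock_origin(gn, mb_address, fmt="CIF"):
    """Top-left luminance pixel of macroblock mb_address (1..33) in a GOB."""
    if not 1 <= mb_address <= 33:
        raise ValueError("macroblock addresses run from 1 to 33")
    x0, y0 = gob_origin(gn, fmt)
    col, row = (mb_address - 1) % 11, (mb_address - 1) // 11
    return x0 + col * MB_SIZE, y0 + row * MB_SIZE


print(gob_origin(12))             # (176, 240)
print(gob_origin(5, "QCIF"))      # (0, 96)
print(macroblock_origin(1, 33))   # (160, 32)
```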
FIG. 2's sixth row shows that each macroblock has its own header, and FIG. 10 illustrates that header's structure. The header's MacroBlock Address (MBA) field contains a variable-length code for the difference between the current macroblock's address and that of the previously sent macroblock of the same GOB (since not all macroblocks are sent for every GOB). The MTYPE field specifies the manner in which the current macroblock's data were encoded; the data may be the result of comparing the raw data with a neighbor macroblock's data, with the corresponding data from a previous picture, with filtered versions of either of them, etc. If an MQUANT field is present, its contents supersede the default quantization that the GQUANT field in the enclosing GOB's header specifies.
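The interplay of the differential MBA values and the two quantizer fields can be sketched as below. The variable-length decoding itself is omitted: the input is assumed to be already-decoded (address difference, optional MQUANT) pairs. The sketch also assumes that the first difference in a GOB is counted from address zero and that an MQUANT value stays in effect for later macroblocks of the same GOB until another MQUANT replaces it; both points should be checked against the H.261 text.

```python
def resolve_macroblocks(headers, gquant):
    """Turn (mba_difference, mquant_or_None) pairs into (address, quantizer).

    Assumptions: differences accumulate from address 0 within one GOB, and
    an MQUANT persists until overridden by a later MQUANT.
    """
    address, quantizer, resolved = 0, gquant, []
    for mba_difference, mquant in headers:
        address += mba_difference
        if not 1 <= address <= 33:
            raise ValueError("macroblock address outside 1..33")
        if mquant is not None:
            quantizer = mquant
        resolved.append((address, quantizer))
    return resolved


# Three transmitted macroblocks; macroblocks 2 and 3 were skipped.
resolved = resolve_macroblocks([(1, None), (3, 12), (1, None)], gquant=8)
print(resolved)  # [(1, 8), (4, 12), (5, 12)]
```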
The CBP field specifies the macroblock's constituent “blocks” for which the macroblock field contains data. There are at most six such blocks. The first four represent the luminance (Y) information from respective segments of a macroblock subregion divided as FIG. 11's left rectangle illustrates. The fifth and sixth block fields represent more-sparsely sampled blue (CB) and red (CR) color-difference values covering the whole macroblock region, as FIG. 11's center and right rectangles indicate. Each block field's contents are coefficients of an 8×8 discrete cosine transform of the data that remain after any subtraction by previous-image data.
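As an illustration of how a coded block pattern selects among the six blocks, the following Python sketch maps an already-decoded pattern number to block labels. It assumes the H.261 weighting in which the pattern number equals 32·P1 + 16·P2 + 8·P3 + 4·P4 + 2·P5 + P6, with P1 to P4 denoting the luminance blocks and P5 and P6 the CB and CR blocks; the labels and the function name are illustrative.

```python
BLOCK_LABELS = ("Y1", "Y2", "Y3", "Y4", "CB", "CR")


def coded_blocks(pattern_number):
    """List the blocks for which a macroblock carries coefficient data.

    Assumes the weighting 32*P1 + 16*P2 + 8*P3 + 4*P4 + 2*P5 + P6.
    """
    if not 0 <= pattern_number <= 63:
        raise ValueError("pattern number must fit in six bits")
    return [
        label
        for index, label in enumerate(BLOCK_LABELS)
        if pattern_number & (32 >> index)
    ]


print(coded_blocks(60))  # ['Y1', 'Y2', 'Y3', 'Y4']
print(coded_blocks(3))   # ['CB', 'CR']
```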
Continuous Presence designed in accordance with the H.261 recommendation allows for several sites to be seen simultaneously on one screen. In one example, it takes advantage of the following characteristics of the H.261 data stream and the H.323 endpoints. The H.261 recommendation specifies that two picture scanning formats (i.e., CIF and QCIF) may be used by the codec. Notably, the H.323 end-points could send pictures in QCIF while receiving CIF pictures. Thus, a multipoint control unit (MCU) could select the four most appropriate sites, receive the QCIF pictures from those sites, form a composite picture from them, and distribute the composite picture to the conferencing end-points. In this manner, four conferencing sites may be viewed on the screen.
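The text does not spell out how the composite picture is formed. One possibility that follows from the GOB layouts of the two formats is to renumber each QCIF picture's three GOBs so that they land in one quadrant of the CIF picture. The Python sketch below shows only that renumbering, under the assumption that the CIF GOBs are numbered 1 to 12 in two columns and that quadrants are indexed 0 (top left), 1 (top right), 2 (bottom left), and 3 (bottom right). Whether renumbering alone suffices depends on coding details not covered here.

```python
def composite_group_number(quadrant, qcif_gn):
    """Map a QCIF group number (1, 3, 5) to the CIF group number that
    covers the same area inside the chosen quadrant (0..3).

    Assumed quadrant order: 0 top-left, 1 top-right, 2 bottom-left,
    3 bottom-right.
    """
    if quadrant not in (0, 1, 2, 3):
        raise ValueError("quadrant must be 0..3")
    if qcif_gn not in (1, 3, 5):
        raise ValueError("QCIF uses group numbers 1, 3 and 5")
    quadrant_row, quadrant_col = divmod(quadrant, 2)
    return qcif_gn + quadrant_col + 6 * quadrant_row


mapping = {
    quadrant: [composite_group_number(quadrant, gn) for gn in (1, 3, 5)]
    for quadrant in range(4)
}
print(mapping)  # {0: [1, 3, 5], 1: [2, 4, 6], 2: [7, 9, 11], 3: [8, 10, 12]}
```

Under these assumptions, the four quadrants together account for each of the twelve CIF group numbers exactly once.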
However, the sites transmitting the QCIF pictures are usually not within the control of the MCU. Stated differently, the terminals at the sites operate independently and transmit pictures at different rates according to their preferences. Also, the burstiness of the video packets, perhaps due to network jitter, may cause multiple pictures to arrive at the MCU at a time from the sites.
In accordance with the invention, the MCU has the ability to handle different picture rates from end-points. As the QCIF pictures are completed for the sites to be viewed, they are queued in their respective queues in the memory of the MCU. Each queue represents a quadrant of the CIF picture. The queues have respective put pointers and get pointers that indicate where in the queue a QCIF picture is to be placed and retrieved respectively. Because one or more queues may receive pictures at different rates, the MCU employs a thread of execution, called a “put thread,” to accommodate these different rates. Each time the put thread “wakes up,” it determines whether any complete QCIF pictures have been assembled since the last thread activation and, if any have, the put thread stores each picture in its respective queue at the position indicated by the respective put pointer and then updates the pointer. Another thread, the get thread, controls retrieval from the queues, and it is typically activated at a different rate. For instance, if the composite picture is transmitted at 30 pictures/sec., the get thread wakes up 30 times/sec. to determine if there are pictures in the queues to be formed into a composite picture. The get thread retrieves the pictures from their queues at intervals that correspond to their transmission rates.
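The queue arrangement described above can be sketched in Python as a fixed-size ring buffer per quadrant. This is a single-threaded illustration of the pointer handling only: the class and function names, the queue capacity, and the behavior when a queue is full or empty are assumptions, and a real put thread and get thread would need locking around the shared queues.

```python
class PictureQueue:
    """Ring buffer with a put pointer and a get pointer (illustrative)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.put_ptr = 0
        self.get_ptr = 0
        self.count = 0

    def put(self, picture):
        if self.count == len(self.slots):
            return False  # queue full; the policy here is an assumption
        self.slots[self.put_ptr] = picture
        self.put_ptr = (self.put_ptr + 1) % len(self.slots)
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None  # nothing queued for this quadrant
        picture = self.slots[self.get_ptr]
        self.get_ptr = (self.get_ptr + 1) % len(self.slots)
        self.count -= 1
        return picture


def put_thread_wakeup(queues, completed):
    """Store every (quadrant, picture) completed since the last wake-up."""
    for quadrant, picture in completed:
        queues[quadrant].put(picture)


def get_thread_wakeup(queues):
    """Fetch at most one picture per quadrant for the next composite."""
    return [queue.get() for queue in queues]


queues = [PictureQueue() for _ in range(4)]
put_thread_wakeup(queues, [(0, "a1"), (0, "a2"), (2, "c1")])
first = get_thread_wakeup(queues)
second = get_thread_wakeup(queues)
```

In this run quadrant 0 received two pictures in one burst, so the first retrieval returns "a1" and "c1" and the second returns "a2", with empty quadrants reported as None.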
For the MCU to be able to handle these different rates, it monitors the incoming packets of each quadrant for temporal information. From the information, the MCU knows the picture rates for each quadrant and programs the get thread to retrieve pictures from their respective queues according to their transmission rates.
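One way the picture rate might be derived from temporal information is sketched below in Python. It assumes that the temporal information is the RTP timestamp of successive pictures and that the timestamp clock runs at 90 kHz, as RFC 2032 specifies for H.261; the source text does not say which temporal field is monitored. The function names and the rounding policy for the retrieval interval are hypothetical.

```python
def estimate_picture_rate(timestamps, clock_hz=90000):
    """Estimate pictures per second from timestamps of successive pictures.

    Differences are taken modulo 2**32 so a timestamp wraparound is tolerated.
    """
    if len(timestamps) < 2:
        return None
    deltas = [(b - a) % 2**32 for a, b in zip(timestamps, timestamps[1:])]
    mean_delta = sum(deltas) / len(deltas)
    return clock_hz / mean_delta if mean_delta else None


def wakeups_per_retrieval(composite_rate, picture_rate):
    """How many get-thread wake-ups pass between retrievals from one queue."""
    return max(1, round(composite_rate / picture_rate))


rate = estimate_picture_rate([0, 6000, 12000, 18000])
interval = wakeups_per_retrieval(30, rate)
print(rate, interval)  # 15.0 2
```

With a composite rate of 30 pictures/sec., a quadrant whose source sends 15 pictures/sec. would thus be read on every second wake-up of the get thread.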