1. Field of the Invention
This invention relates to video teleconferencing, and in particular, to complying with the maximum-transmission-unit size supported by the underlying transport mechanism.
2. Background Information
A video teleconference, as its name implies, is a conference in which several audio-visual terminals located remotely from each other participate. In one instance, the videoconferencing system allows for the simultaneous exchange of video, audio, and other data between terminals. As FIG. 1 shows, an example of such a system is a plurality of interconnected terminals 11, 12, 15, and 16. For the sake of example, the drawing shows the transmission medium as including an Integrated Services Digital Network (ISDN) and a Transmission Control Protocol/Internet Protocol (TCP/IP) network. In other words, videoconferencing can be performed by way of packet-switched networks as well as circuit-switched networks. A gateway 22 translates between protocols in the example.
A multipoint control unit (MCU) 20 receives signals from the various terminals, processes these signals into a form suitable for video teleconferencing, and re-transmits the processed signals to the appropriate terminals. For example, the video signals from the various terminals may be spatially mixed to form a composite video signal that, when it is decoded, may display the various teleconference participants in one terminal. Usually, each terminal has a codec to encode video, audio and/or data signals to send to the MCU for appropriate distribution and to decode such signals from the MCU. Codecs for this purpose are well known in the art and are exemplified, for instance, in the International Telecommunication Union (ITU) Telecommunication Standardization Sector recommendation document H.261 (ITU-T Recommendation H.261).
The Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) is responsible for standardizing the technical aspects of telecommunication on a worldwide basis. Its H-series recommendations concern video teleconferencing. Among other H-series recommendations, H.221 defines frame structure, H.261 defines video coding and decoding, H.231 defines multipoint control units (MCUs), H.320 defines audio-visual terminals, and H.323 defines audio-visual terminals that do not provide a guaranteed quality of service. How the various devices in the video teleconferencing system interact with each other using the various recommendations is now briefly described.
The H.320 terminals employed in the system transmit H.221 frames of multiplexed audio-video and data information. (These frames should not be confused with video frames, which we will hereafter refer to as “pictures” to distinguish them from transmission frames.) Each frame consists of one or more channels, each of which comprises 80 octets of bits, and each of the 8 octet bit positions can be thought of as a separate sub-channel within the frame. In general, certain bits of a given octet will contain video information, certain bits will contain audio information, and certain bits may contain data, as FIG. 2's first row illustrates. Additionally, the eighth bit in certain of a frame's octets (not shown in the drawings) represents control information by which, among other things, frame boundaries can be recognized. The precise bit allocation is determined through a session negotiation process among the involved video teleconferencing terminals.
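The sub-channel view of a frame can be illustrated with a short sketch. It assumes that bit position 1 is the most significant bit of each octet and uses a purely hypothetical allocation (`EXAMPLE_ALLOCATION`); real allocations are negotiated per session, and the function name `demux_frame` is illustrative only.

```python
from typing import Dict, List

OCTETS_PER_FRAME = 80  # octets per channel frame, as stated in the text


def demux_frame(frame: bytes, allocation: Dict[int, str]) -> Dict[str, List[int]]:
    """Split one 80-octet frame into per-medium bit lists.

    `allocation` maps a bit position (1..8) to a medium name. Position 1 is
    taken to be the most significant bit, which is an assumption of this sketch.
    """
    if len(frame) != OCTETS_PER_FRAME:
        raise ValueError("expected exactly 80 octets")
    streams: Dict[str, List[int]] = {name: [] for name in set(allocation.values())}
    for octet in frame:
        for position in range(1, 9):
            bit = (octet >> (8 - position)) & 1
            streams[allocation[position]].append(bit)
    return streams


# Hypothetical allocation, for illustration only.
EXAMPLE_ALLOCATION = {
    1: "audio", 2: "audio",
    3: "video", 4: "video", 5: "video", 6: "video",
    7: "data",
    8: "control",
}
```

With this allocation, each frame yields 160 audio bits, 320 video bits, 80 data bits, and 80 bits in the eighth position.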
The H.323 terminals employed in the system use the Real-time Transport Protocol (RTP), known to one skilled in the art and set forth in Request For Comments (RFC) 1889. RFCs are published by the Internet Engineering Task Force (IETF), a community dedicated to standardizing various aspects of the Internet. An H.323 terminal uses separate RTP sessions to communicate the conference's video and audio portions. Thus, as FIG. 2's first through third rows show, a gateway's translation from H.221 to RTP involves demultiplexing the H.221 data stream into its video, audio, and data constituents so that the gateway can packetize the video, audio, and data separately. In particular, video bits are extracted from a succession of octets and concentrated into a stream that contains only the H.221 transmission's video parts. The stream is encoded in accordance with the H.261 recommendation at the terminal using a codec. Note that the encoding may instead be in accordance with the related H.263 recommendation. However, the H.261 recommendation will generally be focused on here.
FIG. 3 illustrates a typical link layer packet suitable for transmission in accordance with the RTP protocol. If Ethernet is used for the link layer, information is sent in an Ethernet frame that begins and ends with an Ethernet header and trailer, which are used for sending the information to the next stop on the same local network. The frame's contents are an IP datagram, which includes its own header, specified in RFC 791, for directing the datagram to its ultimate internetwork address. In video conference situations, RTP permits TCP to be used as the transport protocol (i.e., as the protocol for directing the information to the desired application at the destination internet address). However, the User Datagram Protocol (UDP) is preferable to TCP for videoconferencing because TCP's re-transmission of lost video streams is unnecessary under these situations. Thus, FIG. 3 depicts the IP payload as a UDP datagram that includes a UDP header as specified in RFC 768.
Because packet-switched protocol data units do not in general arrive in order, and because real-time information must be presented in a predetermined time sequence, the UDP payload must include information specifying the sequence in which the information was sent and its real-time relationship to other packets. So the payload begins with an RTP header, specified in RFC 1889, that gives this and other information.
The RTP header format, depicted in FIG. 4, is shown as successive four-byte rows. RFC 1889 describes the purposes of the various FIG. 4 fields in detail, so only the timestamp field is mentioned here. When information travels by way of a packet-switched network, different constituent packets make their ways to their common destination independently. That is, different packets can take different routes, so the times required for different packets to arrive at their respective destinations are not in general the same, and packets can arrive out of sequence or in time relationships that otherwise differ from those with which their contained information was generated. RTP therefore provides for a timestamp in each packet to indicate the real-time relationships with which the information is to be played. Typically, gateways and H.323 devices (e.g., terminals and MCUs) use a local clock to provide the RTP-required timestamp as they assemble H.261 packets.
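As an illustration of the header's four-byte rows, the following minimal sketch reads the fixed 12-byte portion of an RTP header as laid out in RFC 1889. The names `RtpHeader` and `parse_rtp_header` are illustrative, and optional CSRC entries and header extensions are ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header (three four-byte rows)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    # Row 1: V/P/X/CC byte, M/PT byte, 16-bit sequence number.
    # Row 2: 32-bit timestamp. Row 3: 32-bit SSRC identifier.
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,
        marker=second >> 7,
        payload_type=second & 0x7F,
        sequence_number=sequence,
        timestamp=timestamp,
        ssrc=ssrc,
    )
```

A receiver could sort packets by `sequence_number` to restore sending order and use `timestamp` to schedule playback.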
However, it would be complicated to play the resultant timestamped information if no notice were taken of the actual contents of the data stream being packetized. For example, a single packet could contain parts of two different video pictures, so parts of different pictures would share one timestamp, while parts of the same picture carried in different packets would have different timestamps. To avoid this, the packets need to be monitored for picture boundaries.
FIG. 2's fourth through seventh rows depict the structure that the incoming data stream uses to represent successive video pictures in accordance with H.261. The fourth row illustrates a data-stream portion covering a single video picture. It shows that the portion begins with a header, and FIG. 5 illustrates that header's structure.
The header field of importance here is the Picture Start Code (PSC). For H.261 streams, that field value is always 00010H, a sequence that cannot occur elsewhere in the data stream. If the length of a single-picture portion of the data stream exceeds the underlying protocol's maximum-transmission-unit size, the H.323 device breaks the single picture's data into multiple packets. For such packets, the timestamp entered is the same as that assigned to the last PSC-containing packet. In those instances, RFCs such as RFC 2032, entitled “RTP Payload Format for H.261 Video Streams,” and RFC 2190, entitled “RTP Payload Format for H.263 Video Streams,” both of whose contents are well known to those skilled in this art, specify how the picture's data should be packetized. Packetization can be appreciated by first reviewing the picture data's finer structure.
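The two ideas in this paragraph, locating picture boundaries and reusing the timestamp of the last PSC-containing packet, can be sketched as follows. This is a minimal illustration rather than a production bitstream parser: it scans a bit string for the 20-bit PSC (00010H), and the function names and the `(has_psc, local_clock)` packet description are assumptions of the sketch.

```python
from typing import Iterable, List, Tuple

# 20-bit Picture Start Code, 00010H.
PSC_BITS = "0000" "0000" "0000" "0001" "0000"


def find_picture_starts(stream: bytes) -> List[int]:
    """Return the bit offsets at which a PSC begins (not necessarily byte aligned)."""
    bits = "".join(f"{byte:08b}" for byte in stream)
    offsets = []
    position = bits.find(PSC_BITS)
    while position != -1:
        offsets.append(position)
        position = bits.find(PSC_BITS, position + 1)
    return offsets


def assign_timestamps(packets: Iterable[Tuple[bool, int]]) -> List[int]:
    """Give each packet the timestamp of the last PSC-containing packet.

    Each item is (has_psc, local_clock); the clock value is used only when
    the packet starts a new picture (or when no picture has been seen yet).
    """
    stamps = []
    current = None
    for has_psc, local_clock in packets:
        if has_psc or current is None:
            current = local_clock
        stamps.append(current)
    return stamps
```

For example, four packets of which the first and last contain a PSC would receive two distinct timestamps, with the middle two packets repeating the first.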
As FIG. 2's fourth row indicates, the picture data's body portion is divided into “groups of blocks” (GOBs). H.261 specifies a Common Intermediate Format (CIF) in which each GOB represents one-twelfth of the resultant picture area, in a spatial relationship that FIG. 6 illustrates. H.261 also specifies an alternative, more-sparsely sampled quarter-CIF (QCIF) format. When QCIF is employed, each GOB represents one-third of the total picture area, as FIG. 7 illustrates.
FIG. 2's fourth row depicts the GOB fields as being unequal in length. This is because the degree of H.261-specified data compression depends on the source picture's data redundancy, which can differ from region to region.
FIG. 2's fifth row shows that each GOB field has its own header, and FIG. 8 illustrates a GOB header's structure. The GOB header begins with a Group-of-Blocks Start Code (GBSC). That code's value is 0001H, a sequence that cannot occur elsewhere (except in the PSC).
The GOB's Group Number (GN in FIG. 8) follows the GBSC code and specifies the GOB region's position in accordance with the scheme shown in FIG. 6 or FIG. 7. Next is a default quantization value GQUANT, which influences the contained data's interpretation by specifying the magnitude intervals at which the values were quantized. The header may additionally contain further, optional fields. FIG. 2's fifth row shows that a GOB is divided into so-called macroblocks, which correspond to subregions within the GOB regions. FIG. 9 illustrates a single-GOB picture segment's division into subregions represented by respective macroblocks. Although there are thirty-three such subregions in a GOB-represented region, FIG. 2 depicts somewhat fewer macroblocks than that, because macroblocks that are redundant in view of previous macroblocks can be omitted in accordance with H.261. (As those familiar with the H.261 specification will recognize, “previous” may have either a temporal or a spatial meaning; that specification admits of a variety of data-compression techniques.)
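The leading GOB header fields can be read with a few lines of code. The sketch below works on a bit string for clarity and assumes the H.261 field widths of 16 bits for GBSC, 4 bits for GN, and 5 bits for GQUANT; the optional fields that may follow are not parsed, and the function name is illustrative.

```python
from typing import Dict

GBSC_BITS = "0000000000000001"  # 16-bit Group-of-Blocks Start Code, 0001H


def parse_gob_header(bits: str) -> Dict[str, int]:
    """Read GN and GQUANT from a bit string that starts at a GBSC."""
    if len(bits) < 25:
        raise ValueError("need at least 25 bits for GBSC, GN and GQUANT")
    if not bits.startswith(GBSC_BITS):
        raise ValueError("bit string does not begin with a GBSC")
    group_number = int(bits[16:20], 2)
    if group_number == 0:
        # The start code followed by four zero bits is the PSC, not a GOB header.
        raise ValueError("start code belongs to a picture header")
    gquant = int(bits[20:25], 2)
    return {"gn": group_number, "gquant": gquant}
```

The check on a zero group number reflects the remark above that the GBSC bit pattern also appears inside the PSC.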
FIG. 2's sixth row shows that each macroblock has its own header, and FIG. 10 illustrates that header's structure. The header's MacroBlock Address (MBA) field contains a variable-length code for the difference between the current macroblock's address and that of the previously sent GOB's block (since not all macroblocks are sent for every GOB). The MTYPE field specifies the manner in which the current macroblock's data were encoded; the data may be the result of comparing the raw data with a neighbor macroblock's data, with the corresponding data from a previous picture, with filtered versions of either of them, etc. If an MQUANT field is present, its contents supersede the default quantization that the GQUANT field in the enclosing GOB's header specifies.
The CBP field specifies the macroblock's constituent “blocks” for which the macroblock field contains data. There are at most six such blocks. The first four represent the luminance (Y) information from respective segments of a macroblock subregion divided as FIG. 11's left rectangle illustrates. The fifth and sixth block fields represent more-sparsely sampled blue (CB) and red (CR) color-difference values covering the whole macroblock region, as FIG. 11's center and right rectangles indicate. Each block field's contents are coefficients of an 8×8 discrete cosine transform of the data that remain after any subtraction by previous-image data.
The RTP specification suggests that the RTP packets sent be smaller than the largest packet supported by the underlying transport mechanism. For UDP/IP over Ethernet, 1500 bytes per packet is typically set for an efficient packet send rate and minimal packet overhead. Various terminals take this as a maximum size. The mentioned RFCs (i.e., RFC 2032 and RFC 2190) comment that where a video picture is too large for a packet, it shall be broken at a GOB or a macroblock boundary. In instances where a GOB itself is larger than 1500 bytes, the RFCs suggest that the GOB be broken at a macroblock boundary. An end-point such as a terminal needs only a codec in order to perform the macroblock fragmentation. Because the codec knows where each macroblock starts and ends, and its size, during the encoding of the video stream, macroblock parsing during packetization can be performed as a natural outcome of the encoding process. Macroblock parsers are known and may be constructed in accordance with RFC 2032 or RFC 2190.
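The size constraint can be made concrete with a small sketch. It assumes an IPv4 header without options (20 bytes), a UDP header (8 bytes), a fixed RTP header without CSRC entries (12 bytes), and the 4-byte H.261 payload header of RFC 2032. GOB sizes are given in whole bytes for simplicity, and the name `pack_gobs` is illustrative only.

```python
from typing import List

ETHERNET_MTU = 1500
# Assumed header sizes: IPv4 (no options), UDP, fixed RTP header, H.261 payload header.
MAX_PAYLOAD = ETHERNET_MTU - 20 - 8 - 12 - 4  # 1456 bytes left for video data


def pack_gobs(gob_sizes: List[int], max_payload: int = MAX_PAYLOAD) -> List[List[int]]:
    """Greedily group consecutive GOB indices so each group fits in one packet.

    A GOB that alone exceeds `max_payload` ends up in a group by itself; such a
    GOB would still have to be broken at a macroblock boundary.
    """
    packets: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(gob_sizes):
        if current and used + size > max_payload:
            packets.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        packets.append(current)
    return packets
```

In this sketch a 2000-byte GOB is isolated in its own group, which marks it as the case that requires macroblock-level fragmentation.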
However, macroblock parsing may not be that easily implemented in other devices, for example, an MCU. The MCU needs to first decode the packets into their video stream constituents before it is able to perform similar macroblock parsing (when encoding). However, the MCU performs numerous tasks within a video teleconferencing system, including the multitasking of a plurality of concurrent conferences. Accordingly, the decoding and macroblock parsing of video streams burdens and degrades the MCU's processing performance. In some devices, the processing overhead may be unacceptable.
According to the present invention, though, macroblock fragmentation is provided without such processing overhead. Instead of using a macroblock parser to parse at macroblock boundaries in instances where the GOB is larger than the maximum-transmission-unit size, the incoming packets are monitored for partial GOBs. A partial GOB is a portion of a GOB that was previously parsed into portions by a device for compliance. The partial GOB is detected by monitoring the GOB number (GOBN) field in the packet's H.261 header. If the GOBN value is zero, this indicates that the preceding GOB is complete. However, if the GOBN value is non-zero, the GOB has been parsed at a macroblock boundary upstream. Pertinent information pertaining to the macroblock fragmentation is then retrieved and stored for future use. For instance, because the current GOB is a continuation of the last-read GOB (assuming that the packets were properly sequenced), the bit count of the last GOB is stored along with the values of the MBAP field, QUANT field, HMVD field, and VMVD field, also located in the H.261 header. The pertinent partial GOBs are then combined to form a complete GOB. Subsequently, when this complete GOB needs to be parsed again for transmission, the stored information, that is, the bit count, the MBAP value, the QUANT value, the HMVD value, and the VMVD value, is retrieved to parse the complete GOB back to its previous state (i.e., partial GOBs). In this manner, macroblock parsing is performed without the use of a macroblock parser, which eliminates the processing overhead associated with the macroblock parser.
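The approach can be pictured with the following minimal sketch, which is one possible reading of the paragraph above and not a definitive implementation. It assumes the 32-bit H.261 payload header layout of RFC 2032 (SBIT, EBIT, I, V, GOBN, MBAP, QUANT, HMVD, VMVD), treats the stored bit count as the number of bits of the GOB received before a continuation packet, and represents GOB data as a bit string. The names `SplitPoint`, `note_split`, and `resplit` are hypothetical.

```python
import struct
from typing import List, NamedTuple, Optional


class H261Header(NamedTuple):
    sbit: int
    ebit: int
    intra: int
    mv_flag: int
    gobn: int
    mbap: int
    quant: int
    hmvd: int
    vmvd: int


class SplitPoint(NamedTuple):
    bit_count: int  # bits of the GOB received before the continuation (assumed meaning)
    mbap: int
    quant: int
    hmvd: int
    vmvd: int


def parse_h261_header(payload: bytes) -> H261Header:
    """Unpack the 4-byte H.261 payload header (layout per RFC 2032)."""
    (word,) = struct.unpack("!I", payload[:4])
    return H261Header(
        sbit=(word >> 29) & 0x7, ebit=(word >> 26) & 0x7,
        intra=(word >> 25) & 0x1, mv_flag=(word >> 24) & 0x1,
        gobn=(word >> 20) & 0xF, mbap=(word >> 15) & 0x1F,
        quant=(word >> 10) & 0x1F, hmvd=(word >> 5) & 0x1F,
        vmvd=word & 0x1F,
    )


def note_split(header: H261Header, bits_so_far: int) -> Optional[SplitPoint]:
    """Record fragmentation data when a packet continues a GOB (GOBN non-zero)."""
    if header.gobn == 0:
        return None  # packet starts at a GOB header; nothing to record
    return SplitPoint(bits_so_far, header.mbap, header.quant, header.hmvd, header.vmvd)


def resplit(gob_bits: str, splits: List[SplitPoint]) -> List[str]:
    """Cut a recombined GOB back into its earlier portions using stored bit counts."""
    pieces, start = [], 0
    for split in splits:
        pieces.append(gob_bits[start:split.bit_count])
        start = split.bit_count
    pieces.append(gob_bits[start:])
    return pieces
```

Because the split positions and header values are simply replayed, no macroblock boundaries have to be located when the GOB is fragmented again for transmission.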