This invention relates to digital video signal processing, and more particularly to the combining of plural digitally coded video input signals on a real time basis to produce an output digital video signal which, when decoded, merges the plural input signals into a single video image.
Video-conferencing through ISDN or Ethernet is becoming increasingly popular. With the widespread availability of Desktop Video Conferencing systems, people can dial up and see a remote party using their own PCs or workstations. However, currently available video-conferencing systems are point-to-point and lack the capability of providing multipoint video-conferencing. A multipoint video-conference emulates a real conference more closely, hence the evolution from two-point to multipoint is a natural trend. Since the video signals are transmitted in coded digital compressed format, in order to set up multipoint connections, a multipoint control unit (MCU) is needed to handle all the coded video bit-streams which the participants' codecs generate. In first generation MCUs, only one received video bit-stream is selected and transmitted to each participant, based either on the audio signal level or on "chairman" switching controls. This is referred to as the "switched presence" MCU and has been standardized recently by the CCITT Study Group XV (now ITU-T SG15) (CCITT Study Group XV--Report R 93, Draft new Recommendation H.231, Multipoint control units for audiovisual systems using digital channels up to 2 Mbit/s, May 1992; CCITT Study Group XV--Report R 94, Draft new Recommendation H.243, Procedures for establishing communication between three or more audiovisual terminals using digital channels up to 2 Mbit/s, May 1992). In such MCUs, video data does not have to be processed, making realization easier.
In many situations viewing only one location at a time is too restrictive and it is desirable to see multiple parties all the time on a real time basis using a split-screen. For example, it is advantageous for a teacher lecturing to multiple remote students/classrooms to see each student/classroom simultaneously rather than just one student/classroom at a time. Also, in a multiparty video conference, it is preferable from an information flow standpoint for the parties to see multiple selected participants simultaneously on a split-screen rather than just one party whose image fills the entire screen. An MCU that combines multiple signals is referred to as a "continuous presence" MCU. One way that a continuous presence MCU can operate is by performing pel (picture element)-domain video mixing, hereafter referred to as "transcoding." In transcoding, coded video sources are fully decoded and combined in the pel domain. The resultant compound picture is encoded again and distributed to each participant. Such decoding, combining in the pel domain, and encoding introduces additional delay and signal degradation, and results in significant codec expense at the MCU. Specifically, the CCITT standard H.320 terminal used for point-to-point video-conferencing, which would also be used for multiparty video-conferencing, incorporates G-series audio coders (G.711, G.722 or G.723) for coding the audio signal; an H.261 video codec for coding the video signal into a signal having a rate of p×64 kbit/s; an H.230/H.242 end-to-end signaling and call setup protocol processor which informs the opposite end of the capabilities of the transmitter (e.g., maximum allowable frame rate); and an H.221 multiplexer which multiplexes the outputs of the G-series audio coder, the H.261 codec and the H.230/H.242 protocol processor into one bit stream.
In order for a continuous presence MCU to combine, in the pel domain, plural inputs received from H.320 terminals, it would require that same plural number of terminals to decode each signal, buffers to store each input, a processor to synthesize a new picture from the inputs, and another H.320 terminal to transmit the merged coded video signal back to each of the parties. Since the cost of an H.320 terminal is today on the order of $20,000, such an MCU would likely be quite expensive and would further impose the aforenoted picture degradation and delay.
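The picture-synthesis step of such a pel-domain MCU can be illustrated with a minimal sketch. It assumes the four inputs have already been decoded into QCIF luminance frames held as lists of rows, and that they are placed in the order top-left, top-right, bottom-left, bottom-right; the function name `compose_cif` and this quadrant order are illustrative assumptions, not taken from the text. The decoding before this step and the re-encoding after it, which account for the delay, degradation, and cost noted above, are not shown.

```python
# Sketch only: pel-domain composition of four decoded QCIF luminance
# frames into one CIF frame. The name and quadrant order are assumptions.
QCIF_W, QCIF_H = 176, 144   # pels per line, scan lines
CIF_W, CIF_H = 352, 288


def compose_cif(quadrants):
    """Tile four QCIF frames (lists of rows) into a 2x2 CIF frame.

    quadrants: [top_left, top_right, bottom_left, bottom_right]
    """
    if len(quadrants) != 4:
        raise ValueError("exactly four QCIF frames are required")
    for frame in quadrants:
        if len(frame) != QCIF_H or any(len(row) != QCIF_W for row in frame):
            raise ValueError("each input must be 176 pels x 144 lines")

    cif = []
    for first in (0, 2):                      # top pair, then bottom pair
        left, right = quadrants[first], quadrants[first + 1]
        for y in range(QCIF_H):
            cif.append(left[y] + right[y])    # join the two half-lines
    return cif
```

In a real transcoding MCU the two chrominance components would be tiled the same way at their own resolution, and the composed frame would then be handed to an H.261 encoder.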
An alternative to transcoding is coded-domain video combining, hereafter referred to as "combining." Video combining in the coded domain advantageously offers shorter end-to-end delay, better picture quality, and lower MCU cost. Real-time video combining in the coded domain is possible if the incoming video bit-streams follow the syntax of the H.261 standard (CCITT Study Group XV--Report R 95, Draft revised Recommendation H.261, Video Codec for Audiovisual Services at p×64 kbit/s, May 1992). In accordance with the H.261 syntax, the top two layers of which are shown in FIG. 1, an H.261 coded bit stream is composed of pictures (video frames), each of which starts with a picture start code (PSC), followed by a temporal reference number (TR) that indicates a frame sequence number and then, after several other code words, by several Groups of Blocks (GOBs) of data. An H.261 coder can transmit a coded picture in either a Quarter Common Intermediate Format (QCIF) consisting of 176 horizontal pels × 144 scan lines, or in a Common Intermediate Format (CIF) consisting of 352 horizontal pels × 288 scan lines. A QCIF coded picture consists of three GOBs, numbered 1, 3, and 5, as shown in FIG. 2, and a CIF coded picture consists of twelve GOBs, numbered 1-12, as shown in FIG. 3. Each GOB consists of eleven horizontal by three vertical macro blocks of pel data, wherein each macro block consists of four luminance blocks and two chrominance blocks. Each block consists of 8×8 pels. Four QCIF data streams from up to four video-conference participants can thus be combined into one CIF as shown in FIG. 4. Thus, GOBs 1, 3, and 5, which are sequentially inputted from each of the QCIF inputs, QCIF I, QCIF II, QCIF III, and QCIF IV, can be renumbered with GOB numbers 1-12, as shown, and outputted sequentially. In FIG. 4, Sij (1 ≤ i ≤ 4, 1 ≤ j ≤ 3) designates the size of the jth coded GOB within QCIF i.
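The GOB renumbering can be sketched as a small lookup. In an H.261 CIF picture the twelve GOBs lie two across and six down, numbered in raster order, so the left column holds the odd numbers and the right column the even ones. The sketch assumes that QCIF I, II, III, and IV occupy the top-left, top-right, bottom-left, and bottom-right quadrants respectively; FIG. 4 is not reproduced here, so that assignment and the function names are assumptions made for illustration.

```python
# Sketch only: map a QCIF GOB number to its CIF GOB number.
# The quadrant assignment (I..IV = TL, TR, BL, BR) is an assumption.
QCIF_GOBS = (1, 3, 5)


def cif_gob_number(qcif_index, qcif_gob):
    """qcif_index: 1..4 for QCIF I..IV; qcif_gob: 1, 3, or 5."""
    if qcif_index not in (1, 2, 3, 4):
        raise ValueError("qcif_index must be 1..4")
    if qcif_gob not in QCIF_GOBS:
        raise ValueError("a QCIF picture carries only GOBs 1, 3, and 5")
    column = (qcif_index - 1) % 2        # 0 = left half, 1 = right half
    lower_half = (qcif_index - 1) // 2   # 0 = top half, 1 = bottom half
    return qcif_gob + column + 6 * lower_half


def output_order():
    """(qcif_index, qcif_gob) pairs in the order CIF GOBs 1..12 are sent."""
    pairs = [(i, g) for i in (1, 2, 3, 4) for g in QCIF_GOBS]
    return sorted(pairs, key=lambda pair: cif_gob_number(*pair))


for i in (1, 2, 3, 4):
    print(i, [cif_gob_number(i, g) for g in QCIF_GOBS])
# 1 [1, 3, 5]
# 2 [2, 4, 6]
# 3 [7, 9, 11]
# 4 [8, 10, 12]
```

Under this assumed layout the output order interleaves the inputs: the first GOB of QCIF I, then the first GOB of QCIF II, then the second GOB of QCIF I, and so on. The combiner therefore needs GOBs from two different inputs in alternation rather than one whole input picture at a time.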
The combined coded video signal is then sent back to all the video-conference participants. After decoding by a standard H.320 terminal, each participant simultaneously sees up to four conferees, which may include themselves, on a 2×2 split-screen.
Although combining four QCIF inputs to form a merged CIF output appears to be a straightforward process, in fact it is not, because of the characteristics of the coded signal inputs. Since the incoming GOBs in the QCIF inputs are of variable length due to variable-length encoding of the input video signals (i.e., Sij varies from frame to frame, from QCIF to QCIF, and within each QCIF), buffering of the input signals is necessary to store arriving GOBs until they are needed in the combined output signal. Delay problems can arise when, in forming the CIF output, an input GOB is needed that has not in fact been fully received. Furthermore, and very significantly, depending on the pictorial complexity of a video input, a participant's terminal may in fact not transmit each video frame of data. The participants' H.320 terminals share a common maximum allowable transmitting frame rate of 7.5, 10, 15 or 30 frames/sec, determined at call setup by the maximum allowable transmitting frame rate of the participant's terminal with the lowest frame rate capability. A complex video image, which generates significantly more bits in its variable-length coded bit stream, cannot be transmitted by the terminal at that maximum frame rate within the channel capacity of the data link to the MCU. Some video frames are thus dropped rather than transmitted. Since the four QCIF data input sequences are likely to have different complexities, and hence unequal frame rates, they cannot simply be combined in their input order if the combined output signal is to remain in frame synchronization.
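The frame-dropping effect can be made concrete with a rough calculation. The sketch below uses a deliberately simplified model: it assumes the whole p×64 kbit/s channel carries video (ignoring the audio and signaling share of the H.221 multiplex) and that a frame occupying more than one frame interval on the link causes the frames in the following intervals to be skipped. The function names and the model itself are illustrative assumptions, not the behavior of any particular terminal.

```python
# Sketch only: simplified model of frame skipping on a p x 64 kbit/s link.
# Assumes all channel capacity is available for video (an assumption).
import math
from fractions import Fraction


def frame_intervals_needed(coded_bits, p, max_frame_rate):
    """Frame intervals (at max_frame_rate) needed to send one coded frame."""
    channel_bps = p * 64000
    intervals = Fraction(coded_bits) * Fraction(max_frame_rate) / channel_bps
    return max(1, math.ceil(intervals))


def frames_skipped(coded_bits, p, max_frame_rate):
    """Frames that cannot be sent while this coded frame is on the link."""
    return frame_intervals_needed(coded_bits, p, max_frame_rate) - 1


# At p = 2 (128 kbit/s) and 15 frames/sec the average budget is about
# 8533 bits per frame.
print(frames_skipped(8000, 2, 15))    # 0: fits within one frame interval
print(frames_skipped(20000, 2, 15))   # 2: occupies three frame intervals
```

Two inputs coded at the same nominal frame rate can therefore deliver different numbers of frames over the same period, which is why the temporal reference of each input has to be taken into account when its GOBs are placed in the combined output.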
An object of the present invention is to combine coded multiple video signal inputs in real time and in such a manner that the merged output coded video signal maintains frame synchronization.
An additional object of the present invention is to combine coded multiple video signal inputs in real time with minimum delay through the MCU.