Transmission of moving pictures in real time is employed in several applications, e.g. video conferencing, net meetings, TV broadcasting and video telephony.
However, representing moving pictures requires a large amount of information, as digital video is typically described by representing each pixel in a picture with 8 bits (1 byte). Such uncompressed video data results in large bit volumes and cannot be transferred over conventional communication networks and transmission lines in real time due to limited bandwidth.
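As a rough illustration of the bit volumes involved, the following sketch computes the raw bit rate of uncompressed video at 8 bits per pixel. The function name, the 30 pictures per second rate and the choice of a 352×288 picture are assumptions made for the example only.

```python
def raw_bit_rate(width, height, bits_per_pixel=8, pictures_per_second=30):
    """Return the bit rate (bits per second) of uncompressed video.

    pictures_per_second=30 is an assumed value, not taken from the text.
    """
    return width * height * bits_per_pixel * pictures_per_second


# Assumed example: a 352x288 picture at 8 bits per pixel, 30 pictures per second.
cif_rate = raw_bit_rate(352, 288)
print(cif_rate)  # 24330240 bits per second, i.e. about 24.3 Mbit/s
```

Under these assumptions, even a modest picture size requires on the order of tens of megabits per second before any compression is applied.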
Thus, enabling real-time video transmission requires a high degree of data compression. Data compression may, however, compromise picture quality. Therefore, great efforts have been made to develop compression techniques allowing real-time transmission of high-quality video over bandwidth-limited data connections.
In video compression systems, the main goal is to represent the video information with as little capacity as possible. Capacity is measured in bits, either as a constant value or as bits per time unit. In both cases, the main goal is to reduce the number of bits.
The most common video coding method is described in the MPEG* and H.26* standards. The video data undergo four main processes before transmission, namely prediction, transformation, quantization and entropy coding.
The prediction process significantly reduces the number of bits required for each picture in a video sequence to be transferred. It takes advantage of the similarity of parts of the sequence with other parts of the sequence. Since the predictor part is known to both encoder and decoder, only the difference has to be transferred. This difference typically requires much less capacity for its representation. The prediction is mainly based on picture content from previously reconstructed pictures, where the location of the content is defined by motion vectors. The prediction process is typically performed on square blocks (e.g. 16×16 pixels).
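The prediction step described above can be sketched as follows. This is a minimal illustration, not the method of any particular standard: pictures are assumed to be lists of rows of integer pixel values, the helper names are hypothetical, and a tiny 2×2 block is used instead of 16×16 to keep the example short.

```python
def predict_block(reference, top, left, motion_vector, size):
    """Copy a size x size block from the reference picture, displaced
    from position (top, left) by motion_vector = (dy, dx)."""
    dy, dx = motion_vector
    return [row[left + dx:left + dx + size]
            for row in reference[top + dy:top + dy + size]]


def residual(current_block, predicted_block):
    """Difference between the current block and its prediction.
    Only this difference (plus the motion vector) needs to be transferred."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current_block, predicted_block)]


def reconstruct(predicted_block, residual_block):
    """Decoder side: prediction plus received difference."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predicted_block, residual_block)]


# Assumed toy data: a 4x4 previously reconstructed picture.
reference = [[r * 4 + c for c in range(4)] for r in range(4)]
# Block at (1, 1) in the current picture; its content has moved from (0, 0)
# in the reference picture and one pixel has changed slightly.
current_block = [[0, 2], [4, 5]]

predicted = predict_block(reference, top=1, left=1, motion_vector=(-1, -1), size=2)
diff = residual(current_block, predicted)
restored = reconstruct(predicted, diff)
```

Because the prediction already matches most of the block, the difference consists mostly of zeros, which is what makes it cheap to represent.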
Video conferencing systems also allow for simultaneous exchange of audio, video and data information among multiple conferencing sites. Systems known as multipoint control units (MCUs) perform switching functions to allow multiple sites to intercommunicate in a conference. The MCU links the sites together by receiving frames of conference signals from the sites, processing the received signals, and retransmitting the processed signals to appropriate sites. The conference signals include audio, video, data and control information. In a switched conference, the video signal from one of the conference sites, typically that of the loudest speaker, is broadcast to each of the participants. In a continuous presence conference, video signals from two or more sites are spatially mixed to form a composite video signal for viewing by conference participants. The continuous presence or composite image is a combined picture that may include live video streams, still images, menus or other visual images from participants in the conference.
In a typical continuous presence conference, the video display is divided into a composite layout having areas or regions (e.g., quadrants). Sites are selected at conference setup from the sites connected in the conference for display in the regions. Common composite layouts include four, nine or sixteen regions. The layout is selected and then fixed for the duration of the conference.
Some conference arrangements provide different composite signals or video mix such that each site may view a different mix of sites. Another arrangement uses voice activated quadrant selection to associate sites with particular quadrants. That arrangement enables conference participants to view not only fixed video mix sites, but also a site selected on the basis of voice activity. However, the layout in terms of number of regions or quadrants is fixed for the conference.
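One way to picture voice activated selection within a fixed layout is sketched below. The function name, the representation of audio activity as a number per site, and the rule of filling remaining regions with the loudest sites are all assumptions for illustration; the arrangements referred to above may select sites differently.

```python
def select_layout(fixed_sites, audio_levels, regions=4):
    """Fill a layout with a fixed number of regions.

    fixed_sites:  sites chosen at conference setup (kept in order).
    audio_levels: mapping of site name to a measured audio level (assumed).
    Remaining regions are given to the loudest sites not already shown.
    """
    layout = list(fixed_sites[:regions])
    candidates = sorted(
        (site for site in audio_levels if site not in layout),
        key=lambda site: audio_levels[site],
        reverse=True,
    )
    layout.extend(candidates[:regions - len(layout)])
    return layout


# Three fixed sites plus one quadrant chosen by voice activity.
levels = {"A": 0.1, "B": 0.2, "C": 0.0, "D": 0.3, "E": 0.9}
layout = select_layout(["A", "B", "C"], levels)
```

Note that the number of regions stays constant in this sketch; only the assignment of sites to regions changes.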
Referring now to FIG. 1, there is shown a schematic diagram of an embodiment of an MCU 10 of the type disclosed in U.S. Pat. No. 5,600,646, the disclosure of which is hereby expressly incorporated by reference. The MCU 10 also includes H.323 functionality as disclosed in U.S. Pat. No. 6,404,745, the disclosure of which is hereby also expressly incorporated by reference. In addition, video processing in the MCU has been enhanced, as will be described further herein. The features described herein for MCU 10 can be embodied in a Tandberg MCU.
The MCU 10 includes at least one Network Interface Unit (NIU) 120, at least one Bridge Processing Unit (BPU) 122, a Video Processing Unit (VPU) 124, a Data Processing Unit (DPU) 126, and a Host Processing Unit (HPU) 130. In addition to a host Industry Standard Architecture (ISA) control bus 132, the MCU 10 includes a network bus 134, a BPU bus 136 and an X-bus 138. The network bus 134 complies with the Multi-Vendor Integration Protocol (MVIP) while the BPU bus 136 and the X-bus are derivatives of the MVIP specification. The HPU 130 provides a management interface for MCU operations. Each of the foregoing MCU elements is further described in the above-referenced U.S. Pat. Nos. 5,600,646 and 6,404,745.
The H.323 functionality is provided by the addition of a Gateway Processing Unit (GPU) 128 and a modified BPU referred to as a BPU-G 122A. The GPU 128 runs H.323 protocols for call signaling and the creation and control of audio, video and data streams through an Ethernet or other LAN interface 140 to endpoint terminals. The BPU-G 122A is a BPU 122 that is programmed to process audio, video and data packets received from the GPU 128.
The MCU operation is now described at a high level, initially for circuit switched conferencing and then for packet switched H.323 conferencing. In circuit switched conferencing, digital data frames from H.320 circuit switched endpoint terminals are made available on the network bus 134 through a network interface 142 to an NIU 120. The BPUs 122 process the data frames from the network bus 134 to produce data frames which are made available to other BPUs 122 on the BPU bus 136. The BPUs 122 also extract audio information from the data frames.
The BPUs 122 combine compressed video information and mixed encoded audio information into frames that are placed on the network bus 134 for transmission to respective H.320 terminals.
In cases where the audiovisual terminals operate at different transmission rates or with different compression algorithms or are to be mixed into a composite image, multiple video inputs are sent to the VPU 124 where the video inputs are decompressed, mixed and recompressed into a single video stream. This single video stream is then passed back through the BPU 122 which switches the video stream to the appropriate endpoint terminals.
For packet-based H.323 conferencing, the GPU 128 makes audio, video and data packets available on the network bus 134. The data packets are processed through the DPU 126. The BPU-G 122A processes audio and video packets from the network bus 134 to produce audio and video broadcast mixes which are placed on the network bus 134 for transmission to respective endpoint terminals through the GPU 128. In addition, the BPU-G 122A processes audio and video packets to produce data frames which are made available to the BPUs 122 on the BPU bus 136. In this manner, the MCU 10 serves a gateway function whereby regular BPUs 122 and the BPU-G 122A can exchange audio and video between H.320 and H.323 terminals transparently.
Having described the components of the MCU 10 that enable the basic conference bridging functions, the flexibility provided by the VPU 124 is now described at a high level with reference to the functional block diagram of FIG. 2. In the MCU 10, compressed video information from up to five audiovisual terminals that are in the same conference is routed to a particular VPU 124 over the BPU bus 136. The VPU 124 comprises five video compression processors (VCP0-VCP4), each having a video decoder/encoder pair 102-i, 106-i, and pixel scaling blocks 104-i, 108-i.
A video decoder/encoder pair 102-i, 106-i is assigned to the compressed video information stream associated with each particular site in the conference. Each video decoder 102-i decodes the compressed video information using the algorithm that matches the encoding algorithm of its associated site. Included as part of the video decoder 102-i may be the processing to determine the framing, packets, and checksums that may be part of the transmission protocol. It should be noted that a processor encoded video stream can be assigned to multiple sites (e.g., a continuous presence application having more than five sites in the conference). In addition, a decoder/encoder pair 102-i, 106-i can switch among the sites within a conference.
The decoded video information (e.g., pixels) is scaled up or down, if necessary, by a pixel scaling block 104-i to match the pixel resolution requirements of other sites in the conference that will be encoding the scaled pixels. For example, a desktop system may encode at a resolution of 256×240 pixels while an H.320 terminal may require a pixel resolution of 352×288 pixels for a Common Intermediate Format (CIF) image. Other common formats include Quarter Common Intermediate Format (QCIF) (176×144 pixels), 4CIF (704×576), SIF (352×240), 4SIF (704×480), VGA (640×480), SVGA (800×600) and XGA (1024×768).
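The effect of a pixel scaling block can be illustrated with a simple nearest-neighbour resampler. The text does not state which scaling algorithm is used, so nearest-neighbour is an assumption chosen for brevity, and the function name and the list-of-rows picture representation are hypothetical.

```python
def scale_nearest(image, out_width, out_height):
    """Resample an image (list of rows of pixel values) to the requested
    resolution by picking the nearest source pixel for each output pixel."""
    in_height = len(image)
    in_width = len(image[0])
    return [
        [image[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


# Small example: scale a 2x2 picture up to 4x4.
small = [[1, 2], [3, 4]]
scaled = scale_nearest(small, 4, 4)

# Resolutions from the text: 256x240 desktop picture scaled to 352x288 (CIF).
desktop = [[0] * 256 for _ in range(240)]
cif = scale_nearest(desktop, 352, 288)
```

A practical scaler would normally filter the picture rather than simply repeating or dropping pixels, but the mapping between input and output resolutions is the same.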
The VPU 124 includes a pixel bus 182 and memory 123. The system disclosed in U.S. Pat. No. 5,600,646 uses a time division multiplex bus. In particular, each decoder 102-j outputs pixels onto pixel bus 182 to memory 123. Each encoder 106-j may retrieve any of the images from the memory 123 on the pixel bus for re-encoding and/or spatial mixing or compositing. Another pixel scaling block 108-j is coupled between the pixel bus 182 and the encoder 106-j for adjusting the pixel resolution of the sampled image as needed.
A continuous presence application is now described with reference to FIGS. 3 and 4. For simplicity the endpoint terminals as shown are H.320 terminals. In FIG. 3, data from sites 38 arrive over a communications network to respective NIUs 120. Five sites 38 (A, B, C, D, E) are connected in the conference. Sites A and B are shown connected to a particular NIU 120 which supports multiple codec connections (e.g., a T1 interface). The other sites C, D, and E connect to NIUs 120 supporting only a single codec connection (e.g., an ISDN interface). Each site 38 places one or more octets of digital data onto the network bus 134 as unsynchronized H.221 framed data. The BPUs 122 then determine the H.221 framing and octet alignment. This aligned data is made available to all other units on the BPU bus 136. The BPUs 122 also extract audio information from the H.221 frames and decode the audio into 16 bit PCM data. The decoded audio data is made available on the BPU bus 136 for mixing with audio data from other conference sites.
Aligned H.221 frames are received by the VPU 124 for processing by encoder/decoder elements called video compression processors (VCPs). The VPU 124 has five VCPs (FIG. 2) which in this example are respectively assigned to sites A, B, C, D, E. A VCP on the VPU 124 which is assigned to site E is functionally illustrated in FIG. 4. Compressed video information (H.261) is extracted from the H.221 frames and decoded by the VCP as image X. The decoded video image X is placed on the pixel bus 182 through a scaling block. FIG. 4 shows the pixel bus 182 with decoded video frames from each site A, B, C, D, E successively retrieved from memory 123 identified by their respective RAM addresses. The VCP assigned to site E receives the decoded video frames from sites A, B, C and D which are then tiled (spatially mixed) into a single composite image I. The tiled image I is then encoded as H.261 video within H.221 framing and placed on the BPU bus 136 (FIG. 3) for BPU processing as described above.
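The tiling of four decoded frames into one composite image can be sketched as below. This is an illustration only: frames are assumed to be lists of rows of pixel values that have already been scaled to quadrant size, and the function name and the placement of sites in quadrants are hypothetical.

```python
def tile_quadrants(top_left, top_right, bottom_left, bottom_right):
    """Spatially mix four equally sized frames into one composite image
    arranged as a 2x2 grid."""
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom


# Assumed toy frames for sites A-D, each 2x2 pixels and filled with a
# constant value so the quadrants are easy to recognise.
frame_a = [[1, 1], [1, 1]]
frame_b = [[2, 2], [2, 2]]
frame_c = [[3, 3], [3, 3]]
frame_d = [[4, 4], [4, 4]]

# Composite image for site E, built from sites A, B, C and D.
composite = tile_quadrants(frame_a, frame_b, frame_c, frame_d)
```

The composite produced here is raw pixel data; in the arrangement described above it would subsequently be passed to the encoder of the VCP.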
As can be seen from the description above, transcoding requires considerable processing resources, as raw pixel data has to be mixed and thereafter encoded to form a mixed view or a Continuous Presence (CP) view. To avoid self view, i.e. to avoid that a CP view contains a picture of the participant to which it is transmitted, the MCU has to include at least one encoder for each picture in a CP view. To allow for CP 16, the MCU must then include at least 16 encoders.
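The encoder count follows from the fact that each displayed participant needs a differently composed view. The sketch below makes this explicit under the assumption that every site shown in the CP view receives a view of all the other displayed sites; the function names are hypothetical.

```python
def views_without_self(sites_in_view):
    """Map each displayed site to the list of pictures it should receive,
    i.e. every displayed site except itself."""
    return {viewer: [site for site in sites_in_view if site != viewer]
            for viewer in sites_in_view}


def encoders_needed(sites_in_view):
    """Each distinct composition must be encoded separately, so the number
    of distinct views is a lower bound on the number of encoders."""
    views = views_without_self(sites_in_view)
    return len({tuple(view) for view in views.values()})


# Five-site example matching the sites A-E used above.
five_site_views = views_without_self(["A", "B", "C", "D", "E"])

# Sixteen displayed sites, with assumed names S0..S15.
cp16_encoders = encoders_needed([f"S{i}" for i in range(16)])
```

In this sketch, site E receives pictures from A, B, C and D, and a sixteen-picture view yields sixteen distinct compositions.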