1. Field of the Invention
Embodiments relate to methods and devices for mixing video streams at the macroblock level.
2. Background of the Related Art
In certain applications, the contents of multiple video streams must be displayed simultaneously on one device. For example, video conferences are known that include more than two participants, where the video and audio signals are transmitted in real time between two or more locations. For this purpose, the terminals or soft clients of the users are equipped with a camera, now mostly a USB webcam, and a microphone or a headset as input devices, as well as a screen and a speaker or headset as output devices. Encoding and decoding of the video and audio signals can be hardware-based via plug-in cards or purely software-based. Today, users of a video conferencing system typically demand not only that the currently speaking participant be seen by all other participants, as is the case with “voice activated switching” systems, but also that all or at least several of the participants in the conversation can see each other simultaneously on the screen, as is the case with “continuous presence” systems.
An additional application example is the field of video surveillance, where several video streams from different surveillance cameras are decoded simultaneously and displayed live on a screen in the control room. If the system uses only one decoder, then only one video stream from one surveillance camera can be decoded and displayed at any given time.
Because many installed terminals or soft clients of video conferencing systems today include only a single decoder, it is not possible to decode and display several video streams simultaneously on these terminals or soft clients. For this reason, it is very common today to use a video bridge or a multipoint control unit (MCU). This is a central unit that receives and processes the encoded video streams of several participants and returns a dedicated video stream to all participants. For this purpose, the video streams must be decoded completely or at least mostly, and the video data must be combined and then encoded into a new video stream. FIG. 4 is a schematic presentation of the complete transcoding of two H.264-coded video streams. Because this method is very complex, it is often realized as a hardware-based implementation, which leads to high equipment costs. Furthermore, transcoding introduces delays through the numerous signal processing steps and quality losses through re-encoding.
An additional known method is the mixing of video streams at slice level as described in the prior application of the same applicant entitled “Mixing of Video Streams” by the inventors Peter Amon and Andreas Hutter.
In the H.264/AVC standard, the macroblocks are organized into so-called slices, each of which can be decoded independently of the other slices. With flexible macroblock ordering (FMO) as defined in the H.264/AVC standard, a flexible assignment of macroblocks to slice groups is possible. According to the method, this possibility is used for mixing several video streams. Thus, a slice group can be defined for each input video stream, and a video mixer can combine, for example, two input streams into one stream with two slice groups. FIG. 5 is a schematic presentation of two H.264-coded video streams being mixed at slice level. However, many decoders in existence today do not support slice groups, so mixing of video streams at slice level cannot be used with them.
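The assignment of macroblocks to slice groups can be pictured as a map with one entry per macroblock in raster order. The following is a minimal sketch only, assuming two input streams of equal height placed side by side; the function names and the layout are hypothetical illustrations and are not taken from the H.264/AVC standard or from the prior application.

```python
def build_slice_group_map(width_a, width_b, height):
    """Return a raster-order list assigning each macroblock to slice group 0 or 1.

    Assumption: stream A occupies the left width_a macroblock columns and
    stream B the right width_b columns of the mixed picture.
    """
    row = [0] * width_a + [1] * width_b
    return row * height


def macroblocks_of_group(mb_map, group):
    """List the macroblock addresses in the mixed picture that belong to one group."""
    return [addr for addr, g in enumerate(mb_map) if g == group]


# Mixed picture of 4 x 2 macroblocks: left half from stream A, right half from stream B.
mb_map = build_slice_group_map(width_a=2, width_b=2, height=2)
print(mb_map)
print(macroblocks_of_group(mb_map, 1))
```

In this sketch each input stream keeps its own slice group, so its slices remain decodable without reference to the other stream; a decoder without slice group support cannot interpret such a map.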
Presumably, a method is known for the video coding standard H.261 that allows several images to be combined into a new image at the macroblock level. This assumption is based on the analyst report “Will Your Next Video Bridge Be Software-Based?” by Wainhouse Research in 2003 (http://www.wainhouse.com/files/papers/wrsw-video-bridges.pdf), which reports on mixing of H.261 video streams without, however, providing more details about the method. Still, the performance measurements suggest that a method as described above and shown schematically in FIG. 6 is used, because so many complete transcoding procedures cannot be performed simultaneously on a computer of the stated performance level.
H.261 uses a variable length code (VLC) method for entropy coding. With variable length codes as used in the H.261 standard, a symbol to be coded is always assigned the same code word from a single code word table. In this manner, no dependence is established between the symbols and thus between the macroblocks. Through simple rearranging of the macroblocks, several video streams can then be assembled into one video stream.
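The effect of a single, fixed code word table can be illustrated with a toy example. This is a sketch under stated assumptions: the code table below is an invented prefix-free code, not one of the actual H.261 tables, and each macroblock is reduced to a short list of symbols. Because every symbol always maps to the same code word, the bit strings of already encoded macroblocks can be reordered without re-encoding.

```python
# Hypothetical prefix-free code table (NOT the real H.261 VLC tables).
CODE_TABLE = {0: "1", 1: "01", 2: "001", 3: "0001"}


def encode_macroblock(symbols):
    """Encode one macroblock; the result depends only on its own symbols."""
    return "".join(CODE_TABLE[s] for s in symbols)


def decode(bits):
    """Decode a bit string by matching code words from the fixed table."""
    inverse = {word: sym for sym, word in CODE_TABLE.items()}
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            symbols.append(inverse[word])
            word = ""
    return symbols


# Two input streams, each with two macroblocks of two symbols.
stream_a = [[0, 2], [1, 1]]
stream_b = [[3, 0], [2, 2]]
bits_a = [encode_macroblock(mb) for mb in stream_a]
bits_b = [encode_macroblock(mb) for mb in stream_b]

# Mixing by rearranging the already encoded macroblocks: A0, B0, A1, B1.
mixed_bits = bits_a[0] + bits_b[0] + bits_a[1] + bits_b[1]
print(decode(mixed_bits))
```

The mixed bit string decodes to the symbols of the four macroblocks in their new order, which is the property the rearranging approach relies on.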
In order to compress the data to be transmitted even further, for example the residual errors from predictions, the differences in the estimated motion vectors, etc., these data are coded using so-called entropy coding. The H.264/AVC standard offers two options for entropy coding, the context-based adaptive variable length coding (CAVLC) method and the context-based adaptive binary arithmetic coding (CABAC) method. Both are based on so-called adaptive context-based entropy coding, either with a variable code length or with binary arithmetic coding, and in this manner achieve performance advantages in the coding process compared to other standards. With CAVLC, the coding decisions for a macroblock depend on adjacent, already encoded macroblocks. With CABAC, encoding of a symbol affects the selection of the code word for the subsequent symbol, such that dependencies between the code words and thus between the macroblocks are created. Consequently, the method for mixing video streams at the macroblock level shown for H.261-encoded streams cannot be applied directly to H.264/AVC-encoded video streams.
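The neighbor dependency in CAVLC can be made concrete with a simplified sketch of how the code table for the coefficient token of a block is chosen from the number of non-zero coefficients in the left and upper neighboring blocks. The function names are hypothetical, the table labels are placeholders, and special cases of the standard (for example chroma DC blocks) are omitted.

```python
def predict_nc(n_left, n_top):
    """Context value from the neighbors' non-zero coefficient counts.

    n_left / n_top are None when the corresponding neighbor is unavailable,
    for example at a picture or slice boundary.
    """
    if n_left is not None and n_top is not None:
        return (n_left + n_top + 1) >> 1  # rounded average of both neighbors
    if n_left is not None:
        return n_left
    if n_top is not None:
        return n_top
    return 0


def select_table(nc):
    """Pick the code table for the coefficient token (labels are placeholders)."""
    if nc < 2:
        return "table_0"
    if nc < 4:
        return "table_1"
    if nc < 8:
        return "table_2"
    return "fixed_length_6_bit"


# Original stream: block at the left picture edge, upper neighbor has 1 coefficient.
original = select_table(predict_nc(None, 1))

# After mixing: the same block now has a left neighbor from another stream
# with 9 non-zero coefficients.
mixed = select_table(predict_nc(9, 1))

print(original, mixed)
```

In this sketch the same block is assigned a different code table once its neighborhood changes, so bits that were valid in the original stream would be misread in the mixed stream. This is why simple rearranging, as with a fixed code word table, is not sufficient here.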