This invention relates generally to the field of multimedia applications. More particularly, this invention relates to an encoder/compressor, a decoder/decompressor, a new frame type, and a method for encoding/decoding video sequences and providing access to a video stream.
Multimedia applications that include audio and video information have come into greater use. Several multimedia groups have established and proposed standards for compressing/encoding and decompressing/decoding the audio and video information. Examples are the MPEG standards, established by the Moving Picture Experts Group, and the standards developed by the ITU Telecommunication Standardization Sector (ITU-T).
The following are incorporated herein by reference:
G. Bjontegaard, “H.26L Test Model Long Term Number 6 (TML-6) draft0”, document VCEG-L45, ITU-T Video Coding Experts Group Meeting, Eibsee, Germany, 09-12 Jan. 2001.
Keiichi Hibi, “Report of the Ad Hoc Committee on H.26L Development”, document Q15-H-07, ITU-T Video Coding Experts Group (Question 15) Meeting, Berlin, 03-06 Aug. 1999.
Gary S. Greenbaum, “Remarks on the H.26L Project: Streaming Video Requirements for Next Generation Video Compression Standards”, document Q15-G-11, ITU-T Video Coding Experts Group (Question 15) Meeting, Monterey, 16-19 Feb. 1999.
G. Bjontegaard, “Recommended Simulation Conditions for H.26L”, document Q15-I-62, ITU-T Video Coding Experts Group (Question 15) Meeting, Red Bank, N.J., 19-22 Oct. 1999.
Michael Orzessek and Peter Sommer, ATM & MPEG-2: Integrating Digital Video into Broadband Networks (Prentice Hall, Upper Saddle River, New Jersey).
Video sequences comprise a sequence of still images, and the illusion of motion is created by displaying consecutive images at a relatively fast rate. Typically, the display rate is between five and thirty frames per second. A typical scene recorded by a camera comprises stationary elements and moving elements. An example of stationary elements is background scenery. The moving elements may take many different forms, for example, the face of a news reader, moving traffic, and so on. Alternatively, the camera recording the scene may itself be moving, in which case all elements of the image have the same kind of motion. In either case, the change between one video frame and the next is typically rather small, i.e., consecutive frames tend to be similar. This similarity is referred to as the correlation between frames or temporal redundancy. Likewise, in typical video sequences, neighboring regions/pixels within a frame exhibit strong similarities. This type of similarity is referred to as spatial redundancy or spatial correlation. The redundancy in video sequences can thus be categorized into spatial and temporal redundancy. The purpose of video coding is to remove the redundancy in the video sequence.
In the existing video coding standards, there are three types of video frame encoding algorithms, classified by the type of redundancy exploited, temporal or spatial. An intra-frame or I-type frame 200, depicted in FIG. 1A, is a frame of video data that is coded exploiting only the spatial correlation of the pixels within the frame, without using any information from past or future frames. I-frames are utilized as the basis for decoding/decompression of other frames. FIG. 1B depicts a predictive frame or P-type frame 210. The P-type frame or picture is a frame that is encoded/compressed using prediction from I-type or P-type frames in its past, in this case I.sub.1 200. Reference numeral 205a represents the motion compensated prediction information used to create the P-type frame 210. Since adjacent frames in a typical video sequence are highly correlated, higher compression efficiencies are achieved when using P-frames instead of I-frames. On the other hand, P-frames cannot be decoded independently of the previous frames.
FIG. 1C depicts a bi-directional frame or B-type frame 220. The B-type frame or picture is a frame that is encoded/compressed using a prediction derived from the I-type reference frame (200 in this example) or P-type reference frame in its past, from the I-type or P-type reference frame (210 in this example) in its future, or from a combination of both. FIG. 2 represents a group of pictures in what is called display order: I.sub.1 B.sub.2 B.sub.3 P.sub.4 B.sub.5 P.sub.6. FIG. 2 illustrates the B-type frames inserted between I-type and P-type frames and the direction in which motion compensation information flows.
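Because a B-type frame depends on a reference frame in its future, that reference must be decoded before the B-type frame itself. The following Python sketch illustrates one common way of deriving such a coding order from the display order above. It is an illustration only: the function name `coding_order` and the string labels are assumptions introduced here, and the reordering rule is the conventional one rather than anything prescribed by the text.

```python
def coding_order(display_order):
    """Reorder frames so every B-frame follows both of its reference frames.

    Frames are labelled by hypothetical strings such as "I1", "B2", "P4".
    """
    out = []
    pending_b = []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)   # must wait for its future reference
        else:
            out.append(frame)         # I- or P-frame: can serve as a reference
            out.extend(pending_b)     # B-frames lying between the two references
            pending_b = []
    out.extend(pending_b)             # trailing B-frames without a future reference
    return out


print(coding_order(["I1", "B2", "B3", "P4", "B5", "P6"]))
# ['I1', 'P4', 'B2', 'B3', 'P6', 'B5']
```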
Referring to FIGS. 3 and 4, a communication system comprising an encoder 300 of FIG. 3 and a decoder 400 of FIG. 4 is operable to communicate a multimedia sequence between a sequence generator and a sequence receiver. Other elements of the video sequence generator and receiver are not shown for the purposes of simplicity. The communication path between sequence generator and receiver may take various forms, including but not limited to a radio-link.
Encoder 300 is shown in FIG. 3 coupled to receive video input on line 301 in the form of a frame to be encoded, called the current frame, I(x,y). Here (x,y) denotes the location of a pixel within the frame. In the encoder, the current frame I(x,y) is partitioned into rectangular regions (blocks) of M×N pixels. These blocks are encoded using either only spatial correlation (intra coded blocks) or both spatial and temporal correlation (inter coded blocks). The following description concentrates on inter blocks.
Each inter coded block is predicted using motion information from the previously coded and transmitted frame, called the reference frame and denoted R(x,y), which is available in the frame memory 350 of the encoder 300. The motion information of the block may be represented by a two-dimensional motion vector (Δx(x,y), Δy(x,y)), where Δx(x,y) is the horizontal and Δy(x,y) is the vertical displacement of the pixel at location (x,y) between the current frame and the reference frame. The motion vectors (Δx( ), Δy( )) are calculated by the motion estimation and coding block 370. The inputs to the motion estimation and coding block 370 are the current frame and the reference frame. The motion estimation and coding block finds the block in the reference frame that best matches the current block according to a certain criterion. The motion information is provided to a Motion Compensated (MC) prediction block 360. The MC prediction block is also coupled to the frame memory 350 to receive the reference frame. In the MC block 360, the prediction frame P(x,y) is constructed from the motion vectors of each inter block together with the reference frame by:
P(x,y)=R(x+Δx(x,y), y+Δy(x,y)).
Notice that the values of the prediction frame for inter blocks are calculated from the previously decoded frame. This type of prediction is referred to as motion compensated prediction. It is also possible to use more than one reference frame. In such a case, different blocks of the current frame may use different reference frames. For pixels (x,y) that belong to intra blocks, prediction blocks are either calculated from the neighboring regions within the same frame or are simply set to zero.
Subsequently, the prediction error E(x,y) is defined as the difference between the current frame I(x,y) and the prediction frame P(x,y) and is given by:
E(x,y)=I(x,y)−P(x,y).
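By way of illustration only, motion estimation and motion compensated prediction for a single block can be sketched in Python as follows. The text leaves the matching criterion open ("a certain criterion"); the sum of absolute differences (SAD) and the exhaustive search used here are assumptions, as are all function and parameter names.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + j + dy][bx + i + dx])
               for j in range(n) for i in range(n))


def estimate_motion(cur, ref, bx, by, n, search):
    """Exhaustive search over displacements in [-search, search] that keep the
    candidate block inside the reference frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def predict_block(ref, bx, by, dx, dy, n):
    """P(x,y) = R(x + dx, y + dy) for every pixel of the block."""
    return [[ref[by + j + dy][bx + i + dx] for i in range(n)] for j in range(n)]


# Synthetic 8x8 reference frame and a current frame whose content is the
# reference displaced by (dx, dy) = (1, -1); borders are clamped.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[min(max(y - 1, 0), 7)][min(max(x + 1, 0), 7)] for x in range(8)]
       for y in range(8)]

dx, dy = estimate_motion(cur, ref, bx=2, by=2, n=4, search=2)
pred = predict_block(ref, 2, 2, dx, dy, 4)
# Prediction error E(x,y) = I(x,y) - P(x,y) for the block.
err = [[cur[2 + j][2 + i] - pred[j][i] for i in range(4)] for j in range(4)]
print((dx, dy), err)
```

In this synthetic case the search recovers the displacement exactly, so the prediction error of the block is zero and only the motion vector would need to be transmitted.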
In transform block 310, each K×L block in the prediction error E(x,y) is represented as a weighted sum of transform basis functions f.sub.ij(x,y):
$$E(x,y) = \sum_{i=1}^{K} \sum_{j=1}^{L} c_{err}(i,j)\, f_{ij}(x,y).$$
The weights c.sub.err(i,j) corresponding to the basis functions are called prediction error coefficients. The coefficients c.sub.err(i,j) can be calculated by performing the so-called forward transform. These coefficients are quantized in quantization block 320:
I.sub.err(i,j)=Q(c.sub.err(i,j), QP)
where I.sub.err(i,j) are the quantized coefficients and QP is the quantization parameter. Quantization introduces a loss of information, but the quantized coefficients can be represented with a smaller number of bits. The level of compression (loss of information) is controlled by adjusting the value of the quantization parameter (QP).
Copy coded blocks are a special type of inter coded block. For copy coded blocks, the values of both the motion vectors and the quantized prediction error coefficients I.sub.err are equal to 0.
Motion vectors and quantized coefficients are usually encoded using an entropy coder, for example, Variable Length Codes (VLC). The purpose of entropy coding is to reduce the number of bits needed for their representation. Certain values of motion vectors and quantized coefficients are more likely than other values, and entropy coding techniques assign fewer bits to the more likely values than to those that are less likely to occur. The entropy encoded motion vectors and quantized coefficients, as well as other information needed to represent each coded frame of the image sequence, are multiplexed at a multiplexer 380, and the output constitutes a bitstream 415, which is transmitted to the decoder 400 of FIG. 4.
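The principle that more likely values receive shorter codes can be illustrated with a Huffman code, sketched below in Python. Huffman coding is used here only as a familiar example of a variable length code; it is not asserted to be the code table of any particular standard, and the sample data and names are assumptions.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bitstring} built from the symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)      # two least likely subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized coefficient levels: zero is by far the most frequent.
levels = [0] * 12 + [1] * 4 + [-1] * 3 + [2] * 1
code = huffman_code(levels)
total_bits = sum(len(code[v]) for v in levels)
print(code, total_bits)
```

For this sample, the value 0 receives a 1-bit code and the rare values receive 3-bit codes, so the 20 levels take 32 bits instead of the 40 bits a fixed 2-bit code would need.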
For color pictures, color information must be provided for every pixel of an image. Typically, color information is coded in terms of the primary color components red, green and blue (RGB) or using a related luminance/chrominance model, known as the YUV model. This means that there are three components to be encoded; for example, for the YUV model, one luminance and two color difference components (YCbCr). The encoding of the luma component is performed as described above. The encoding of chroma is similar to that of luma and uses the same coding blocks, but certain values calculated while encoding the luma component are reused when encoding the chroma components; for example, motion vectors obtained from the luma component are reused for the chroma components.
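The relationship between the RGB representation and a luminance/chrominance representation can be sketched as follows. The text does not specify conversion coefficients; the full-range ITU-R BT.601 weights used below are an assumption chosen for illustration, as is the function name.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr).

    Assumes full-range ITU-R BT.601 weights; other standards use other weights.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    cb = (b - y) / 1.772 + 128.0            # blue color difference
    cr = (r - y) / 1.402 + 128.0            # red color difference
    return y, cb, cr


print(rgb_to_ycbcr(128, 128, 128))  # a gray pixel has neutral chroma (about 128)
print(rgb_to_ycbcr(255, 0, 0))      # a red pixel has Cr well above 128
```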
The remaining blocks in encoder 300 represent the decoder loop of the encoder. The decoder loop reconstructs the frames from the calculated values in the same way as the decoder 400 does from bitstream 415. Therefore the encoder will, at all times, have the same reconstructed frames as the decoder. These blocks are listed here and are described in detail together with decoder 400. The quantization block 320 is coupled to both the multiplexer 380 and an inverse quantization block 330, which is in turn coupled to an inverse transform block 340. Blocks 330 and 340 provide the decoded prediction error E.sub.c(x,y), which is added to the MC predicted frame P(x,y) by adder 345. These values can be further normalized and filtered. The resulting frame is called the reconstructed frame and is stored in frame memory 350 to be used as a reference for the prediction of future frames.
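The reason the encoder predicts from its own reconstructed frames, rather than from the original frames, can be seen in a deliberately simplified sketch in which each "frame" is a single sample, the prediction is the previous sample, and the transform is omitted. The uniform quantizer and every name below are assumptions made for illustration.

```python
def q(e, step):
    """Uniform quantizer with rounding; returns an integer level."""
    level = (abs(e) + step // 2) // step
    return level if e >= 0 else -level


def encode(frames, step, closed_loop=True):
    levels = []
    recon = 0       # what the decoder will have reconstructed so far
    prev_orig = 0   # previous original sample (open-loop variant only)
    for x in frames:
        ref = recon if closed_loop else prev_orig
        level = q(x - ref, step)
        levels.append(level)
        recon += level * step   # mirrors the decoder's update
        prev_orig = x
    return levels


def decode(levels, step):
    recon, out = 0, []
    for level in levels:
        recon += level * step   # prediction plus dequantized error
        out.append(recon)
    return out


frames = [10, 13, 16, 19, 22, 25]
closed = decode(encode(frames, 4, closed_loop=True), 4)
opened = decode(encode(frames, 4, closed_loop=False), 4)
print(closed)  # [12, 12, 16, 20, 24, 24]
print(opened)  # [12, 16, 20, 24, 28, 32]
```

With the decoder loop in place the reconstruction error stays within half a quantization step, whereas predicting from the original samples lets the quantization error accumulate from frame to frame.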
FIG. 4 shows the decoder 400 of the communication system. Bitstream 415 is received from encoder 300 of FIG. 3. Bitstream 415 is demultiplexed via demultiplexer 410. Dequantized coefficients d.sub.err(i,j) are calculated in the inverse quantization block 420:
d.sub.err(i,j)=Q^{-1}(I.sub.err(i,j), QP).
Inverse transform is performed on the dequantized coefficients to reconstruct the prediction error in inverse transform block 430:
$$E_{c}(x,y) = \sum_{i=1}^{K} \sum_{j=1}^{L} d_{err}(i,j)\, f_{ij}(x,y).$$
The prediction block P(x,y) for the current block is calculated by using the received motion vectors and the previously decoded reference frame(s). The pixel values of the current frame are then reconstructed by adding prediction P(x,y) to the prediction error E.sub.c(x,y) in adder 435:
I.sub.c(x,y)=R(x+Δx, y+Δy)+E.sub.c(x,y).
These values can be further normalized and filtered to obtain the reconstructed frame. The reconstructed frame is stored in frame memory 440 to be used as reference frame for future frames.
An example of a forward transform is provided by “H.26L Test Model Long Term Number 6 (TML-6) draft0”, document VCEG-L45, ITU-T Video Coding Experts Group Meeting, Eibsee, Germany, 09-12 Jan. 2001. The forward transformation of four pixels a, b, c, d into 4 transform coefficients A, B, C, D is defined by:
A=13a+13b+13c+13d
B=17a+7b−7c−17d
C=13a−13b−13c+13d
D=7a−17b+17c−7d
The inverse transformation of transform coefficients A, B, C, D into 4 pixels a′, b′, c′, d′ is defined by:
a′=13A+17B+13C+7D
b′=13A+7B−13C−17D
c′=13A−7B−13C+17D
d′=13A−17B+13C−7D
The transform/inverse transform is performed for 4×4 blocks by applying the one-dimensional transform/inverse transform defined above both vertically and horizontally.
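The separable application of the one-dimensional transform to a 4×4 block can be sketched in Python as follows. The coefficients are those given above; the helper names and the sample block are assumptions. Note that the forward transform followed by the inverse transform returns each pixel multiplied by 676 per dimension (13² + 17² + 13² + 7² = 676), i.e., by 676² for a 4×4 block, which is why a normalization step is needed.

```python
def fwd1d(a, b, c, d):
    """One-dimensional forward transform of four samples."""
    return (13 * a + 13 * b + 13 * c + 13 * d,
            17 * a + 7 * b - 7 * c - 17 * d,
            13 * a - 13 * b - 13 * c + 13 * d,
            7 * a - 17 * b + 17 * c - 7 * d)


def inv1d(A, B, C, D):
    """One-dimensional inverse transform of four coefficients."""
    return (13 * A + 17 * B + 13 * C + 7 * D,
            13 * A + 7 * B - 13 * C - 17 * D,
            13 * A - 7 * B - 13 * C + 17 * D,
            13 * A - 17 * B + 13 * C - 7 * D)


def apply2d(block, t):
    """Apply the 1-D transform `t` horizontally, then vertically."""
    rows = [t(*row) for row in block]            # horizontal pass
    cols = [t(*col) for col in zip(*rows)]       # vertical pass on each column
    return [list(r) for r in zip(*cols)]         # transpose back to rows


block = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
coeffs = apply2d(block, fwd1d)
back = apply2d(coeffs, inv1d)
scale = 676 * 676
print([[v // scale for v in row] for row in back])  # recovers `block`
```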
In “H.26L Test Model Long Term Number 6 (TML-6) draft0”, document VCEG-L45, ITU-T Video Coding Experts Group Meeting, Eibsee, Germany, 09-12 Jan. 2001, an additional 2×2 transform of the DC coefficients is performed for the chroma components as follows. The chroma components are partitioned into 8×8 blocks called macroblocks. After the 4×4 transform of each of the four blocks in an 8×8 macroblock, the DC coefficients, i.e., the (0,0) coefficients, of the blocks are rearranged and labeled DC0, DC1, DC2, and DC3, and an additional transformation is performed on these DC coefficients by:
DCC(0,0)=(DC0+DC1+DC2+DC3)/2
DCC(1,0)=(DC0−DC1+DC2−DC3)/2
DCC(0,1)=(DC0+DC1−DC2−DC3)/2
DCC(1,1)=(DC0−DC1−DC2+DC3)/2
The corresponding inverse transform is defined by:
DC0=(DCC(0,0)+DCC(1,0)+DCC(0,1)+DCC(1,1))/2
DC1=(DCC(0,0)−DCC(1,0)+DCC(0,1)−DCC(1,1))/2
DC2=(DCC(0,0)+DCC(1,0)−DCC(0,1)−DCC(1,1))/2
DC3=(DCC(0,0)−DCC(1,0)−DCC(0,1)+DCC(1,1))/2
In “H.26L Test Model Long Term Number 6 (TML-6) draft0”, document VCEG-L45, ITU-T Video Coding Experts Group Meeting, Eibsee, Germany, 09-12 Jan. 2001, the values of the reconstructed image are obtained by normalizing the results of the inverse transform with a shift by 20 bits (with rounding).
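That the two 2×2 transforms above are inverses of each other can be checked with a short Python sketch. Exact rational arithmetic is used so that the division by 2 loses nothing; whether an actual codec uses integer division at this point is not stated in the text, so this choice, like the function names, is an assumption.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def dc_forward(dc0, dc1, dc2, dc3):
    """2x2 transform of the four chroma DC coefficients; keys are (i, j)."""
    return {(0, 0): (dc0 + dc1 + dc2 + dc3) * HALF,
            (1, 0): (dc0 - dc1 + dc2 - dc3) * HALF,
            (0, 1): (dc0 + dc1 - dc2 - dc3) * HALF,
            (1, 1): (dc0 - dc1 - dc2 + dc3) * HALF}


def dc_inverse(dcc):
    """Inverse 2x2 transform; returns (DC0, DC1, DC2, DC3)."""
    a, b, c, d = dcc[(0, 0)], dcc[(1, 0)], dcc[(0, 1)], dcc[(1, 1)]
    return ((a + b + c + d) * HALF,
            (a - b + c - d) * HALF,
            (a + b - c - d) * HALF,
            (a - b - c + d) * HALF)


dcc = dc_forward(5, -3, 8, 2)
print(dc_inverse(dcc) == (5, -3, 8, 2))  # True
```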
An example of quantization/dequantization is provided by “H.26L Test Model Long Term Number 6 (TML-6) draft0”, document VCEG-L45, ITU-T Video Coding Experts Group Meeting, Eibsee, Germany, 09-12 Jan. 2001. A coefficient c is quantized in the following way:
I=(c×A(QP)+f×2^20)//2^20
where f may be in the range (−0.5 to +0.5) and f may have the same sign as c. Here // denotes division with truncation. The dequantized coefficient is calculated as follows:
d=I×B(QP)
Values of A(QP) and B(QP) are given below:
A(QP=0, . . . , 31)=[620, 553, 492, 439, 391, 348, 310, 276, 246, 219, 195, 174, 155, 138, 123, 110, 98, 87, 78, 69, 62, 55, 49, 44, 39, 35, 31, 27, 24, 22, 19, 17];
B(QP=0, . . . , 31)=[3881, 4351, 4890, 5481, 6154, 6914, 7761, 8718, 9781, 10987, 12339, 13828, 15523, 17435, 19561, 21873, 24552, 27656, 30847, 34870, 38807, 43747, 49103, 54683, 61694, 68745, 77615, 89113, 100253, 109366, 126635, 141533];
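The quantization and dequantization rules, together with the tables A(QP) and B(QP) above, can be sketched in Python as follows. The offset magnitude of 0.25 for f is an arbitrary choice within the permitted range, and the function and parameter names are assumptions.

```python
A = [620, 553, 492, 439, 391, 348, 310, 276, 246, 219, 195, 174, 155, 138,
     123, 110, 98, 87, 78, 69, 62, 55, 49, 44, 39, 35, 31, 27, 24, 22, 19, 17]
B = [3881, 4351, 4890, 5481, 6154, 6914, 7761, 8718, 9781, 10987, 12339,
     13828, 15523, 17435, 19561, 21873, 24552, 27656, 30847, 34870, 38807,
     43747, 49103, 54683, 61694, 68745, 77615, 89113, 100253, 109366,
     126635, 141533]
SHIFT = 20


def quantize(c, qp, f_mag=0.25):
    """I = (c*A(QP) + f*2^20) // 2^20, with f carrying the sign of c.

    f_mag is a hypothetical parameter: the magnitude of f, below 0.5.
    """
    f = f_mag if c >= 0 else -f_mag
    num = c * A[qp] + int(f * (1 << SHIFT))
    level = abs(num) >> SHIFT            # division with truncation toward zero
    return level if num >= 0 else -level


def dequantize(level, qp):
    """d = I * B(QP)."""
    return level * B[qp]


c = 27040                                 # sample integer transform coefficient
for qp in (0, 16, 31):
    level = quantize(c, qp)
    print(qp, level, dequantize(level, qp))
```

At QP=0 the sample coefficient maps to level 16, while at QP=31 it is quantized to zero, which illustrates how a larger quantization parameter trades information for fewer bits.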
Video streaming has emerged as one of the essential applications over the fixed internet and, in the near future, over 3G multimedia networks. In streaming applications, the server starts streaming the pre-encoded video bitstream to the receiver upon a request from the receiver, which plays the stream as it is received, with a small delay. The best-effort nature of today's networks causes variations in the effective bandwidth available to a user as network conditions change. The server should then scale the bitrate of the compressed video to accommodate these variations. In the case of conversational services, which are characterized by real-time encoding and point-to-point delivery, this is achieved by adjusting source encoding parameters, such as the quantization parameter or the frame rate, on the fly based on network feedback. In typical streaming scenarios, where an already encoded video bitstream is to be streamed to the client, this solution cannot be applied.
The simplest way of achieving bandwidth scalability in case of pre-encoded sequences is by producing multiple and independent streams of different bandwidth and quality. The server then dynamically switches between the streams to accommodate variations of the bandwidth available to the client.
Now assume that multiple bitstreams corresponding to the same video sequence have been generated independently with different encoding parameters, such as the quantization parameter. Since the encoding parameters are different for each bitstream, the reconstructed frames of different bitstreams at the same time instant will not be the same. Therefore, switching between bitstreams, i.e., starting to decode a different bitstream, at arbitrary locations would lead to visual artifacts due to the mismatch between the reference frames used to obtain the predicted frame. Furthermore, the visual artifacts would not be confined to the frame at the switching point but would propagate in time due to motion compensated coding.
In the current video encoding standards, perfect (mismatch-free) switching between bitstreams is possible only at positions where the future frames/regions do not use any information previous to the current switching location, i.e., at I-frames. Furthermore, by placing I-frames at fixed (e.g., 1 sec) intervals, VCR functionalities, such as random access or “Fast Forward” and “Fast Backward” (increased playback rate) for streaming video content, are achieved. A user may skip a portion of video and restart playing at any I-frame location. Similarly, an increased playback rate can be achieved by transmitting only I-pictures. The drawback of using I-frames in these applications is that, since I-frames do not utilize temporal redundancy, they require a much larger number of bits than P-frames.
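The mismatch caused by switching at a predicted frame, and its absence when switching at an I-frame, can be illustrated with a simplified sketch in which each "frame" is a single sample and prediction is from the previous reconstructed sample. The two quantization step sizes, the I-frame period and all names are assumptions made for illustration.

```python
def q(e, step):
    """Uniform quantizer with rounding; returns an integer level."""
    level = (abs(e) + step // 2) // step
    return level if e >= 0 else -level


def encode(frames, step, intra_period):
    """Return the stream [(kind, level), ...] and the encoder's reconstruction."""
    stream, recons, recon = [], [], 0
    for n, x in enumerate(frames):
        intra = (n % intra_period == 0)
        ref = 0 if intra else recon        # I-frames use no earlier information
        level = q(x - ref, step)
        recon = ref + level * step
        stream.append(("I" if intra else "P", level))
        recons.append(recon)
    return stream, recons


def decode_with_switch(stream_a, step_a, stream_b, step_b, switch_at):
    """Decode stream_a before frame `switch_at`, stream_b from there on."""
    out, recon = [], 0
    for n in range(len(stream_a)):
        from_a = n < switch_at
        kind, level = stream_a[n] if from_a else stream_b[n]
        step = step_a if from_a else step_b
        ref = 0 if kind == "I" else recon
        recon = ref + level * step
        out.append(recon)
    return out


frames = [10, 13, 16, 19, 22, 25, 28, 31]
a, recon_a = encode(frames, step=2, intra_period=4)   # higher quality stream
b, recon_b = encode(frames, step=8, intra_period=4)   # lower bitrate stream

at_p = decode_with_switch(a, 2, b, 8, switch_at=2)    # switch at a P-frame
at_i = decode_with_switch(a, 2, b, 8, switch_at=4)    # switch at an I-frame
print(recon_b)  # [8, 16, 16, 16, 24, 24, 32, 32]
print(at_p)     # [10, 14, 14, 14, 24, 24, 32, 32]
print(at_i)     # [10, 14, 16, 20, 24, 24, 32, 32]
```

When the switch happens at the P-frame, frames 2 and 3 are reconstructed as 14 instead of the 16 that the second stream's encoder assumed, and the error persists until the next I-frame. When the switch happens at the I-frame, the output from that frame onward is identical to the second stream's own reconstruction.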
The above-mentioned references are exemplary only and are not meant to be limiting in respect to the resources and/or technologies available to those skilled in the art.