For the purpose of the present discussion, the term “internet” will be used both in its familiar sense and also in its generic sense to identify a network connection over any electronic communications medium or collection of cooperating communications systems.
Currently, most video content which is available over the internet must be pre-loaded in a process which can take many minutes over typical modem connections, after which the video quality and duration can still be quite disappointing. In some contexts video streaming is possible, where the video is decompressed and rendered in real-time as it is being received; however, this is limited to compressed bit-rates which are lower than the capacity of the relevant network connections. The most obvious way of addressing these problems would be to compress and store the video content at a variety of different bit-rates, so that individual clients could choose to browse the material at the bit-rate and attendant quality most appropriate to their needs and patience. Approaches of this type, however, do not represent effective solutions to the video browsing problem. To see this, suppose that the video is compressed at bit-rates of R, 2R, 3R, 4R and 5R. Then storage must be found on the video server for all these separate compressed bit-streams, which is clearly wasteful. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, a complete new version must be downloaded at a higher bit-rate; this new bit-stream must take longer to download, which generally rules out any possibility of video streaming.
To enable real solutions to the remote video browsing problem, scalable compression techniques are essential. Scalable compression refers to the generation of a bit-stream which contains embedded subsets, each of which represents an efficient compression of the original video with successively higher quality. Returning to the simple example above, a scalable compressed video bit-stream might contain embedded subsets with the bit-rates of R, 2R, 3R, 4R and 5R, with comparable quality to non-scalable bit-streams having the same bit-rates. Because these subsets are all embedded within one another, however, the storage required on the video server is identical to that of the highest available bit-rate. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, only the incremental contribution required to achieve the next higher level of quality must be retrieved from the server. In a particular application, a version at rate R might be streamed directly to the client in real-time; if the quality is insufficient, the next rate-R increment could be streamed to the client and added to the previous, cached bit-stream to recover a higher quality rendition in real-time. This process could continue indefinitely without sacrificing the ability to display the incrementally improving video content in real-time as it is being received from the server.
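The comparison in the example above can be made concrete with a little arithmetic. The following is only an illustrative tally per unit of video duration, assuming (as the example does) five rates R to 5R and ignoring any container or signalling overhead. Server storage for separate bit-streams versus one embedded bit-stream is

$$S_{\text{separate}} = R + 2R + 3R + 4R + 5R = 15R, \qquad S_{\text{scalable}} = 5R,$$

and the data a client must receive to move from the rate-R rendition to the rate-2R rendition is

$$T_{\text{separate}} = 2R \quad (\text{total received } R + 2R = 3R), \qquad T_{\text{scalable}} = R \quad (\text{total received } 2R).$$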
The above application could be extended in a number of exciting ways. Firstly, if the scalable bit-stream also contains distinct subsets corresponding to different intervals in time, then a client could interactively choose to refine the quality associated with specific time segments which are of the greatest interest. Secondly, if the scalable bit-stream also contains distinct subsets corresponding to different spatial regions, then clients could interactively choose to refine the quality associated with specific spatial regions over specific periods of time, according to their level of interest. In a training video, for example, a remote client could interactively “revisit” certain segments of the video and continue to stream higher quality information for these segments from the server, without incurring any delay.
To satisfy the needs of applications such as that mentioned above, low bit-rate subsets of the video must be visually intelligible. In practice, this means that most of the bits devoted to a low bit-rate portion of the video are likely to contribute to the reconstruction of the video at a reduced frame rate, since attempting to recover the full frame rate video over a low bit-rate channel will result in unacceptable deterioration of the spatial details within each frame. In order to achieve smooth quality scalability within a compressed video sequence which also offers frame rate scalability, the details required to recover higher frame rates must contribute to the refinement of a model which involves motion sensitive temporal interpolation.
Without temporal interpolation, missing frames cannot be introduced into a low rate video sequence without first augmenting their spatial fidelity to a level commensurate with the frames already available, and this implies a large discontinuous jump in the amount of information which must be provided to the decoder in order to smoothly increase the reconstructed video quality. Continuing this line of argument, we see that motion information is important to highly scalable video compression; moreover, the motion itself must be represented in a manner which can be scaled according to the temporal resolution (frame rate), spatial resolution and quality of the sample data.
Motion Adaptive Transforms Based on Wavelet Lifting
The present invention is best appreciated in the context of an earlier invention, which is the subject of WO02/50772. This earlier patent application describes a method for modifying the individual lifting steps in a lifting implementation of a temporal wavelet decomposition, so as to compensate for the effects of motion. This method has the following advantageous properties: 1) the motion sensitive transform may be perfectly inverted, in the absence of any compression artefacts; 2) the low temporal resolution subsets of the wavelet hierarchy offer high spatial fidelity so that the transform allows excellent frame rate scalability; 3) the high pass temporal detail subbands produced by the transform have very low energy, allowing high compression efficiency; 4) in the absence of motion, the transform reduces to a regular wavelet decomposition along the temporal axis; and 5) in the presence of locally translational motion, the transform is equivalent to applying a regular wavelet decomposition along the motion trajectories.
To assist in the present discussion, we briefly summarise the key ideas behind this earlier invention. Any two-channel FIR subband transform can be described as a finite sequence of lifting steps [W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Applied and Computational Harmonic Analysis, vol 3, pp 186-200, April 1996]. It is instructive to begin with an example based upon the Haar wavelet transform. Up to a scale factor, this transform may be realised in the temporal domain, through a sequence of two lifting steps, as
$$h_k[n] = x_{2k+1}[n] - x_{2k}[n]$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{2}\, h_k[n]$$
where $x_k[n] \equiv x_k[n_1, n_2]$ denotes the samples of frame $k$ from the original video sequence and $h_k[n] \equiv h_k[n_1, n_2]$ and $l_k[n] \equiv l_k[n_1, n_2]$ denote the high-pass and low-pass subband frames.
$l_k[n]$ and $h_k[n]$ correspond to the scaled sum and the difference of each original pair of frames. An example is shown in FIG. 1A. Since motion is ignored, ghosting artefacts are clearly visible in the low-pass temporal subband, and the high-pass subband frame has substantial energy.
Now let $W_{k_1 \to k_2}$ denote a motion-compensated mapping of frame $k_1$ onto the coordinate system of frame $k_2$, so that $W_{k_1 \to k_2}(x_{k_1})[n] \approx x_{k_2}[n]$ for all $n$. The lifting steps are modified as follows.
$$h_k[n] = x_{2k+1}[n] - W_{2k \to 2k+1}(x_{2k})[n] \tag{1}$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{2}\, W_{2k+1 \to 2k}(h_k)[n] \tag{2}$$
Note that $W_{2k \to 2k+1}$ and $W_{2k+1 \to 2k}$ represent forward and backward motion mappings, respectively. The high-pass subband frames correspond to motion-compensated residuals. These will be close to zero in regions where the motion is accurately modelled. The result is shown in FIG. 1B.
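The motion-compensated Haar lifting steps (1) and (2), and their inversion, can be sketched as follows. This is a minimal illustration only, not the implementation of the earlier invention: the function names are hypothetical, and the motion mapping W is replaced by a global circular shift (`np.roll`), which is an assumption chosen because it makes the forward and backward mappings trivial to write down. The inverse simply undoes the lifting steps in reverse order, and it does so for any choice of mapping.

```python
import numpy as np

def warp(frame, shift):
    """Hypothetical stand-in for W: circular shift by (dy, dx) pixels."""
    return np.roll(frame, shift, axis=(0, 1))

def haar_mc_forward(x_even, x_odd, shift):
    """One motion-compensated Haar step on a frame pair (x_2k, x_2k+1)."""
    back = (-shift[0], -shift[1])
    # Step (1): residual of the odd frame after predicting it from the
    # forward-warped even frame.
    h = x_odd - warp(x_even, shift)
    # Step (2): update the even frame with half the backward-warped residual.
    l = x_even + 0.5 * warp(h, back)
    return l, h

def haar_mc_inverse(l, h, shift):
    """Undo the two lifting steps in reverse order."""
    back = (-shift[0], -shift[1])
    x_even = l - 0.5 * warp(h, back)
    x_odd = h + warp(x_even, shift)
    return x_even, x_odd

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))
shift = (1, 2)            # assumed global translation between the two frames
x1 = warp(x0, shift)      # odd frame is an exactly translated even frame

l, h = haar_mc_forward(x0, x1, shift)
r0, r1 = haar_mc_inverse(l, h, shift)
```

With this purely translational test pair the residual `h` vanishes and `l` equals the even frame, which mirrors the statement that the high-pass frames are close to zero where the motion is accurately modelled.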
The framework described above is readily extended to any two-channel FIR subband transform, by motion-compensating the relevant lifting steps.
We demonstrate this in the important case of the biorthogonal 5/3 wavelet transform [D. Le Gall and A. Tabatabai, “Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp 761-764, April 1988]. As before, $x_{2k}[n]$ and $x_{2k+1}[n]$ denote the even and odd indexed frames from the original sequence. Without motion, the 5/3 transform may be implemented by alternately updating each of these two frame sub-sequences, based on filtered versions of the other sub-sequence. The lifting steps are
$$h_k[n] = x_{2k+1}[n] - \tfrac{1}{2}\left( x_{2k}[n] + x_{2k+2}[n] \right)$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{4}\left( h_{k-1}[n] + h_k[n] \right)$$
As before, we introduce motion warping operators within each lifting step, which yields the following
$$h_k[n] = x_{2k+1}[n] - \tfrac{1}{2}\left( W_{2k \to 2k+1}(x_{2k})[n] + W_{2k+2 \to 2k+1}(x_{2k+2})[n] \right) \tag{3}$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{4}\left( W_{2k-1 \to 2k}(h_{k-1})[n] + W_{2k+1 \to 2k}(h_k)[n] \right) \tag{4}$$
FIG. 2 demonstrates the effect of these modified lifting steps. The high-pass frames are now essentially the residual from a bidirectional motion compensated prediction of the odd-indexed original frames. When the motion is adequately captured, these high-pass frames have little energy and the low-pass frames have excellent spatial fidelity.
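The perfect invertibility noted earlier follows directly from the lifting structure: each step adds to one frame sub-sequence a quantity computed only from the other, so the decoder can recompute that quantity and subtract it. As a sketch, writing the motion-compensated 5/3 steps (3) and (4) in reverse order gives, first for the even frames and then for the odd frames,

$$x_{2k}[n] = l_k[n] - \tfrac{1}{4}\left( W_{2k-1 \to 2k}(h_{k-1})[n] + W_{2k+1 \to 2k}(h_k)[n] \right)$$
$$x_{2k+1}[n] = h_k[n] + \tfrac{1}{2}\left( W_{2k \to 2k+1}(x_{2k})[n] + W_{2k+2 \to 2k+1}(x_{2k+2})[n] \right)$$

No inverse of the mapping operators themselves is needed; the decoder applies the same operators as the encoder, which is why the result holds for any motion model in the absence of compression artefacts.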
Counting the Cost of Motion
In the example of the Haar transform, given above, two separate motion mapping operators, $W_{2k \to 2k+1}$ and $W_{2k+1 \to 2k}$, are required to process every pair of frames, $x_{2k}[n]$ and $x_{2k+1}[n]$. Their respective motion parameters must be transmitted to the decoder. To provide a larger number of temporal resolution levels, the transform is re-applied to the low-pass subband frames, $l_k[n]$, for which motion mapping operators $W_{4k \to 4k+2}$ and $W_{4k+2 \to 4k}$ are required for every four frames. Continuing in this way, an arbitrarily large number of temporal resolutions may be obtained, using
$$\frac{2}{2} + \frac{2}{4} + \frac{2}{8} + \cdots \approx 2$$
motion fields per original frame.
For the example of the 5/3 transform, also given above, four motion mapping operators, $W_{2k \to 2k+1}$, $W_{2k \to 2k-1}$, $W_{2k+1 \to 2k}$ and $W_{2k-1 \to 2k}$, are required for every pair of frames (indexed by $k$), for just one level of temporal decomposition. Continuing the transformation to an arbitrarily large number of temporal resolutions involves approximately 4 motion fields per original video frame.
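The motion field counts for both transforms come from the same geometric series. The short sketch below tabulates the partial sums for a finite number of temporal decomposition levels; the function name and the chosen level counts are illustrative assumptions, and boundary effects at the ends of the sequence are ignored.

```python
from fractions import Fraction

def motion_fields_per_frame(fields_per_pair, levels):
    """Sum fields_per_pair / 2**d over decomposition levels d = 1..levels.

    At level d the transform operates on pairs of frames that are 2**d
    original frames apart, so each level contributes fields_per_pair / 2**d
    motion fields per original frame.
    """
    return sum(Fraction(fields_per_pair, 2 ** d) for d in range(1, levels + 1))

level_counts = (1, 2, 3, 10)
haar = [motion_fields_per_frame(2, d) for d in level_counts]        # 2 operators per pair
five_three = [motion_fields_per_frame(4, d) for d in level_counts]  # 4 operators per pair

for d, a, b in zip(level_counts, haar, five_three):
    print(f"levels={d:2d}  Haar={float(a):.4f}  5/3={float(b):.4f}")
```

The partial sums stay below 2 for the Haar case and below 4 for the 5/3 case, approaching those limits as the number of levels grows.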
The cost of estimating, coding and transmitting the above motion fields can be substantial. Moreover, this cost may adversely affect the scalability of the entire compression scheme, since it is not immediately clear how to progressively refine the motion fields without destroying the subjective properties of the reconstructed video when the motion is represented with reduced accuracy.
The previous invention clearly reveals the fact that any number of motion modelling techniques are compatible with the motion adaptive lifting transform, and also recommends the use of continuously deformable motion models such as those associated with triangular or quadrilateral meshes (see, for example, Y. Nakaya and H. Harashima, “Motion compensation based on spatial transformations,” IEEE Trans. Circ. Syst. for Video Tech., Vol. 4, pp 339-367, June 1994). However, no particular solution is presented to the difficulties described above.