The present invention relates to data processing and, more particularly, to data compression, for example as applied to still and video images, speech and music. A major objective of the present invention is to enhance collaborative video applications over heterogeneous networks of inexpensive general purpose computers.
As computers are becoming vehicles of human interaction, the demand is rising for the interaction to be more immediate and complete. Where text-based e-mail and database services once predominated on local networks and on the Internet, the effort is now on to provide data-intensive services such as collaborative video applications, e.g., video conferencing and interactive video.
In most cases, the raw data requirements for such applications far exceed available bandwidth, so data compression is necessary to meet the demand. Effectiveness is a goal of any image compression scheme. Speed is a requirement imposed by collaborative applications to provide an immediacy to interaction. Scalability is a requirement imposed by the heterogeneity of networks and computers.
Effectiveness can be measured in terms of the amount of distortion resulting for a given degree of compression. The distortion can be expressed in terms of the square of the difference between corresponding pixels averaged over the image, i.e., mean square error (less is better). The mean square error can be: 1) weighted, for example, to take variations in perceptual sensitivity into account; or 2) unweighted.
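By way of illustration only, the distortion measure described above can be sketched as follows. The function name `mean_square_error`, the flat-list image representation, and the choice to normalize the weighted form by the sum of the weights are assumptions made for this sketch and are not taken from the text.

```python
def mean_square_error(original, decoded, weights=None):
    """Mean square error between two equally sized lists of pixel values.

    With weights=None every pixel counts equally (the unweighted case).
    Otherwise each squared difference is scaled by its weight, and the sum
    is divided by the total weight (a normalization assumed for this sketch).
    """
    if len(original) != len(decoded):
        raise ValueError("images must contain the same number of pixels")
    if weights is None:
        weights = [1.0] * len(original)
    total_weight = sum(weights)
    weighted_sum = sum(
        w * (a - b) ** 2 for w, a, b in zip(weights, original, decoded)
    )
    return weighted_sum / total_weight
```

For example, `mean_square_error([10, 20, 30], [12, 20, 27])` averages the squared differences 4, 0, and 9 over three pixels, while passing `weights=[1, 0, 1]` ignores the middle pixel entirely.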
The extent of compression can be measured either as a compression ratio or a bit rate. The compression ratio (more is better) is the number of bits of an input value divided by the number of bits in the expression of that value in the compressed code (averaged over a large number of input values if the code is variable length). The bit rate is the number of bits of compressed code required to represent an input value. Compression effectiveness can be characterized by a plot of distortion as a function of bit rate.
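The two measures of compression can be sketched as below. This is a minimal illustration: the function names and the representation of the compressed code as a list of per-value code lengths are hypothetical.

```python
def bit_rate(code_lengths):
    """Average number of compressed bits needed per input value."""
    return sum(code_lengths) / len(code_lengths)


def compression_ratio(input_bits, code_lengths):
    """Bits per input value divided by the average bits per coded value."""
    return input_bits / bit_rate(code_lengths)
```

With 8-bit input values and a variable-length code whose code words for four inputs are 2, 2, 4, and 8 bits long, the bit rate is 4 bits per value and the compression ratio is 2.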
Ideally, there would be zero distortion, and there are lossless compression techniques that achieve this. However, lossless compression techniques tend to be limited to compression ratios of about 2, whereas compression ratios of 20 to 500 are desired for collaborative video applications. Lossy compression techniques always result in some distortion. However, the distortion can be acceptable, even imperceptible, while much greater compression is achieved.
Collaborative video is desired for communication between general purpose computers over heterogeneous networks, including analog phone lines, digital phone lines, and local-area networks. Encoding and decoding are often computationally intensive and thus can introduce latencies or bottlenecks in the data stream. Often dedicated hardware is required to accelerate encoding and decoding. However, requiring dedicated hardware greatly reduces the market for collaborative video applications. For collaborative video, fast, software-based compression would be highly desirable.
Heterogeneous networks of general purpose computers present a wide range of channel capacities and decoding capabilities. One approach would be to compress image data more than once and to different degrees for the different channels and computers. However, this is burdensome on the encoding end and provides no flexibility for different computing power on the receiving end. A better solution is to compress image data into a low-compression/low-distortion code that is readily scalable to greater compression at the expense of greater distortion.
State-of-the-art compression schemes have been promulgated as standards by the international Moving Picture Experts Group (MPEG); the current standards are MPEG-1 and MPEG-2. These standards are well suited for applications involving playback of video encoded off-line. For example, they are well suited to playback of CD-ROM and DVD disks. However, compression effectiveness is non-optimal, encoding requirements are excessive, and scalability is too limited. These limitations can be better understood with the following explanation.
Most compression schemes operate on digital images that are expressed as a two-dimensional array of picture elements (pixels) each with one (as in a monochrome or gray-scale image) or more (as in a color image) values assigned to each pixel. Commonly, a color image is treated as a superposition of three independent monochrome images for purposes of compression.
The lossy compression techniques practically required for video compression generally involve quantization applied to monochrome (gray-scale or color component) images. In quantization, a high-precision image description is converted to a low-precision image description, typically through a many-to-one mapping. Quantization techniques can be divided into scalar quantization (SQ) techniques and vector quantization (VQ) techniques. While scalars can be considered one-dimensional vectors, there are important qualitative distinctions between the two quantization techniques.
Vector quantization can be used to process an image in blocks, which are represented as vectors in an n-dimensional space. In most monochrome photographic images, adjacent pixels are likely to be close in intensity. Vector quantization can take advantage of this fact by assigning more representative vectors to regions of the n-dimensional space in which adjacent pixels are close in intensity than to regions of the n-dimensional space in which adjacent pixels are very different in intensity. In a comparable scalar quantization scheme, each pixel would be compressed independently; no advantage is taken of the correlations between adjacent pixels. While scalar quantization techniques can be modified at the expense of additional computations to take advantage of correlations, comparable modifications can be applied to vector quantization. Overall, vector quantization provides for more effective compression than does scalar quantization.
Another difference between vector and scalar quantization is how the representative values or vectors are represented in the compressed data. In scalar quantization, the compressed data can include reduced precision expressions of the representative values. Such a representation can be readily scaled simply by removing one or more least-significant bits from the representative value. In more sophisticated scalar quantization techniques, the representative values are represented by indices; however, scaling can still take advantage of the fact that the representative values have a given order in a metric dimension. In vector quantization, representative vectors are distributed in an n-dimensional space. Where n>1, there is no natural order to the representative vectors. Accordingly, they are assigned effectively arbitrary indices. There is no simple and effective way to manipulate these indices to make the compression scalable.
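The scaling of a reduced-precision scalar code by removal of least-significant bits can be illustrated with the following sketch. The function names and the midpoint reconstruction rule are assumptions introduced here for illustration only.

```python
def drop_lsbs(value, bits_dropped):
    """Scalar-quantize a non-negative integer by discarding low-order bits."""
    return value >> bits_dropped


def reconstruct(code, bits_dropped):
    """Map a reduced-precision code back to the middle of the interval it covers.

    Midpoint reconstruction is an assumption of this sketch.
    """
    if bits_dropped == 0:
        return code
    return (code << bits_dropped) + (1 << (bits_dropped - 1))
```

An 8-bit value of 183 reduced by four bits becomes the 4-bit code 11, which reconstructs to 184. A receiver with less capacity can drop two further bits from that code without access to the original value: `drop_lsbs(11, 2)` gives the same result as `drop_lsbs(183, 6)`. No comparable operation exists for arbitrarily assigned vector indices.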
The final distinction between vector and scalar quantization is more quantitative than qualitative. The computations required for quantization scale dramatically (more than linearly) with the number of pixels involved in a computation. In scalar quantization, one pixel is processed at a time. In vector quantization, plural pixels are processed at once. In the case of popular 4×4 and 8×8 block sizes, the number of pixels processed at once becomes 16 and 64, respectively. To achieve minimal distortion, "full-search" vector quantization computes the distances in an n-dimensional space of an image vector from each representative vector. Accordingly, vector quantization tends to be much slower than scalar quantization and, therefore, limited to off-line compression applications.
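A full-search encoder of the kind described above can be sketched as follows. The names `squared_distance` and `full_search_encode`, the use of squared Euclidean distance, and the toy four-entry codebook are assumptions of this sketch.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def full_search_encode(block, codebook):
    """Return the index of the codebook vector closest to the block.

    Every codebook vector is examined, so the cost per block grows with
    both the codebook size and the number of pixels in the block.
    """
    best_index = 0
    best_distance = squared_distance(block, codebook[0])
    for index in range(1, len(codebook)):
        distance = squared_distance(block, codebook[index])
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


# Hypothetical codebook for 4-pixel blocks (flattened 2x2 blocks).
TOY_CODEBOOK = [
    (0, 0, 0, 0),
    (255, 255, 255, 255),
    (0, 0, 255, 255),
    (255, 255, 0, 0),
]
```

Here `full_search_encode((10, 5, 250, 240), TOY_CODEBOOK)` selects index 2 after computing four distances of four terms each; a 64-pixel block with a codebook of several hundred vectors requires tens of thousands of multiplications per block.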
Because of its greater effectiveness, considerable effort has been directed to accelerating vector quantization by eliminating some of the computations required. There are structured alternatives to "full-search" VQ that reduce the number of computations required per input block at the expense of a small increase in distortion. Structured VQ techniques perform comparisons in an ordered manner so as to exclude apparently unnecessary comparisons. All such techniques involve some risk that the closest codebook point will not be found. However, the risk is not large and the consequence typically is that a second closest point is selected when the first closest point is not. While the net distortion is larger than with full-search VQ, it is typically smaller than with scalar quantization performed on each dimension separately.
In "tree-structured" VQ, comparisons are performed in pairs. For example, the first two measurements can involve codebook points in symmetrical positions in the upper and the lower halves of a vector space. If an image input vector is closer to the upper codebook point, no further comparisons with codebook points in the lower half of the space are performed. Tree-structured VQ works best when the codebook has certain symmetries. However, requiring these symmetries reduces the flexibility of codebook design so that the resulting codebook is not optimal for minimizing distortion. Furthermore, while reduced, the computations required by tree-structured VQ can be excessive for collaborative video applications.
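The pairwise descent can be sketched as below. The `TreeNode` class, the `tree_encode` function, and the two-level toy tree are hypothetical and serve only to illustrate the order of comparisons; they are not a codebook design drawn from the text.

```python
class TreeNode:
    """Node of a binary search tree over codebook space.

    Internal nodes carry a test vector and two children; leaves carry the
    index of a codebook vector.
    """

    def __init__(self, vector, children=(), index=None):
        self.vector = vector
        self.children = tuple(children)
        self.index = index


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def tree_encode(block, root):
    """Descend the tree, comparing the block with two vectors per level.

    Returns the leaf index reached and the number of distance computations.
    """
    node = root
    comparisons = 0
    while node.children:
        first, second = node.children
        comparisons += 2
        if squared_distance(block, first.vector) <= squared_distance(block, second.vector):
            node = first
        else:
            node = second
    return node.index, comparisons


# Toy tree for 2-pixel blocks: the space is split into a low and a high half,
# and each half is split again into two codebook vectors.
TOY_TREE = TreeNode(None, [
    TreeNode((64, 64), [
        TreeNode((32, 32), index=0),
        TreeNode((96, 96), index=1),
    ]),
    TreeNode((192, 192), [
        TreeNode((160, 160), index=2),
        TreeNode((224, 224), index=3),
    ]),
])
```

Once the block `(100, 90)` is found to be closer to the low-half test vector, the two codebook vectors in the high half are never examined. For a balanced tree over N codebook vectors the number of distance computations is about 2·log2(N) rather than N, which is where the saving over full search arises for large codebooks.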
In table-based vector quantization (TBVQ), the assignment of all possible blocks to codebook vectors is pre-computed and represented in a lookup table. No computations are required during image compression. However, in the case of 4×4 blocks of pixels, with eight bits allotted to characterize each pixel, the number of table addresses would be 256^16, which is clearly impractical. Hierarchical table-based vector quantization (HTBVQ) separates a vector quantization table into stages; this effectively reduces the memory requirements, but at a cost of additional distortion.
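The table-based approach and its memory cost can be illustrated with a deliberately small case. In the sketch below, the function names, the 2-pixel block with 4-bit pixels, and the four-entry codebook are all assumptions chosen so that the complete table fits in 256 entries.

```python
from itertools import product


def table_addresses(pixels_per_block, bits_per_pixel):
    """Number of distinct blocks, i.e. entries a complete lookup table needs."""
    return (2 ** bits_per_pixel) ** pixels_per_block


def nearest_index(block, codebook):
    """Off-line full search used only while the table is being built."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(block, codebook[i])),
    )


def build_table(codebook, pixels_per_block, bits_per_pixel):
    """Pre-compute the codebook index for every possible block."""
    levels = range(2 ** bits_per_pixel)
    return {
        block: nearest_index(block, codebook)
        for block in product(levels, repeat=pixels_per_block)
    }


# Hypothetical toy case: 2-pixel blocks of 4-bit pixels.
TOY_CODEBOOK = [(2, 2), (2, 13), (13, 2), (13, 13)]
TOY_TABLE = build_table(TOY_CODEBOOK, pixels_per_block=2, bits_per_pixel=4)


def table_encode(block):
    """Encoding is a single lookup; no distances are computed at this point."""
    return TOY_TABLE[tuple(block)]
```

The toy table has `table_addresses(2, 4)`, i.e. 256, entries. The same function applied to a 4×4 block of 8-bit pixels, `table_addresses(16, 8)`, gives 256^16 = 2^128 entries, which is why a single-stage table is not practical at that block size.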
Further, it is well known that the pixel space in which images are originally expressed is often not the best for vector quantization. Vector quantization is most effective when the dimensions differ in perceptual significance. However, in pixel space, the perceptual significance of the dimensions (which merely represent different pixel positions in a block) does not vary. Accordingly, vector quantization is typically preceded by a transform such as a wavelet transform. Thus, the value of eliminating computations during vector quantization is impaired if computations are required for transformation prior to quantization. While some work has been done integrating a wavelet transform into an HTBVQ table, the resulting effectiveness has not been satisfactory.
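The effect of a transform on the significance of the dimensions can be illustrated with a one-dimensional Haar wavelet transform. The choice of the Haar wavelet, the unnormalized averaging form, and the function names are assumptions of this sketch; the text does not specify a particular wavelet.

```python
def haar_step(values):
    """One level of the (unnormalized) Haar transform on an even-length list.

    Returns the pairwise averages and the pairwise half-differences.
    """
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details


def haar_transform(values):
    """Full Haar decomposition; the input length must be a power of two.

    The output lists the overall average first, then the detail
    coefficients ordered from coarsest to finest.
    """
    current = list(values)
    all_details = []
    while len(current) > 1:
        current, details = haar_step(current)
        all_details = details + all_details
    return current + all_details
```

For a row of four similar pixels, `haar_transform([100, 102, 98, 100])` concentrates nearly all of the signal in the first coefficient (the average, 100) and leaves detail coefficients of magnitude 1. The transformed dimensions therefore differ in significance, unlike the four pixel positions, but each transform costs additions and divisions that must be performed before quantization.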
It is recognized that hardware accelerators can be used to improve the encoding rate of data compression systems. However, this solution is expensive. More importantly, it is awkward from a distribution standpoint. On the Internet, images and Web pages are presented in many different formats, each requiring its own viewer or "browser". To reach the largest possible audience without relying on a lowest common denominator viewing technology, image providers can download viewing applications to prospective consumers. Obviously, this download distribution system would not be applicable for hardware-based encoders. If encoders for collaborative video are to be downloadable, they must be fast enough for real-time operation in software implementations. Where the applications involve collaborative video over heterogeneous networks of general purpose computers, there is still a need for a downloadable compression scheme that provides a more optimal combination of effectiveness, speed, and scalability.