With the availability of high-performance personal computers and the popularity of broadband Internet connections, the demand for Internet-based video applications such as video conferencing, video messaging, and video-on-demand is rapidly increasing. To reduce transmission and storage costs, improved bit-rate compression/decompression (“codec”) systems are needed. Image, video, and audio signals are amenable to compression due to considerable statistical redundancy in the signals. Within a single image or a single video frame, there exists significant correlation among neighboring samples, giving rise to what is generally termed “spatial correlation”. Also, in moving images, such as full motion video, there is significant correlation among samples in different segments of time, such as successive frames. This correlation is generally referred to as “temporal correlation”. There is a need for an improved, cost-effective system and method that uses both spatial and temporal correlation to remove the redundancy in the video, achieving high compression in transmission while maintaining good to excellent image quality and adapting to changes in the available bandwidth of the transmission channel and to the limitations of the receiving resources of the clients.
One known technique for taking advantage of the limited variation between frames of a motion video is motion-compensated image coding. In such coding, the current frame is predicted from the previously encoded frame using motion estimation and compensation, and only the difference between the actual current frame and the predicted current frame is coded. By coding only the difference, or residual, rather than the image frame itself, it is possible to improve image quality, because the residual tends to have lower amplitude than the image and can thus be coded with greater accuracy. Motion estimation and compensation are discussed in Lim, J. S., Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 497-507 (1990). However, motion estimation and compensation techniques have high computational cost, which prohibits software-only applications on most personal computers.
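The prediction-and-residual idea described above can be sketched as follows. This is a minimal illustration only, assuming exhaustive block matching with a sum-of-absolute-differences cost. The function names, the 8×8 test frames, and the search range are hypothetical and are not taken from any cited codec or reference.

```python
# Hypothetical sketch of block-matching motion estimation (illustrative only).

def sad(prev, cur, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `prev` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total

def best_motion_vector(prev, cur, bx, by, n, search):
    """Exhaustively test every displacement in [-search, search] that keeps
    the reference block inside the previous frame; return (vector, cost)."""
    h, w = len(prev), len(prev[0])
    best, best_cost = (0, 0), sad(prev, cur, bx, by, 0, 0, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(prev, cur, bx, by, dx, dy, n)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

def residual_block(prev, cur, bx, by, mv, n):
    """Difference between the current block and its motion-compensated prediction."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx]
             for x in range(n)] for y in range(n)]

def make_frame(px, py):
    """8 x 8 dark frame with a bright 2 x 2 patch whose top-left corner is (px, py)."""
    frame = [[0] * 8 for _ in range(8)]
    for y in range(py, py + 2):
        for x in range(px, px + 2):
            frame[y][x] = 200
    return frame

prev_frame = make_frame(2, 2)   # patch at x = 2..3
cur_frame = make_frame(3, 2)    # same patch moved one pixel to the right

mv, cost = best_motion_vector(prev_frame, cur_frame, 3, 2, 2, 2)
residual = residual_block(prev_frame, cur_frame, 3, 2, mv, 2)
```

In this toy case the search recovers the one-pixel shift and the residual block is all zeros, so only the motion vector would need to be coded. The nested loops also suggest where the computational cost comes from: each block requires up to (2·search + 1)² cost evaluations, each touching n² samples.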
Further difficulties arise in the provision of a codec for an Internet streamer in that the bandwidth of the transmission channel is subject to change during transmission, and clients with varying receiver resources may also join or leave the network during transmission. Internet streaming applications require video encoding technologies with features such as low delay, low complexity, scalable representation, and error resilience for effective video communications. The current standards and the state-of-the-art video coding technologies are proving insufficient to provide these features. Some of the developed standards (MPEG-1, MPEG-2) target non-interactive streaming applications. Although the H.323 Recommendation targets interactive audiovisual conferencing over unreliable packet networks (such as the Internet), the applied H.26x video codecs do not support all the features demanded by Internet-based applications. Although new standards such as H.263+ and MPEG-4 have started to address some of these issues (scalability, error resilience, etc.), these standards are still far from complete enough to support a wide range of video applications effectively.
Known image compression techniques such as JPEG, MPEG, and P*64 use transform techniques such as the discrete cosine transform (DCT) to project the video samples onto appropriate basis functions and then encode the resulting coefficients. These transforms operate on a block of video data, such as 8×8 pixels for JPEG or MPEG, and therefore have a block constraint and fail to exploit interblock correlations. The discrete cosine transform and the related Fourier transform work under the assumption that the original time-domain signal is periodic in nature. Therefore, they have difficulty with signals having transient components, that is, signals that are localized in time; this is especially apparent when a signal has sharp transitions.
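The difficulty with sharp transitions can be illustrated with a one-dimensional, 8-sample block. The sketch below is an illustration under stated assumptions, not the transform stage of JPEG or MPEG: it uses a plain orthonormal DCT-II, and the sample values, the function names, and the threshold of 1.0 for counting a coefficient as significant are all hypothetical.

```python
import math

def dct_ii(block):
    """Orthonormal one-dimensional DCT-II of a list of samples."""
    n = len(block)
    coeffs = []
    for k in range(n):
        s = sum(block[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def significant(coeffs, threshold=1.0):
    """Count coefficients whose magnitude exceeds the (assumed) threshold."""
    return sum(1 for c in coeffs if abs(c) > threshold)

flat_block = [100] * 8                            # smooth region
edge_block = [0, 0, 0, 100, 100, 100, 100, 100]   # sharp transition inside the block

flat_count = significant(dct_ii(flat_block))
edge_count = significant(dct_ii(edge_block))
```

For the flat block only the DC coefficient is significant, whereas the block containing the edge needs several coefficients, because no single cosine basis function is localized around the transition.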
To overcome these problems, codecs can instead use basis functions called “wavelets” that are localized in both time and frequency. The wavelet representation is very suitable for non-stationary signals such as a sequence of video images having motion. The technique of compression by quantizing and encoding the wavelet coefficients relies on the assumption that details at high resolution are less visible to the eye and therefore can be eliminated or reconstructed with lower precision while still maintaining good to excellent visual quality. Thus, the wavelet coefficients are coded according to their location in frequency bands and their importance for the quality of the final reconstructed image. U.S. Pat. No. 6,091,777 to Guetz et al. and U.S. Pat. No. 6,272,180 to Lei provide examples of codecs that use wavelet transformations. However, the codecs taught by Guetz et al. and Lei use wavelet transformations in only two dimensions, applied to only one frame at a time.
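The principle of coding high-resolution detail with lower precision can be sketched with a single level of the Haar wavelet, chosen here only because it is the simplest wavelet; it is not asserted to be the filter used by any of the cited codecs. The sample values, function names, and quantization step are hypothetical.

```python
def haar_forward(signal):
    """One decomposition level: pairwise averages (low band) and
    pairwise half-differences (high band). Length must be even."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the samples from the low and high bands."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

def quantize(values, step):
    """Uniform quantizer: snap each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in values]

samples = [10, 12, 11, 13, 50, 52, 51, 49]
approx, detail = haar_forward(samples)

# Without quantization the transform is exactly invertible.
lossless = haar_inverse(approx, detail)

# Coarsely quantizing only the high band discards fine detail.
coarse_detail = quantize(detail, 8)
lossy = haar_inverse(approx, coarse_detail)
max_error = max(abs(a - b) for a, b in zip(samples, lossy))
```

Here the coarse quantizer zeroes every high-band coefficient, so only the four low-band values would need to be coded, while each reconstructed sample differs from the original by at most one gray level. The large jump between the fourth and fifth samples is preserved because it is carried by the low band.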
Due to the very heterogeneous networking and computing infrastructure, highly scalable video coding algorithms are required. A video codec should provide reasonable quality to low-performance personal computers connected via a dial-up modem or a wireless connection, and high quality to high-performance computers connected via a T1 line. Thus, the compression algorithm is expected to scale well in terms of both computational cost and bandwidth requirement.
The Real-time Transport Protocol (RTP) is most commonly used to carry time-sensitive multimedia traffic over the Internet. Since RTP is typically carried over the unreliable User Datagram Protocol (UDP), the coding algorithm must be able to handle packet losses effectively. Furthermore, due to the low-delay requirements of interactive applications and the requirements of multicast transmission, the retransmission method widely deployed over the Internet cannot be used. Thus, the video codec should provide a high degree of resilience against network and transmission errors in order to minimize the impact on visual quality.
The computational complexity of the encoding and decoding process must be low in order to provide a reasonable frame rate and quality on low-performance computers (PDAs, hand-held computers, etc.) and a high frame rate and quality on average personal computers. As mentioned, the widely applied motion estimation and motion compensation techniques have high computational cost, which prohibits software-only applications on most personal computers.