The present invention relates to a system for modifying an image displayed on a display device.
A digitized image is a two-dimensional array of picture elements or pixels. The quality of the image is a function of its resolution, which is measured as the number of horizontal and vertical pixels per unit length. For example, in a 640 by 480 display, a video frame consists of over 300,000 pixels, each of which may be defined by one of 16.7 million colors (24-bit). Such an exemplary display typically includes approximately a million bytes of data to represent an image in uncompressed form.
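The figures in the preceding paragraph can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative only, and the 60 frames per second rate is the full frame video rate referred to in this section.

```python
# Uncompressed size of one 640 by 480 frame at 24 bits per pixel.
WIDTH, HEIGHT = 640, 480
BITS_PER_PIXEL = 24
FRAMES_PER_SECOND = 60  # full frame video rate referred to in this section

pixels = WIDTH * HEIGHT                            # pixels per frame
colors = 2 ** BITS_PER_PIXEL                       # distinct colors per pixel
frame_bytes = pixels * BITS_PER_PIXEL // 8         # bytes per uncompressed frame
bytes_per_second = frame_bytes * FRAMES_PER_SECOND # uncompressed video data rate

print(pixels, colors, frame_bytes, bytes_per_second)
```

One frame therefore occupies 921,600 bytes, which is the "approximately a million bytes" cited above.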
Because of the potentially large amounts of data in each image, it is generally preferable to use an encoding methodology suitable for encoding the uncompressed image data into a compressed form containing fewer bytes. Encoded image data is generally preferable for transmission across a computer network to a display or storage device. The computer network may be, for example, the interconnection between a storage device and an attached display device in a desktop computer, or a data line interconnecting distant computers. In either case, it is desirable to minimize the number of bytes of data being transmitted across a computer network because low bandwidth networks may not be capable of transmitting a sufficient number of bytes of uncompressed image data fast enough to display images at full frame video rates (60 frames per second). Further, for systems capable of transmitting uncompressed image data fast enough to display images at full frame video rates, it is desirable to free up bandwidth for other signals transmitted through high bandwidth networks.
Images exhibit a high level of pixel-to-pixel correlation which permits mathematical techniques, such as a spatial Fourier transform of the image data, to reduce the number of bytes required to represent the image. By using the spatial Fourier transform, the number of bytes of data is primarily reduced by eliminating high frequency information to which the human eye is not very sensitive. In addition, since the human eye is significantly more sensitive to black and white detail than to color detail, some color information in a picture may be eliminated without significantly degrading picture quality.
There are numerous encoding methods, otherwise referred to as standards, currently being used to encode video images that reduce the number of bytes required to be transmitted across computer networks while simultaneously maintaining image quality.
The H.261 standard is suitable for encoding image data of moving images, such as video, for transmission across computer networks. The H.261 standard is formally known as “Digital Processing of Video Signals - Video Coder/Decoder for Audiovisual Services at 56 to 1536 kbit/s,” American National Standards (ANSI) T1.314.1991, and is incorporated herein by reference. ITU-T Recommendation H.263, also incorporated herein by reference, is a similar standard for video coding for low bitrate communication.
Referring to FIG. 1, an H.261 source coder 10 receives digital video 11 in the form of a plurality of nonoverlapping 16×16 pixel blocks at a comparator block 24. Each 16×16 pixel block of digital video is then further divided into four nonoverlapping 8×8 pixel blocks for calculations.
The source coder 10 has two operational modes. The first operational mode, the intraframe mode, primarily involves a discrete cosine transform (DCT) block 12 transforming each 8×8 pixel block of data to a set of spatial frequency coefficients. The output of the transform block 12 normally is an 8×8 block of data primarily consisting of small numbers and a few large numbers. The spatial frequency coefficients from transform block 12 are inputs to a quantizer block 14 that quantizes the spatial frequency coefficients using a single quantization factor (number). In effect, the quantizer block 14 rounds each spatial frequency coefficient to the nearest multiple of the quantization factor and divides the rounded spatial frequency coefficient by the quantization factor, so that each original spatial frequency coefficient is replaced by an integer number of multiples of the quantization factor. The multiple of the quantization factor for each spatial frequency coefficient is transmitted across a computer network to the decoder (not shown).
The output from the DCT block 12 and quantizer block 14 for large spatial frequency coefficients, which tend to be primarily the lower frequency signal components, is the transmission of a small number representative of the number of multiples of the quantization number. The small spatial frequency coefficients, which tend to be primarily the higher frequency signal components, are normally rounded to zero and thus the quantization multiple is zero. The source coder 10 does not transmit zeros to the decoder. In this manner the number of bytes that need to be transmitted across the computer network to represent an image is significantly reduced.
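The intraframe steps described above (transform, then quantization with a single factor, then omission of zeros) can be sketched as follows. This is a minimal illustration under stated assumptions, not the H.261 reference arithmetic: the orthonormal DCT-II scaling, the test block, and the quantization factor of 10 are all assumptions chosen for the example, and the function names are hypothetical.

```python
import math

N = 8  # block size used by the source coder


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, q):
    """Replace each coefficient by its nearest integer multiple of q."""
    return [[int(round(value / q)) for value in row] for row in coeffs]


# Assumed test data: a smooth block whose brightness rises gently left to right.
block = [[100 + 2 * y for y in range(N)] for _ in range(N)]
levels = quantize(dct_2d(block), q=10)

# Only nonzero multiples would need to be sent to the decoder.
nonzero = [(u, v, levels[u][v])
           for u in range(N) for v in range(N) if levels[u][v] != 0]
print(len(nonzero), "of", N * N, "coefficients are nonzero")
```

For this smooth block only the lowest frequency terms survive quantization, which is the source of the byte reduction described above.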
The second operational mode, the interframe mode, uses a memory characteristic for motion compensation of a slightly moved picture. Each 8×8 set of values from the quantizer block 14 is dequantized by an inverse quantizer block 16 and inverse transformed by an inverse DCT block 18 to obtain an 8×8 block of data that is similar to the original input to the source coder 10. The picture memory block 20 maintains the 8×8 pixel block of unencoded data until the next 8×8 pixel block representative of the same location in the image is processed by the source coder 10. A filter block 22 removes some undesirable artifacts, if desired. The comparator block 24 compares the current 8×8 pixel block against the previous 8×8 pixel block stored in the memory block 20 for the same location of the image.
There are three possible outputs from the comparator 24. First, if the current and previous pixel blocks are the same, then no image data needs to be transmitted to the decoder. Second, if the current and previous pixel blocks are similar, then only the differences need to be transmitted to the decoder. Third, if the current and previous pixel blocks are considerably different, then intraframe mode is used to compute the current 8×8 pixel block. For color images, the source coder 10 uses luminance and two color difference components (Y, CB and CR). The H.261 standard requires that the quantization factor be a single constant number.
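The three comparator outcomes can be sketched as a simple decision function. This is an illustrative sketch only: the text does not say how "similar" is measured, so the mean absolute difference and the threshold value used here are assumptions, as are the function name and the returned labels.

```python
def classify_block(current, previous, similar_threshold=10.0):
    """Choose how a block is coded relative to the co-located previous block.

    Returns a (mode, residual) pair where mode is "skip", "inter", or "intra".
    similar_threshold is a hypothetical cutoff on mean absolute difference.
    """
    residual = [[c - p for c, p in zip(cur_row, prev_row)]
                for cur_row, prev_row in zip(current, previous)]
    flat = [d for row in residual for d in row]

    if all(d == 0 for d in flat):
        return "skip", None            # identical: nothing is transmitted
    mean_abs_diff = sum(abs(d) for d in flat) / len(flat)
    if mean_abs_diff <= similar_threshold:
        return "inter", residual       # similar: only the differences are sent
    return "intra", None               # very different: code the block afresh


previous = [[100] * 8 for _ in range(8)]
print(classify_block(previous, previous)[0])
print(classify_block([[102] * 8 for _ in range(8)], previous)[0])
print(classify_block([[200] * 8 for _ in range(8)], previous)[0])
```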
The control coding block 26 directs the operation of the source coder 10. The outputs of the source coder 10 are as follows:
Line 30a Flag for INTRA/INTER
Line 30b Flag for transmitted or not
Line 30c Quantizer indication
Line 30d Quantizing index for transform coefficients
Line 30e Motion vector
Line 30f Switching on/off of the loop filter
“Classified Perceptual Coding With Adaptive Quantization”, IEEE Transactions On Circuits and Systems for Video Technology, Vol. 6, No. 4, August 1996, describes a method similar to the H.261 standard. This method uses a quantization matrix between the output of the DCT block 12 and the input of the quantizer block 14 to allow selected coefficients, such as high frequency coefficients, to be selectively weighted. For example, selected high frequency coefficients could be adjusted to zero so that they are not transmitted to the decoder.
Moving Picture Experts Group 1 (MPEG-1) is another standard suitable for the transmission of moving images across computer networks. MPEG-1, formally known as “Information Technology - Coding Of Moving Pictures and Associated Audio For Digital Storage Media Up To About 1.5 Mbit/s”, ISO/IEC 11172-2, is herein incorporated by reference. Moving Picture Experts Group 2 (MPEG-2) is yet another standard suitable for the transmission of moving images across computer networks. MPEG-2, formally known as “Information Technology - Generic Coding Of Moving Pictures and Associated Audio Information: Video”, ISO/IEC 13818-2, is also incorporated herein by reference. The MPEG-1 and MPEG-2 standards include a matrix of quantizer values that allows the selection of the quantization factor for each value within a pixel block of data to accommodate the variation in sensitivity of the human visual system to different spatial frequencies. Using the quantization matrix permits finer control over the quantization of the spatial frequency coefficients than the single quantization factor used in the H.261 and H.263 standards.
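The difference between a single quantization factor and a quantization matrix can be illustrated as follows. This is a simplified sketch, not the normative MPEG arithmetic: the matrix below, whose step size grows with spatial frequency, and the sample coefficient values are assumptions made up for the example.

```python
N = 8


def quantize_uniform(coeffs, q):
    """Single quantization factor: every coefficient uses the same step q."""
    return [[int(round(c / q)) for c in row] for row in coeffs]


def quantize_with_matrix(coeffs, qmatrix):
    """Quantization matrix: each coefficient position has its own step."""
    return [[int(round(c / step)) for c, step in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qmatrix)]


# Hypothetical matrix: coarser steps at higher horizontal/vertical frequency.
qmatrix = [[8 + 4 * (u + v) for v in range(N)] for u in range(N)]

# Hypothetical coefficients: a large DC term and one high frequency term.
coeffs = [[0.0] * N for _ in range(N)]
coeffs[0][0] = 400.0
coeffs[7][7] = 30.0

uniform = quantize_uniform(coeffs, q=8)
weighted = quantize_with_matrix(coeffs, qmatrix)
print(uniform[0][0], uniform[7][7])    # DC and high frequency, single factor
print(weighted[0][0], weighted[7][7])  # DC and high frequency, matrix
```

With these assumed values both schemes preserve the DC term equally, while the matrix rounds the high frequency term to zero, so it would not be transmitted.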
Johnston et al., U.S. Pat. No. 5,136,377, incorporated by reference herein, teach an image compression system, particularly suitable for high definition television, that is optimized to minimize the transmission bitrate while maintaining high visual quality for television. The techniques used involve variable pixel sized blocks, variable quantization error, determination of the frequency of peak visibility, thresholds for textured inputs, directionality, and temporal masking.
More specifically, Johnston et al. describe a DCT-based video image compression system that uses variable sized pixel blocks. Johnston et al. teach that the human visual system has a greater response to lower frequency components than to higher frequency components of an image. In fact, the relative visibility as a function of frequency starts at a reasonably good level at low frequencies, increases with frequency up to a peak at some frequency, and thereafter drops with increasing frequency to below the relative visibility at low frequencies. Accordingly, more quantization error can be inserted at high frequencies than at low frequencies while still maintaining a good image. In addition, Johnston et al. teach that the absolute frequency at which the peak visibility occurs depends on the size of the screen and the viewing distance.
Johnston et al. also describe the use of thresholds for textured inputs. Texture is defined as the amount of AC energy at a given location, weighted by the visibility of that energy. The human visual system is very sensitive to distortion along the edges of an object in an image, but is much less sensitive to distortion across edges. Johnston et al. account for this phenomenon by introducing the concept of directionality as a component.
Further, Johnston et al. account for a phenomenon known as temporal masking. When there is a large change in the content of an image between two frames at a fixed location in the scene, the human visual system is less sensitive to high frequency details at that location in the latter frame. By detecting the occurrence of large temporal differences, the perceptual thresholds at these locations can be increased for the current frame. This results in decreasing the number of bytes that need to be transmitted for a portion of the image.
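The dependence of the peak visibility frequency on viewing distance follows from simple geometry: a frequency fixed in cycles per degree of visual angle corresponds to a different frequency in cycles per pixel as the viewer moves. The sketch below illustrates this under stated assumptions. The peak of 4 cycles per degree, the 0.25 mm pixel pitch, and the two viewing distances are hypothetical values chosen for the example, not figures from Johnston et al.

```python
import math


def pixels_per_degree(viewing_distance_mm, pixel_pitch_mm):
    """Number of display pixels spanned by one degree of visual angle."""
    span_mm = 2.0 * viewing_distance_mm * math.tan(math.radians(0.5))
    return span_mm / pixel_pitch_mm


def cycles_per_pixel(cycles_per_degree, viewing_distance_mm, pixel_pitch_mm):
    """Convert a visual frequency to a frequency on the pixel grid."""
    return cycles_per_degree / pixels_per_degree(viewing_distance_mm,
                                                 pixel_pitch_mm)


PEAK_CPD = 4.0        # assumed peak visibility frequency, cycles per degree
PIXEL_PITCH_MM = 0.25 # assumed display pixel pitch

near = cycles_per_pixel(PEAK_CPD, 500.0, PIXEL_PITCH_MM)
far = cycles_per_pixel(PEAK_CPD, 2000.0, PIXEL_PITCH_MM)
print(round(near, 4), round(far, 4))
```

As the viewer moves away, the same visual frequency maps to a lower frequency on the pixel grid, so more of the high frequency coefficients of each block fall beyond the peak.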
Daly et al., U.S. Pat. No. 4,774,574, incorporated by reference herein, disclose a system for transmitting a digital image where the spatial frequency coefficients are quantized in accordance with a model of the visibility to the human eye of the quantization error in the presence of image detail. The human visual system is less sensitive to different spatial frequencies in the presence of a nonuniform image than in the presence of a uniform image, a phenomenon referred to as visual masking. Accordingly, Daly et al. teach a method of reducing the bitrate for transmission in those regions of the image to which the human eye is not especially sensitive.
Daly et al., U.S. Pat. No. 4,780,761, incorporated by reference herein, disclose an image compression system that incorporates in its model of the human visual system the fact that the human visual system is less sensitive to diagonally oriented spatial frequencies than to horizontally or vertically oriented spatial frequencies.
Aravind et al., U.S. Pat. No. 5,213,507, incorporated herein by reference, disclose a video signal compression system suitable for MPEG environments. The system develops the quantization parameter for use in encoding a region of an image based on (a) a categorization of the region into one of a predetermined plurality of perceptual noise sensitivity (PNS) classes, (b) a level of psycho-visual quality that can be achieved for the encoded version of the image, and (c) a prestored empirically derived model of the relationship between the PNS classes, the psycho-visual quality levels, and the values of the quantization parameter. PNS indicates the amount of noise that would be tolerable to a viewer of the region. Some characteristics on which PNS class may be based are: spatial activity, speed of motion, brightness of the region, importance of the region in a particular context, presence of edges within the region, and texture of the region, e.g., from “flat” to “highly textured.” The PNS classes may also include the combination of the characteristics of a region of the image. Aravind et al. also attempt to design a system that minimizes the bitrate based on the content of the image.
All of the aforementioned systems are designed to reduce the bitrate for transmission of moving images over a computer network based on the content of the image and a model of the human visual system. However, all aforementioned systems fail to consider the resultant image quality based on factors outside of the image content and the presumed location of the viewer. What is desired, therefore, is a video encoding system that incorporates the activity of the viewer and particulars of the display device in determining the necessary image quality to be transmitted across the computer network.
The present invention overcomes the aforementioned drawbacks of the prior art by providing a method of encoding video for transmission through a computer network. An encoder receives a video input that includes initial video data and encodes the initial video data as encoded video data, such that the encoded video data comprises fewer bytes than the initial video data. The encoded video data is transmitted through the computer network to a decoder that receives the encoded video data and reconstructs an image representative of the video input for viewing on a display. A sensor senses at least one of viewer information representative of at least one of a location and movement of a viewer, and display information identifying the display. Viewer data representative of the at least one of the viewer information and the display information is transmitted to the encoder to modify the method of encoding the initial video data.
In the preferred embodiment, the viewer information includes data representative of at least one of how far a viewer is from the display, the angle of the viewer in relation to the display, the portion of the display that the viewer is viewing, movement of the viewer in relation to the display, and changes in the portion of the display that the viewer is viewing. The display information includes data representative of at least one of the type of the display and the size of the display. By changing the focus from encoding the video solely based on the content of the video itself to include viewer information and display information, the bandwidth required for transmitting the encoded video data can be further reduced.
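One way viewer data might influence the encoder is sketched below. This is purely illustrative: the text above does not specify how the encoding is modified, so the policy shown (coarser quantization as the viewer moves farther away relative to the display size), the `ViewerData` fields, the reference ratio, and the upper limit on the quantization factor are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ViewerData:
    """Hypothetical record sent from the sensor to the encoder."""
    distance_mm: float          # how far the viewer is from the display
    display_diagonal_mm: float  # size of the display


def choose_quantization_factor(base_q, data, reference_ratio=3.0, max_q=31):
    """Coarsen quantization when the viewer is far relative to display size.

    At or inside reference_ratio display diagonals, base_q is used unchanged;
    beyond it, the factor grows in proportion to distance, capped at max_q.
    """
    ratio = data.distance_mm / data.display_diagonal_mm
    scale = max(1.0, ratio / reference_ratio)
    return min(max_q, max(1, round(base_q * scale)))


near = ViewerData(distance_mm=900.0, display_diagonal_mm=600.0)
far = ViewerData(distance_mm=3600.0, display_diagonal_mm=600.0)
print(choose_quantization_factor(8, near), choose_quantization_factor(8, far))
```

A coarser factor rounds more coefficients to zero, so fewer bytes are transmitted when the viewer is too far away to perceive the lost detail.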