Exemplary Problems in the Video Processing Arts
Video processing technology in use today is best understood by tracing the evolution of such technology over the years. Features have been added at various stages in the evolution to address problems facing the industry at those times. To maintain compatibility and consistency, later technology may have retained some of these features, even though the problems that the features were designed to solve had since vanished. As a result, current technology can be viewed as an agglomeration of such historical-based features, reflecting a series of prior problems facing the industry at different times, compromises among standards groups, changing technological-based constraints and opportunities, and so forth.
One consequence of the above-described nature of video processing technology is that those working in the field have developed entrenched mindsets regarding certain aspects of video processing technology. There are fixed notions regarding how to interpret certain video information and fixed notions regarding how to “correctly” process such video information. As appreciated by the present inventors, many of these settled notions are not well founded and need to be reconsidered.
Chief among the fixed notions is that video information should generally be processed in the form that it is received, either from a broadcast source, a storage medium (e.g., a DVD disc), or other source. However, many video standards were not designed with the expectation that the video information would be processed prior to display. For example, conventional televisions do not accommodate complex processing functionality; these devices simply receive and display video information. As such, the form in which the video information is received may not readily accommodate the efficient processing of such information.
As a result, the direct application of standard processing algorithms on many accepted forms of video information produces various artifacts. Those skilled in the art have taken note of these artifacts on some occasions. However, rather than questioning the basic premises of the techniques being employed, these practitioners have often resorted to local patches to remedy the problems. These solutions may mask the problems in certain application-specific situations, but do not solve the problems in general.
For example, video information is often received by a video processing pipeline in a form that is nonlinear, interlaced, chroma subsampled, and expressed in some variant of a luma-related color space (e.g., Y′U′V′ information). (The term “nonlinear” means that there is a nonlinear relationship between an incoming signal and resultant output brightness produced from this signal; other terms in the preceding sentence will be explicated fully below.) Practitioners may attempt to apply various linear-type processing algorithms to this information to modify it in a prescribed manner, such as by resizing the video information, combining the video information with other information (e.g., compositing), and so forth. As appreciated by the present inventors, many of these algorithms do not provide optimal or even correct results when processing nonlinear video information of this nature. Working only with interlaced chroma subsampled 4:2:2 or 4:2:0 information (to be defined below) compounds these poor results. For instance, processing information in 4:2:2 or 4:2:0 can result in the propagation of errors through different stages of the video processing pipeline.
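The mismatch between linear-type algorithms and nonlinear information can be illustrated with a minimal sketch. It assumes, purely for illustration, an idealized power-law relationship with an exponent of 2.5 and sample values normalized to [0, 1]; the function names are hypothetical and are not part of any standard.

```python
GAMMA = 2.5  # assumed idealized power-law exponent, for illustration only


def to_linear(v_prime):
    """Decode a nonlinear sample in [0, 1] to linear light."""
    return v_prime ** GAMMA


def to_nonlinear(v):
    """Encode a linear-light sample in [0, 1] to its nonlinear form."""
    return v ** (1.0 / GAMMA)


def average_naive(a_prime, b_prime):
    """Average two nonlinear samples directly (the shortcut criticized above)."""
    return (a_prime + b_prime) / 2.0


def average_linear(a_prime, b_prime):
    """Linearize both samples, average in linear light, then re-encode."""
    return to_nonlinear((to_linear(a_prime) + to_linear(b_prime)) / 2.0)


black, white = 0.0, 1.0
naive = average_naive(black, white)      # nonlinear code value 0.5
correct = average_linear(black, white)   # nonlinear code value near 0.758
```

Under these assumptions, the naive average of black and white is displayed at a luminance of about 0.18 rather than the intended 0.5, which is one way a resizing or compositing filter can darken edges when applied directly to nonlinear samples.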
The deficiencies in the processed results are manifested in various artifacts, which may or may not be apparent to the naked eye. Again, those skilled in the art may have noticed the poor results, but have not identified the causes. In some cases, this may be due to practitioners' failure to fully understand the complex nature of many video coding standards. In other cases, practitioners may be unaware that they are using linear algorithms to process nonlinear information; indeed, in some cases the practitioners may incorrectly believe that they are dealing with linear information. Also, the general focus in the video processing art has been aimed at the production of image information, not necessarily the intermediary processing and correction of such information.
The application of linear-type algorithms to nonlinear information is just one example of the above-described entrenched mindset in the video processing art. As will be described below, many other techniques have become fixed which do not produce optimal results, such as in the case of dithering. For example, practitioners may attempt to remedy artifacts caused by some dithering-quantization algorithms by adding a small amount of random noise to input image information and then quantizing the resultant noisy image. These techniques assess the quantization error by then computing the difference between the noisy image and the quantized result. This may have the effect of curing the dithering artifacts, but at the price of making the output image noisier in proportion to the amount of random noise added to the original image information.
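The noise-based remedy just described can be sketched roughly as follows. This is a simplified, one-dimensional illustration under stated assumptions: samples are normalized to [0, 1], the noise is uniform, and the error is carried only to the next sample in the row. The names `quantize` and `dither_row_with_noise` are hypothetical.

```python
import random


def quantize(value, levels):
    """Round a value to the nearest of `levels` evenly spaced steps in [0, 1]."""
    step = 1.0 / (levels - 1)
    return min(1.0, max(0.0, round(value / step) * step))


def dither_row_with_noise(row, levels, noise_amplitude, seed=0):
    """Add noise, quantize, and carry the error measured against the noisy value."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    carried = 0.0
    out = []
    for sample in row:
        noisy = sample + rng.uniform(-noise_amplitude, noise_amplitude)
        target = noisy + carried
        q = quantize(target, levels)
        carried = target - q  # the error includes the injected noise
        out.append(q)
    return out


row = [0.3] * 1000
result = dither_row_with_noise(row, levels=2, noise_amplitude=0.05)
mean_out = sum(result) / len(result)
```

Because the error is computed against the noisy value rather than the original sample, the injected noise is preserved in the output instead of being cancelled, which corresponds to the drawback noted above.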
There are many other instances of settled ideas in the video processing art that continue to be applied, because of custom and familiarity, without recognition of their significant but subtle drawbacks. The general theme of the improvements described herein involves the reconsideration of these rigid ideas, coupled with the design of alternative solutions.
The video processing field is rich in terminology. Accordingly, as a preliminary matter, a brief introduction to certain topics in the video processing field will be set forth below to assist the reader. For instance, several of the terms used above in passing (linear, interlaced, luma, chroma-subsampled, etc.) are defined below. As a general matter of terminology, the term “image information” will be used throughout this document to represent a broad class of information that can be rendered as any kind of visual output, including, but not limited to, motion video information.
Background Concepts
Color Space and Related Considerations
Colors can be specified using three components. An image stream that relies on the transmission of color content using discrete color components is referred to as component video. One common specification defines color using red, green and blue (RGB) components. More formally, the RGB components describe the proportional intensities of the reference lamps that create a perceptually equivalent color to a given spectrum. In general, an RGB color space can be specified by the chromatic values associated with its color primaries and its white point. The white point refers to the chromaticity associated with a reference white color.
Electronic apparatuses that reproduce color images complement the trichromatic nature of human vision by providing three types of light sources. The three types of light sources produce different spectral responses that are perceived as different colors to a human observer. For instance, a cathode ray tube (CRT) provides red, green and blue phosphors to create different colors, thus complementing some variant of the RGB color space discussed above. Other technologies do not use phosphors, but otherwise reproduce color using light sources that emit at least three kinds of light.
However, the RGB coding model is not an efficient choice for the transmission of image information, and does not conform well with some older standards. Accordingly, image information is commonly transmitted to a target apparatus using some coding model other than RGB. Upon receipt, the image information can be internally transformed by a display apparatus into a RGB-related color space for presentation. As will be described below under the heading “Gamma Considerations,” each R, G, or B component data can be expressed in terms of its pre-gamma corrected form, referred to as R′, G′ and B′ values. (Generally, as per convention, the prime denotes nonlinear information in this disclosure.)
A common tactic in this regard is to define color by reference to a luminance-related component (Y) and chroma-related components. Luminance generally refers to the perceived intensity (brightness) of light. Luminance can be expressed in a pre-gamma-corrected form (in the manner described below under “Gamma Considerations”) to yield its nonlinear counterpart, referred to as “luma” (Y′). The chroma components define the color content of the image information relative to the luma. For example, in the digital domain, the symbol “Cb” corresponds to an n bit integer scaled representation of the difference B′−Y′ (typically from the range of −127 . . . 128 in 8 bit values), and the symbol “Cr” corresponds to an n bit integer scaled representation of the difference R′−Y′. The symbol “Pb” refers to the analog counterpart of Cb, and the symbol “Pr” refers to the analog counterpart of Cr. The symbols Pb and Pr can also refer to the digital normalized form of Cb or Cr with a nominal range of [−0.5 . . . 0.5]. The component image information defined by CbCr and PbPr may be formally primed (e.g., Cb′Cr′ and Pb′Pr′) as they represent nonlinear information. However, since Pb, Pr, Cb, or Cr always refer to nonlinear data, the primed notation is often dropped as a matter of convenience and convention (for example, the notation Y′PbPr is used instead of Y′Pb′Pr′).
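The relationship between R′G′B′, Y′PbPr and Y′CbCr can be sketched as follows. The text above does not specify particular coefficients, so this sketch assumes the ITU-R BT.601 luma weights and the common 8-bit studio-range encoding (Y′ in 16 to 235, Cb and Cr offset to center on 128); other standards use different constants. The function names are hypothetical.

```python
# Assumed BT.601 luma weights; other standards (e.g., BT.709) differ.
KR, KB = 0.299, 0.114
KG = 1.0 - KR - KB


def rgb_prime_to_ypbpr(r, g, b):
    """Convert nonlinear R'G'B' in [0, 1] to Y' in [0, 1] and Pb, Pr in [-0.5, 0.5]."""
    y = KR * r + KG * g + KB * b
    pb = 0.5 * (b - y) / (1.0 - KB)  # scaled B' - Y' difference
    pr = 0.5 * (r - y) / (1.0 - KR)  # scaled R' - Y' difference
    return y, pb, pr


def ypbpr_to_ycbcr8(y, pb, pr):
    """Scale to 8-bit integers using the assumed studio-range convention."""
    return (round(16 + 219 * y), round(128 + 224 * pb), round(128 + 224 * pr))


white = ypbpr_to_ycbcr8(*rgb_prime_to_ypbpr(1.0, 1.0, 1.0))
blue = ypbpr_to_ycbcr8(*rgb_prime_to_ypbpr(0.0, 0.0, 1.0))
```

Under these assumptions, a neutral color such as white carries no chroma (Cb and Cr both sit at the 128 midpoint), while pure blue drives Cb to the top of its range.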
Color content can also be communicated as composite video (rather than the above-described component video). Composite signals combine luma and chroma information in one signal. For instance, in the coding system Y′UV, U represents a scaled version of B−Y and V represents a scaled version of R−Y. These luma and chroma components are then processed to provide a single signal. The coding system Y′IQ defines another composite coding system formed by transforming the U and V components in a prescribed manner. One reason that the industry has historically promoted the use of Y-related color spaces (Y′CbCr, Y′PbPr, YUV, YIQ, etc.) is because reducing color image information in these color spaces can be performed more easily compared to image information expressed in the RGB color space. These color spaces are also backward compatible with older standards developed for black and white image information. The term “luma-related information” generally refers to any color space that has a brightness-related component and chroma-related components, and encompasses at least all of the color spaces mentioned above.
It is generally possible to transform color content from one color space to another color space using one or more matrix affine transformations. More formally, the property of metamerism makes it possible to express one set of color space coefficients in terms of another set of matching functions (where “metamers” refer to two spectra which map to the same set of color space coefficients, and hence appear to be perceptually identical; that is, they look like the same color).
Gamma Considerations
Cathode ray tubes (CRTs) do not have a linear response transfer function. In other words, the relationship of voltage applied to a CRT and the resultant luminance produced by the CRT does not define a linear function. More specifically, the predicted theoretical response of a CRT follows a 5/2 power law; that is, for a given input voltage “V,” the CRT's resultant luminance “L” can be computed as L = V^2.5. The transfer function is also referred to herein as a “gamma response function,” and the exponent of the voltage signal is referred to as the “gamma.”
On the other hand, when image information is captured by a camera or generated by a 3-D rendering system, image information is expressed in a linear RGB color space, meaning that there is a linear relationship between incoming signal and output brightness. To address the disparity between the linearity of the camera and the nonlinearity of the display, cameras conventionally pre-compensate the signal they produce by applying the inverse of the gamma. In other words, the transfer function of the camera (sometimes referred to as the encoding transfer function) is approximately the inverse function of the CRT luminance response. The result of the application of the encoding transfer function (or the reverse gamma) is to produce “gamma-corrected” image information which is nonlinear in form. When the nonlinear signal is passed through the display device, a close-to-linear luminance is produced. Once again, according to the notation described above, the nonlinear (or precompensated) image information is denoted by priming its components, e.g., R′G′B′ or Y′CbCr (where the primes on the Cb and Cr components are implied).
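The pre-compensation described above can be written out as a short worked equation. It assumes, for illustration only, that both the camera encoding and the display follow an idealized pure power law with the theoretical exponent of 2.5; actual encoding transfer functions only approximate this inverse. The subscripted symbols are introduced here for clarity and do not appear elsewhere in this document.

```latex
V = L_{\mathrm{scene}}^{1/2.5},
\qquad
L_{\mathrm{display}} = V^{2.5}
  = \left( L_{\mathrm{scene}}^{1/2.5} \right)^{2.5}
  = L_{\mathrm{scene}}
```

For example, under this assumption a normalized scene luminance of 0.18 is encoded as a signal of approximately 0.18^0.4 ≈ 0.50, and the display's 2.5 power response maps that signal back to a luminance of approximately 0.18.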
It has thus become commonplace and standard to store and transmit image information in its luma-chroma nonlinear (compensated) form. To maintain compatibility, any source producing a signal to be displayed on a CRT should also first apply the inverse function to the signal.
As a special consideration, encoding of image information using a transfer function commonly applies a special approximation function for the low voltage portion of the function. Namely, encoding techniques commonly provide a linear segment in this portion to reduce the effects of noise in the imaging sensor. This segment is referred to as a “linear tail,” having a defined “toe slope.” This segment improves the quality of image information presented on actual CRTs, as these devices have linear luminance-voltage responses near 0 due to the physical construction of these devices.
Sampling and Alignment of Chroma Information Relative to Luma Information
Human vision is more responsive to changes in light intensity than to changes in the chromatic components of light. Coding systems take advantage of this fact to reduce the amount of chroma (CbCr) information that is coded relative to the amount of luma information (Y′). This technique is referred to as chroma sub-sampling. A numeric notation represented generically as L:M:N can be used to express this sampling strategy, where “L” represents the sampling reference factor of the luma component (Y′), and “M” and “N” refer to the chroma sampling (e.g., Cb and Cr, respectively) relative to the luma sampling (Y′). For instance, the notation 4:4:4 can denote Y′CbCr data in which there is one chroma sample for every luma sample. The notation 4:2:2 can denote Y′CbCr data in which there is one chroma sample for every two luma samples (horizontally). The notation 4:2:0 can denote Y′CbCr data in which there is one chroma sample for every two-by-two cluster of luma samples. The notation 4:1:1 can denote Y′CbCr data in which there is one chroma sample for every four luma samples (horizontally).
In those circumstances where the coding strategy provides more luma information than chroma information, a decoder can reconstruct the “missing” chroma information by performing interpolation based on the chroma information that is supplied. More generally, downsampling refers to any technique that produces fewer image samples in comparison with an initial set of image samples. Up-sampling refers to any technique that produces more image samples in comparison with the initial set of image samples. Thus, the above-described interpolation defines a type of up-sampling.
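The down-sampling and interpolation steps can be sketched for a single row of 4:2:2 chroma. This is a minimal illustration that assumes cosited chroma samples, simple decimation without prefiltering, and linear interpolation; practical decoders may use longer filters. The function names are hypothetical.

```python
def subsample_422(chroma_row):
    """Keep one chroma sample for every two luma positions (simple decimation)."""
    return chroma_row[::2]


def upsample_422(sub_row, full_width):
    """Rebuild full-width chroma by linear interpolation between retained samples."""
    out = []
    for x in range(full_width):
        i = x // 2
        if x % 2 == 0 or i + 1 >= len(sub_row):
            # Retained (cosited) position, or the right edge: reuse the sample.
            out.append(float(sub_row[i]))
        else:
            # "Missing" position: average the two neighboring retained samples.
            out.append((sub_row[i] + sub_row[i + 1]) / 2.0)
    return out


original = [100, 110, 120, 130, 140, 150]
stored = subsample_422(original)
restored = upsample_422(stored, len(original))
```

In this sketch a smooth ramp is recovered exactly except at the right edge, where no neighboring sample exists to interpolate from; sharper chroma transitions would not be recovered as faithfully.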
Coding strategies also specify the manner in which chroma samples are spatially “aligned” to the corresponding luma samples. Coding strategies differ in this regard. Some align the chroma samples with the luma samples, such that the chroma samples are directly positioned “over” the luma samples. This is referred to as cositing. Other strategies position chroma samples in interstitial spaces within the two-dimensional array of luma samples.
Quantization Considerations
Quantization refers to the methodology whereby discrete numeric values are assigned to the signal amplitudes of color components (or black and white information). In the digital domain, the numeric values span a prescribed range (gamut) of color space values in a prescribed number of steps. It is common, for instance, to use 256 levels (255 steps) for describing each component value, such that each component can assume a value from 0 to 255. It is common to express each color value using 8 bits.
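The assignment of 8-bit values can be sketched as follows, assuming component amplitudes normalized to [0, 1] and a full-range mapping to 0 through 255 with round-to-nearest; particular standards may reserve part of this range. The function names are hypothetical.

```python
def quantize_8bit(value):
    """Map a normalized amplitude in [0, 1] to an integer code in 0..255."""
    clamped = min(1.0, max(0.0, value))
    return int(clamped * 255 + 0.5)  # round to the nearest step


def dequantize_8bit(code):
    """Map an 8-bit code back to a normalized amplitude."""
    return code / 255.0


code = quantize_8bit(0.5)
round_trip_error = abs(dequantize_8bit(code) - 0.5)
```

The round-trip error in this sketch is bounded by half a step (0.5/255), which is the round-off effect that error dispersion algorithms attempt to distribute.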
Converting from a high precision number to a lower precision number can sometimes produce various artifacts. Various error dispersion algorithms have been devised to address this problem, such as the Floyd-Steinberg algorithm. Error dispersion algorithms can distribute the errors produced by the round-off effects of quantization to neighboring pixel locations. Further background information regarding the Floyd-Steinberg algorithm is presented within the body of the Detailed Description to follow.
Interlaced vs. Progressive Representation Considerations
Originally, televisions displayed only black and white image information in top-down progressive sweep fashion. Today, conventional television signals are scanned in interlaced fashion. In interlacing, a first field of a video frame is captured, followed, shortly thereafter, by a second field of the video frame (e.g., 1/50 or 1/60 seconds thereafter). The second field is vertically offset relative to the first field by a slight amount, such that the second field captures information in the interstitial spaces between scanning lines of the first field. Video information is presented by displaying the first and second fields in quick succession so that the video information is generally perceived by a human viewer as a single contiguous flow of information.
However, computer monitors and other presentation equipment display image information in progressive, not interlaced, fashion. Thus, in order for an apparatus to present interlaced information on a computer monitor, it must display progressive frames at the interlaced field rate by interpolating the data for the opposite field (a process referred to as “de-interlacing”). For example, to display an interlaced field, it must interpolate the “missing” data for the spatial location between the lines by examining the fields on either side. The term “progressive format” refers generally to any non-interlaced image format.
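A very simple form of de-interlacing can be sketched as follows. This sketch is a simplification: it fills each missing line by averaging the lines above and below it within a single field, whereas the approach described above may also examine neighboring fields in time. Frames are represented as lists of scan lines, and the function names are hypothetical.

```python
def split_fields(frame):
    """Separate a frame's scan lines into a top (even) and a bottom (odd) field."""
    return frame[0::2], frame[1::2]


def deinterlace_top_field(top_field, height):
    """Rebuild a progressive frame from the top field by vertical interpolation."""
    frame = []
    for y in range(height):
        i = y // 2
        if y % 2 == 0:
            frame.append(list(top_field[i]))  # line present in the field
        elif i + 1 < len(top_field):
            above, below = top_field[i], top_field[i + 1]
            frame.append([(a + b) / 2.0 for a, b in zip(above, below)])
        else:
            frame.append(list(top_field[i]))  # bottom edge: repeat last line
    return frame


source = [[0, 0], [10, 10], [20, 20], [30, 30]]
top, bottom = split_fields(source)
progressive = deinterlace_top_field(top, len(source))
```

Interpolating within one field avoids combining lines captured at different instants, at the cost of reduced vertical detail.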
Image information (e.g., from a video camera) is typically stored in an interlaced form, e.g., where the first field is separately stored (semantically) from the second field. If image information is simply to be displayed on an interlaced TV display, its Y′UV interlaced information can be passed directly to the CRT. The CRT internally converts the Y′UV information to R′G′B′ information and drives the output guns using this signal.
Interlacing is advantageous because it doubles the effective vertical resolution of image information. However, interlacing can also introduce artifacts. This is because objects can move at 60 Hz, but, in interlaced presentation, only half of the information is shown in each field, so that a complete frame is presented at only 30 Hz. The resultant artifact produced by this phenomenon is sometimes referred to as “feathering.” The artifact manifests itself particularly in the display of high motion video, where objects appear to separate into even and odd lines.
Additional information regarding each of the above topics may be found in a number of introductory texts, such as Charles Poynton's well-regarded Digital Video and HDTV (Morgan Kaufmann Publishers, 2003).