Human vision relies on rod photoreceptor cells that respond to very low levels of light and cone photoreceptor cells that respond to color. The cone cells generally respond to three portions of the visible electromagnetic spectrum, namely long wavelength (e.g., generally corresponding to red), medium wavelength (e.g., generally corresponding to green), and short wavelength (e.g., generally corresponding to blue). As such, all colors can be expressed as different combinations of at least three different color components. Generally, color itself is a complex phenomenon that ensues from both the physical aspects of electromagnetic radiation in the visible portion of the spectrum and the vision-related and cerebral “mechanisms” used to process such information. For instance, human vision is more responsive to the intensity of light than to the color (chroma) components of light.
Electronic apparatuses that reproduce color images complement the trichromatic nature of human vision by providing three types of light sources. The three types of light sources produce different spectral responses that are perceived as different colors to a human observer. For instance, a cathode ray tube (CRT) provides red, green and blue phosphors to create different colors. Other technologies do not use phosphors, but otherwise reproduce color using light sources that emit at least three kinds of light.
The Commission Internationale de L'Éclairage (CIE) has set forth a comprehensive system that maps the spectral features of light to different perceived colors. In connection therewith, the term “matching function” refers to statistically tabulated response curves (usually to short, medium and long wavelengths) of an “average” viewer to a set of reference lamps at each wavelength. For red, green, and blue, these functions are represented as r(w), g(w) and b(w), respectively, where “w” denotes wavelength. Such reference lamps—or color primaries—define the light sources (typically monitor phosphors) used by an apparatus to reproduce image information having color content. The term “color space” refers to a specification defined by a set of color primaries and matching functions.
An abstract color specification can mathematically map tuples of chromaticities into different colors in the manner described above. However, a number of specific coding systems have been developed to ensure a more efficient coding scheme that can be applied to real-world applications, such as the transmission and presentation of color image information. The real-world application that first confronted the industry was the broadcast and presentation of analog television signals. More recent applications involve the transmission and presentation of digital video information over networks, such as TCP/IP networks (e.g., the Internet). Further, the industry now accommodates the transmission and presentation of high definition (HD) video information in addition to standard definition (SD) video information. The features of a coding system can thus often be traced back to certain problems confronted by the industry at certain times.
Whatever their approach, coding systems address a common set of issues that arise in the reproduction of image information having color content. The following discussion provides an overview of common issues that coding systems are likely to address in one form or another. (As to terminology, the term “image information” is used in this disclosure to represent any information that can be displayed to a user; this term is used broadly to encompass both still image information and moving video information.)
Color Space and Related Considerations
Colors can be specified using three components. An image stream that relies on the transmission of color content using discrete color components is referred to as component video. One common coding approach specifies color using red, green and blue (RGB) components. More formally, the RGB components describe the proportional intensities of the reference lamps that create a perceptually equivalent color to a given spectrum. For example, the R component can be defined by:
$$R = \int_{300\,\mathrm{nm}}^{700\,\mathrm{nm}} L(w)\, r(w)\, dw,$$
where L(w) corresponds to a given spectrum and r(w) corresponds to the matching function for the color space. In general, an RGB color space can be specified by the chromatic values associated with its color primaries and its white point. The white point refers to the chromaticity associated with a reference white color.
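The integral that defines the R component can be approximated numerically. The following is a minimal sketch using the trapezoidal rule; the function names and the two input curves are hypothetical, and the box-shaped curve is a stand-in chosen so the result is easy to check by hand, not a real CIE matching function.

```python
def tristimulus(spectrum, matching, start_nm=300.0, stop_nm=700.0, steps=400):
    """Approximate the integral of spectrum(w) * matching(w) dw (trapezoidal rule)."""
    dw = (stop_nm - start_nm) / steps
    total = 0.0
    for i in range(steps + 1):
        w = start_nm + i * dw
        # End points of the integration range count half in the trapezoidal rule.
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * spectrum(w) * matching(w)
    return total * dw


def flat_spectrum(w):
    """Hypothetical spectrum L(w): equal power at every wavelength."""
    return 1.0


def box_matching(w):
    """Hypothetical stand-in for r(w): responds only between 600 nm and 700 nm."""
    return 1.0 if 600.0 <= w <= 700.0 else 0.0


# The exact integral of this toy example is 100; the sum lands close to it.
R = tristimulus(flat_spectrum, box_matching)
print(R)
```

The same routine would yield G and B if it were given the corresponding matching functions g(w) and b(w).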
Computer monitors generally use the RGB model to present color content to users. However, the RGB coding model may be an inefficient choice for the transmission of image information. Accordingly, image information is commonly transmitted to a target apparatus using some coding model other than RGB. Upon receipt, the image information can be transformed into the RGB color space for display, e.g., using a 3×3 affine transformation. As will be described below under the heading “Gamma Considerations,” each R, G, or B component data can also be expressed in terms of its pre-gamma corrected form, referred to as R′, G′ and B′ values. (Generally, as per convention, the prime denotes nonlinear information in this disclosure.)
A common tactic in this regard is to define color by reference to a luminance-related component (Y) and chroma-related components. Luminance generally refers to the perceived intensity (brightness) of light. Luminance can be expressed in a pre-gamma-corrected form (in the manner described below under “Gamma Considerations”) to yield its nonlinear counterpart, referred to as “luma” (Y′). The chroma components define the color content of the image information relative to the luma. For example, in the digital domain, the symbol “Cb” corresponds to an n bit integer scaled representation of the difference B′-Y′ (typically from the range of −127 . . . 128 in 8 bit values), and the symbol “Cr” corresponds to an n bit integer scaled representation of the difference R′-Y′. The symbol “Pb” refers to the analog counterpart of Cb, and the symbol “Pr” refers to the analog counterpart of Cr. The symbols “Pb” and “Pr” can also refer to the digital normalized form of Cb or Cr with a nominal range of [−0.5 . . . 0.5]. The component image information defined by CbCr and PbPr may be formally primed (e.g., Cb′Cr′ and Pb′Pr′) when they represent nonlinear information.
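The relationship between R′G′B′ values and luma/chroma values can be sketched as follows. This is an illustration under stated assumptions: it uses the luma weights of ITU-R BT.601 and the offset-binary 8-bit scaling commonly paired with that standard (luma 16 to 235, chroma centered on 128), which is only one of several possible integer representations. The function names are hypothetical.

```python
# Assumed luma weights (ITU-R BT.601); other standards use other weights.
KR, KG, KB = 0.299, 0.587, 0.114


def rgb_to_ypbpr(r, g, b):
    """Convert nonlinear R'G'B' in [0, 1] to Y' in [0, 1] and Pb, Pr in [-0.5, 0.5]."""
    y = KR * r + KG * g + KB * b
    pb = (b - y) / (2.0 * (1.0 - KB))  # scaled B' - Y'
    pr = (r - y) / (2.0 * (1.0 - KR))  # scaled R' - Y'
    return y, pb, pr


def ypbpr_to_ycbcr8(y, pb, pr):
    """Scale normalized Y'PbPr to 8-bit integers (assumed 16..235 / 16..240 ranges)."""
    return (round(16 + 219 * y), round(128 + 224 * pb), round(128 + 224 * pr))


for name, rgb in [("black", (0, 0, 0)), ("white", (1, 1, 1)), ("blue", (0, 0, 1))]:
    print(name, ypbpr_to_ycbcr8(*rgb_to_ypbpr(*rgb)))
```

Neutral colors (black, white, grays) map to chroma values at the center of the range, which reflects the fact that the chroma components carry only the color difference relative to luma.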
Color content can also be communicated as composite video (rather than the above-described component video). Composite signals combine luma and chroma information in one signal. For instance, in the coding system Y′UV, U represents a scaled version of B-Y and V represents a scaled version of R-Y. These luma and chroma components are then processed to provide a single signal (e.g., in the manner set forth in the National Television System Committee (NTSC) format or Phase Alternate Line (PAL) format). The coding system Y′IQ defines another composite coding system formed by transforming the U and V components in a prescribed manner. Generally, the industry has historically promoted the use of Y-related color spaces (Y′CbCr, Y′PbPr, Y′UV, Y′IQ, etc.) because reducing the amount of color image information is easier in these color spaces than in the RGB color space.
It is generally possible to transform color content from one color space to another color space using one or more matrix affine transformations. More formally, the property of metamerism makes it possible to express one set of color space coefficients in terms of another set of matching functions (where “metamers” refer to two spectra which map to the same set of color space coefficients, and hence appear to be perceptually identical—that is, that look like the same color).
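A color space conversion of this kind reduces to a matrix multiplication per pixel. The sketch below assumes the widely published BT.601-derived coefficients for converting between R′G′B′ and Y′PbPr, rounded to six decimal places; the matrix and helper names are hypothetical.

```python
# Assumed forward matrix: R'G'B' -> Y'PbPr (BT.601-derived, rounded coefficients).
RGB_TO_YPBPR = (
    (0.299, 0.587, 0.114),
    (-0.168736, -0.331264, 0.5),
    (0.5, -0.418688, -0.081312),
)

# Assumed inverse matrix: Y'PbPr -> R'G'B'.
YPBPR_TO_RGB = (
    (1.0, 0.0, 1.402),
    (1.0, -0.344136, -0.714136),
    (1.0, 1.772, 0.0),
)


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-component vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


rgb = (0.25, 0.5, 0.75)
ypbpr = mat_vec(RGB_TO_YPBPR, rgb)
round_trip = mat_vec(YPBPR_TO_RGB, ypbpr)
print(ypbpr)
print(round_trip)  # close to the original rgb, up to coefficient rounding
```

Formats that add offsets (for example, integer chroma centered on a mid-range value) would need an added translation step on top of the matrix product.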
Gamma Considerations
Cathode ray tubes (CRTs) do not have a linear response transfer function. In other words, the relationship of voltage applied to a CRT and the resultant luminance produced by the CRT does not define a linear function. More specifically, the theoretical response of a CRT follows a 5/2 power law; that is, for a given input voltage “V,” the CRT's resultant luminance “L” can be computed as L = V^2.5.
In application, the source of image information (such as a video camera) commonly pre-compensates the image information by applying a transfer function to the image information. The “transfer function” is approximately the inverse function of the CRT luminance response. This transfer function applied at the source—commonly referred to as the encoding transfer function—produces “gamma corrected” nonlinear image information. When the nonlinear signal is passed through the display device, a linear luminance is produced. According to the notation described above, the nonlinear (or precompensated) image information is denoted by priming its components, e.g., Y′Cb′Cr′.
It is common to transmit image information in nonlinear (compensated) form. The presentation device (e.g., CRT) of the receiving apparatus can, due to its inherent nonlinearity, complement the encoding transfer function to provide appropriately transformed color content for consumption.
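The way the encoding transfer function and the CRT response cancel each other can be sketched as follows. This is an idealized illustration that assumes an exact 2.5 power law and signals normalized to the range 0 to 1; the function names are hypothetical.

```python
CRT_GAMMA = 2.5  # theoretical CRT exponent described above


def crt_luminance(voltage):
    """Idealized CRT response: luminance produced for a normalized input voltage."""
    return voltage ** CRT_GAMMA


def gamma_encode(luminance):
    """Idealized encoding transfer function: the exact inverse of the CRT response."""
    return luminance ** (1.0 / CRT_GAMMA)


scene = 0.2                        # linear scene luminance
signal = gamma_encode(scene)       # nonlinear ("gamma corrected") signal
displayed = crt_luminance(signal)  # the display undoes the encoding
print(signal, displayed)
```

Without the encoding step, a mid-level input such as 0.5 would be displayed at roughly 0.18 of full luminance, which is why uncorrected linear signals look too dark on such a display.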
It is common to adjust the exponent of the encoding transfer function to account for the condition in which the image information is likely to be viewed. For instance, video information displayed on conventional televisions is typically presented in a dim viewing environment common in a home setting, while image information displayed on conventional computer monitors is typically presented in a bright viewing environment common to an office setting. Different transfer function adjustments are appropriate to these different viewing environments. For this reason, television video sources typically use a transfer function that is based on the built-in assumption that the image information will be presented in a dim viewing condition. This means that the transfer function applied by the source will commonly under-compensate for the inherent nonlinearity of the CRT.
As another special consideration, encoding of image information using a transfer function commonly applies a special approximation function for the low voltage portion of the function. Namely, encoding techniques commonly provide a linear segment in this portion to reduce the effects of noise in the imaging sensor. This segment is referred to as a “linear tail,” having a defined “toe slope.”
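One concrete instance of a transfer function with a linear tail is the encoding function of ITU-R BT.709, used here purely as an example: below a linear luminance of 0.018 it is a straight line with a toe slope of 4.5, and above that point it is a power function with exponent 0.45. The sketch below assumes those published constants; the function name is hypothetical.

```python
TOE_SLOPE = 4.5       # slope of the linear segment (BT.709 value)
TOE_BREAK = 0.018     # linear luminance at which the power segment takes over
EXPONENT = 0.45
SCALE, OFFSET = 1.099, 0.099


def bt709_encode(luminance):
    """Map linear luminance in [0, 1] to a nonlinear signal, with a linear tail near black."""
    if luminance < TOE_BREAK:
        return TOE_SLOPE * luminance
    return SCALE * luminance ** EXPONENT - OFFSET


for value in (0.0, 0.01, 0.018, 0.5, 1.0):
    print(value, bt709_encode(value))
```

A pure power function has an unbounded slope at zero, so small amounts of sensor noise near black would be strongly amplified; the linear segment caps that gain at the toe slope.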
Sampling and Alignment of Chroma Information Relative to Luma Information
As noted above, human vision is more responsive to light intensity than the chromatic components of light. Coding systems take advantage of this fact to reduce the amount of chroma (Cb′Cr′) information that is coded relative to the amount of luma information (Y′). This technique is referred to as chroma sub-sampling. A numeric notation represented generically as L:M:N can be used to express this sampling strategy, where “L” represents the sampling reference factor of the luma component (Y′), and “M” and “N” refer to the chroma sampling (e.g., Cb and Cr, respectively) relative to the luma sampling (Y′). For instance, the notation 4:4:4 can denote Y′CbCr data in which there is one chroma sample for every luma sample. The notation 4:2:2 can denote Y′CbCr data in which there is one chroma sample for every two luma samples (horizontally). The notation 4:2:0 can denote Y′CbCr data in which there is one chroma sample for every two-by-two cluster of luma samples. The notation 4:1:1 can denote Y′CbCr data in which there is one chroma sample for every four luma samples (horizontally).
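The effect of each sampling scheme on the amount of data can be made concrete by counting samples. The sketch below is illustrative only; the table and function names are hypothetical, and it assumes image dimensions that divide evenly by the sub-sampling factors.

```python
# (horizontal divisor, vertical divisor) applied to each chroma plane.
CHROMA_DIVISORS = {
    "4:4:4": (1, 1),  # one chroma sample per luma sample
    "4:2:2": (2, 1),  # one chroma sample per two luma samples, horizontally
    "4:2:0": (2, 2),  # one chroma sample per two-by-two cluster of luma samples
    "4:1:1": (4, 1),  # one chroma sample per four luma samples, horizontally
}


def chroma_plane_samples(width, height, scheme):
    """Number of samples in one chroma plane (Cb or Cr) for a width x height luma plane."""
    h_div, v_div = CHROMA_DIVISORS[scheme]
    return (width // h_div) * (height // v_div)


def total_samples(width, height, scheme):
    """Luma samples plus the samples of both chroma planes."""
    return width * height + 2 * chroma_plane_samples(width, height, scheme)


for scheme in CHROMA_DIVISORS:
    print(scheme, total_samples(720, 480, scheme))
```

Under these assumptions, 4:2:2 carries two thirds of the samples of 4:4:4, while 4:2:0 and 4:1:1 each carry half.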
In those circumstances where the coding strategy provides more luma information than chroma information, a decoder can reconstruct the “missing” chroma information by performing interpolation based on the chroma information that is supplied. More generally, downsampling refers to any technique that produces fewer image samples in comparison with an initial set of image samples. Upsampling refers to any technique that produces more image samples in comparison with the initial set of image samples. Thus, the above-described interpolation defines a type of upsampling.
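As an illustration of such interpolation, the sketch below upsamples one row of 4:2:2 chroma to full horizontal resolution. It assumes that each supplied chroma sample is aligned with an even-numbered luma sample and uses simple linear interpolation with edge replication; practical decoders may use longer filters. The function name is hypothetical.

```python
def upsample_chroma_row(chroma_row):
    """Double the horizontal resolution of a chroma row by linear interpolation."""
    full = []
    for i, sample in enumerate(chroma_row):
        # Keep the supplied sample at the even position.
        full.append(sample)
        # Fill the odd position with the mean of its two neighbors;
        # at the right edge, replicate the last supplied sample.
        neighbor = chroma_row[i + 1] if i + 1 < len(chroma_row) else sample
        full.append((sample + neighbor) / 2)
    return full


print(upsample_chroma_row([0, 10, 20]))
```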
Coding strategies also specify the manner in which chroma samples are spatially “aligned” to the corresponding luma samples. Coding strategies differ in this regard. Some align the chroma samples with the luma samples, such that the chroma samples are directly positioned “over” the luma samples. This is referred to as cositing. Other strategies position chroma samples in interstitial spaces within the two-dimensional array of luma samples. FIGS. 10-12 (to be discussed below in turn) show different sampling and alignment strategies for presenting luma and chroma information.
Quantization Considerations
Quantization refers to the methodology whereby discrete numeric values are assigned to the signal amplitudes of color components. In the digital domain, the numeric values span a prescribed range (gamut) of color space values in a prescribed number of steps. It is common, for instance, to use 255 steps for describing each component value, such that each component can assume a value from 0 to 255. It is common to express each color value using 8 bits, although color can also be expressed with higher precision (e.g., 10 bits, etc.), as well as with lower precision.
Coding strategies often allocate portions on both ends of the range of quantization levels for representing black levels and white levels, respectively. That is, a coding strategy will often define a reference black level and a reference white level, but also allocate coding levels beyond these reference levels for expressing values that swing beyond reference black and white levels. For example, an 8-bit coding strategy may assign the level 16 to black and the level 235 to white. The remaining levels that are lower than 16 define so-called “toe room,” while the remaining levels over 235 define so-called “head room.”
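The 8-bit example above can be sketched as a mapping from a normalized signal to integer levels, where 0.0 stands for reference black and 1.0 for reference white. The function name is hypothetical, and clamping to the full 0 to 255 range is a simplifying assumption (some standards reserve the extreme codes).

```python
BLACK_LEVEL = 16   # reference black in the 8-bit example above
WHITE_LEVEL = 235  # reference white in the 8-bit example above


def quantize_luma(value):
    """Map a normalized luma value to an 8-bit level, keeping toe room and head room."""
    level = round(BLACK_LEVEL + (WHITE_LEVEL - BLACK_LEVEL) * value)
    # Values that swing past the reference levels land in toe room or head room
    # until they hit the limits of the 8-bit range.
    return max(0, min(255, level))


for value in (-0.05, 0.0, 0.5, 1.0, 1.05):
    print(value, quantize_luma(value))
```

Values slightly below 0.0 or above 1.0 still receive distinct codes, which is the purpose of the toe room and head room.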
Interlaced vs. Progressive Representation Considerations
Conventional television signals are scanned in interlaced fashion. In interlacing, a first field of a video frame is captured, followed, shortly thereafter, by a second field of the video frame (e.g., 1/50 or 1/60 seconds thereafter). The second field is vertically offset relative to the first field by a slight amount, such that the second field captures information in the interstitial spaces between scanning lines of the first field. So-called bob interlacing is one known type of interleaving strategy. The complete video frame is composed by presenting the first and second fields in quick succession so that they are perceived by a human viewer as a single frame of information.
However, computer monitors and other presentation equipment display image information in progressive, not interlaced, fashion. Thus, in order for an apparatus to present interlaced information on a computer monitor, it must display progressive frames at the interlaced field rate by interpolating the data for the opposite field (a process referred to as “deinterlacing”). For example, to display an interlaced field, it must interpolate the “missing” data for the spatial location between the lines by examining the fields on either side. The non-interlaced image format is referred to as the “progressive” format.
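A very simple form of deinterlacing can be sketched as follows. As a simplifying assumption, this version fills each missing line by averaging the lines directly above and below it within the same field, and does not examine neighboring fields as a more capable deinterlacer would. The function and parameter names are hypothetical.

```python
def deinterlace_field(field, top_field=True):
    """Build a full progressive frame from one field by vertical interpolation.

    `field` is a list of scan lines, each a list of sample values. A top field
    supplies frame lines 0, 2, 4, ...; a bottom field supplies lines 1, 3, 5, ...
    """
    frame = []
    for i, line in enumerate(field):
        if top_field:
            # The missing line lies below this one; replicate at the bottom edge.
            below = field[i + 1] if i + 1 < len(field) else line
            frame.append(list(line))
            frame.append([(a + b) / 2 for a, b in zip(line, below)])
        else:
            # The missing line lies above this one; replicate at the top edge.
            above = field[i - 1] if i > 0 else line
            frame.append([(a + b) / 2 for a, b in zip(above, line)])
            frame.append(list(line))
    return frame


print(deinterlace_field([[0, 0], [10, 10]], top_field=True))
print(deinterlace_field([[0, 0], [10, 10]], top_field=False))
```

Each field yields a frame with twice as many lines, so the output frame rate equals the input field rate.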
Additional information regarding each of the above topics may be found in a number of introductory texts, such as Charles Poynton's well-regarded Digital Video and HDTV.
Compounding the above-described complexity, the industry accommodates a large number of different formal standards for representing image information. Standards have been promulgated by a number of organizations and committees, including the International Telecommunication Union (ITU), the European Broadcasting Union (EBU) (which also promotes Digital Video Broadcasting, or DVB), the Audio Engineering Society (AES), the Advanced Television Systems Committee, Inc. (ATSC), the Society of Motion Picture and Television Engineers (SMPTE), Séquentiel couleur avec mémoire (SECAM), National Television System Committee (NTSC), and so forth.
Each of these organizations has carved out particular combinations of coding features from the above-described universe of possible coding options. As such, as appreciated by the present inventors, standards generally differ as to their definition and application of: color primaries; transfer functions; intended viewing conditions; transfer matrices; toe room and head room specifications; chroma sub-sampling and alignment strategies, and so forth. The color primaries (together with the white point reference) define the basic color space of a standard. The transfer function determines how the standard converts between linear image information and nonlinear information. The intended viewing conditions define the assumptions that the standard makes about the viewing environment in which the image information is likely to be consumed (such as the assumption that television will be viewed in a dimly lit home setting). The viewing conditions change the effective gamma and brightness (the black level) and contrast (the white level) of the image information. The transfer matrices determine how the standard converts between different color spaces (e.g., from Y′CbCr to RGB color spaces). The head room and toe room specifications determine the quantization levels that the standard allocates to represent ranges of black and white colors. The chroma sub-sampling and alignment strategies specify the manner in which the chroma information is sub-sampled and positioned relative to the luma information.
Existing standards-related documentation sets forth the requirements of each standard in exacting detail. Representative standards include:
ITU-R Recommendation BT.470 is an international standard that provides specifications for analog and monochrome television apparatus.
ITU-R Recommendation BT.601 is an international standard that defines studio digital coding of image information. This standard uses a Y′CbCr coding of image information.
ITU-R Recommendation BT.709 is an international standard that defines studio coding of high definition video information. High definition (HD) content represents video content that is higher than standard definition (SD), typically 1920×1080, 1280×720 and so forth.
SMPTE 170M is a standard that defines coding of composite analog video information (e.g., NTSC).
SMPTE 240M is a standard that defines coding of analog high definition video information.
IEC 61966-2-1 (sRGB) is a standard for coding image information into 256 levels using an 8-bit quantization scheme.
IEC 61966-2-2 (scRGB) is a standard which defines a linear form of sRGB and significantly expands the color gamut of sRGB.
ISO/IEC 13818 (MPEG-2) is a standard for coding audio and video signals in compressed form.
ISO/IEC 10918-1 (JPEG) is a standard for lossy compression of still image information.
The great variety of coding standards in use today contributes to a number of difficulties in the coding, transmission and processing of image information. By way of overview, video processing pipelines associated with specific apparatuses are often designed to process a particular type of signal having defined formatting; in this limited role, these apparatuses may correctly process such image information in a reliable manner. However, in the context of the wider universe of image information in use today, these apparatuses may lack mechanisms for interpreting the color formatting of other kinds of image information, and for reliably propagating this formatting information through the pipeline. More precisely, the video pipeline may receive information defining certain aspects of the color formatting applied to the received image information, but, as appreciated by the present inventors, the video pipeline may lack suitable mechanisms for reliably propagating this color information down the pipeline to downstream components in the pipeline. As a result, such formatting information is “lost” or “dropped.” Downstream components can address the paucity of information pertaining to the color formatting by “guessing” at the formatting information. When the components guess incorrectly, the pipeline produces image information in a suboptimal or even incorrect manner.
FIG. 1 is presented as a vehicle for further explaining the above potential problem. FIG. 1 shows a high level representation of a video processing pipeline 100. The pipeline 100 includes conventional processing stages defined by an input stage 102, a processing stage 104 and an output stage 106. As to the input stage 102, input source 108 represents any source of image information. The source 108 can generally comprise newly captured image information (e.g., created by a camera or scanner), or previously captured image information that is presented to the input stage 102 via some channel (e.g., received from a disc, over an IP network, etc.). In the former case, capture processing functionality 110 can perform any kind of preliminary processing on the image information received from the source 108. In the latter case, the decoder functionality 112 performs any kind of stream-based information extraction and decompression to produce image data. Generally, such processing can include separating image information from audio information in the received information, uncompressing the information, and so forth. As to the processing stage 104, processing functionality 114 performs any kind of processing on the resulting image information, such as mixing multiple streams of image information together into a composite signal. As to the output stage, output processing functionality 116 represents any kind of processing performed on the processed image information in preparation for its output to an output device 118. Output device 118 may represent a television, a computer monitor, and so forth. Output devices may also represent storage devices. Further, an output “device” (or output functionality 116) can provide compression and formatting functionality (such as multiplexers) that prepare the information for storage on a device, or for distribution over a network.
The bottom row of blocks in FIG. 1 summarizes the above-described deficiencies in known systems. Block 120 indicates that the pipeline functionality (110, 112, 114, 116) fails to accurately interpret the color formatting applied to input signals and/or fails to reliably propagate color information down the pipeline to downstream components. For instance, the pipeline 100 may receive image information that has been coded using a prescribed format. The received information may include certain fields that identify features of the formatting that was used, or these features can be deduced based on other telltale properties of the received information. However, because of the plethora of standards in use, the initial stages of the pipeline 100 lack functionality for properly interpreting this information and passing it to downstream components in the video pipeline 100. As a result, this coding information becomes immediately lost. This can result in the situation in which image information is passed to downstream pipeline components with no guidelines on how the components should interpret this image information; it is essentially just 1's and 0's.
Block 122 represents the manner in which the video pipeline 100 deals with the above difficulty. Namely, the functional components that lack guidelines on how to interpret the color content in the image information often make “guesses” as to how to interpret it. Some guesses are accurate but others are not. To name but a few examples, the video pipeline may make inaccurate assumptions regarding the transfer function that has been applied to the image information (perhaps based on image size), the lighting conditions assumptions inherent in the image information, the chroma sub-sampling scheme used by the image information (based on the data format), and so forth.
Block 124 represents the potential consequences of incorrect guesses. Namely, incorrect guesses can result in sub-optimal or incorrect display quality. An image presentation may appear as having “unnatural” colors or having motion artifacts. Or it may appear as unduly “contrasty,” distorted, inappropriately cropped, and so forth.
There is accordingly a need for a more satisfactory technique for processing image information having color content.