In the early days of color rendering, e.g., for television program display, the relationship between the content creation side (e.g., the camera operator) and the color rendering side (e.g., display on a television or computer display) was simple, and fixed by rigid technical principles. A so-called standard CRT display was defined, which had particular phosphors, a gamma 2.2 tone reproduction curve (TRC), 256 approximately visually equidistant driving steps, etc. A number of fundamental color reproduction questions were addressed in this manner, i.e., should a color rendering system be optimized to the (best) human viewer and, more importantly, should the color rendering capabilities (and in particular the color description/communication standard) be prescribed/determined (mostly) by the color capturing (camera) side or by the color rendering (display) side.
A number of approximations were introduced at the time, as the ground rules for television colorimetry for the decades to come. Taking the physical display constraints of the era of the first color television into account, the first displays and displayed signals were optimized so that they would yield an ideal picture to the viewer, given the size, brightness, etc. of the CRTs available at that time (NTSC, the late 1940s and early 1950s: resolution fine enough for the typical viewing distance, enough driving steps at just noticeable difference (JND) spacing to perceptually reach a good, indiscriminable black starting from the white luminances of the time, etc.).
Then, given the standard display of that time, which was a small, dark CRT, the rules for the content production side were laid down for converting captured scenes into reasonable-looking pictures on the display, for most scenes (similar considerations took place in the world of analog photography, in which a scene had to be rendered in an often low-quality photo print, which never had a contrast above 100:1, had imperfect colors, etc.). E.g., even though theoretically one would need a spectral camera to measure a real-life color scene (given its variable illumination), as an approximation, if one knows on which device the color is to be displayed, camera sensitivity curves can be determined.
Images captured with such camera sensitivity curves are then supposed to reconstruct a similar-looking picture on the display, at least emulating at the same time the illumination of the scene at the capturing side, but in practice there will be errors. In addition, these camera sensitivity curves will have negative lobes. Although one could try to reproduce these theoretically optimal curves exactly with optical filter combinations, in practice (also given that the viewer does not know exactly which colors occur in the scene) matrixing will suffice to make the colors look reasonable.
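As a rough illustration of matrixing, the following Python sketch applies a 3x3 matrix to a linear camera RGB triple. The matrix coefficients and the function name apply_matrix are hypothetical and chosen only for this illustration; they are not taken from any standard or camera. The negative off-diagonal coefficients stand in for the negative lobes that the physical sensitivity curves cannot realize, and each row sums to 1.0 so that neutral colors remain neutral.

```python
# Hypothetical 3x3 correction matrix; the coefficients are illustrative only.
# Each row sums to 1.0 so that neutral (R=G=B) inputs stay neutral.
MATRIX = [
    [1.20, -0.15, -0.05],
    [-0.10, 1.25, -0.15],
    [-0.05, -0.20, 1.25],
]


def apply_matrix(rgb, matrix=MATRIX):
    """Apply a 3x3 matrix to one linear RGB triple, clipping negatives to 0."""
    return tuple(
        max(0.0, sum(coeff * channel for coeff, channel in zip(row, rgb)))
        for row in matrix
    )


if __name__ == "__main__":
    print(apply_matrix((0.5, 0.5, 0.5)))  # a neutral grey stays neutral
    print(apply_matrix((0.8, 0.1, 0.1)))  # a reddish color becomes more saturated
```

With these example coefficients, a grey input is left essentially unchanged, while a reddish input gains red and loses green and blue, which is the kind of saturation correction that matrixing is used for.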
Several content creation side professionals, like the camera operator and a color grader/corrector, have to do their magic with parametric transformations to make the finally encoded images look optimal when displayed. For example, what is usually done by a color corrector (in the video world where different video feeds are combined) is that the color corrector looks at the white points of the different inputs (one global, rather severe type of colorimetric image error), and matches the white points of the different inputs by, for example, slightly increasing the blue contributions of pixels, whilst also looking at critical colors like faces. In movie material, further artistic considerations may be involved, e.g., a slightly bluish look for night scenes may be cast, which, if not already largely created by a color filter matching the film characteristics, will typically be done in post production by a color grader. Another example, which may typically also involve tweaking the tone reproduction curves, is to make the movie look more desaturated, i.e., to give it a desolate look.
It is of even higher importance to take care of the tone reproduction curve gamma behavior. One might suspect that just applying a 0.45 anti-gamma correction to encode the captured linear sensor data will suffice, but apart from that, the larger dynamic range of a typical scene always has to be mapped somehow to the [0, 255] interval. Tone reproduction curve tweaking will also result in, for example, a coarser, high contrast look, darker or more prominent shadows, etc. The camera operator typically has tunable anti-gamma curves available, in which the camera operator may set knee and shoulder points, etc., so that the captured scene has a good look (typically somebody looks at the captured images on a reference monitor, which used to be a CRT and may now be an LCD). In wet photography, the same can be realized with “hardware” processing, such as printing and developing conditions to map faces onto zone VI of the Adams zone system. However, nowadays there is often a digital intermediate which is worked on. Even cinematographers who love shooting on classical film stock nowadays have available to them a digital video auxiliary stream (which can be very useful in the trend of increased technical filming, in which a lot of the action may, for example, be in front of a green screen). So in summary, apart from taking the actual room conditions at the viewer's side to be a given to be ignored, the whole color capturing system is designed around a “calibrated ideal display”, which is taken into account as a fixed given fact when the content creator creates the images.
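As a rough illustration of the anti-gamma and knee behavior described above, the following Python sketch maps linear scene light to an 8 bit code. The function name encode_luma, the knee parameters, and their default values are hypothetical choices made for this illustration and do not describe any particular camera.

```python
def encode_luma(linear, gamma=0.45, knee_point=0.8, knee_slope=0.2):
    """Map linear scene light (1.0 = nominal white) to an 8 bit code in [0, 255].

    All parameter names and defaults are illustrative assumptions.
    """
    linear = max(0.0, linear)
    # Knee: compress scene values above knee_point instead of clipping them at once.
    if linear > knee_point:
        linear = knee_point + (linear - knee_point) * knee_slope
    # Whatever still exceeds 1.0 after the knee is clipped to white.
    encoded = min(1.0, linear) ** gamma
    return round(255 * encoded)


if __name__ == "__main__":
    for scene_value in (0.0, 0.18, 0.5, 1.0, 1.8, 10.0):
        print(scene_value, encode_luma(scene_value))
```

With these example defaults, scene values up to 1.8 times nominal white are compressed into the code range rather than clipped, at the cost of nominal white itself landing somewhat below code 255. Setting knee_point to 1.0 disables the knee and gives the plain 0.45 anti-gamma curve.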
The problem is that this was already very approximate in those days. The reasoning was along the lines of “if we do a bad job reproducing a scene on photographic paper anyway, we may relax all requirements regarding accuracy, and apply a more subjective definition of the technical mapping from scene to rendering, taking into account such principles as reasonable recognizability of the imaged scenes, consumer appreciated vivid color rendering, etc.” However, this technology of image encoding (e.g., as prescribed in PAL, or MPEG2) should be understood as co-existing with a number of critical questions, like: “what if one changes the illumination of the captured scene, be it the illuminance or the white point, or the spatial distribution, or the special characteristics”, “what about the errors introduced due to differences in illumination of the scene and the viewing environment, especially when seen in the light of a human viewer adapted to the scene vs. viewing environment”, etc.
These problems and resulting errors became aggravated when displays started changing from the standard CRT in a standard living room, to a range of very different displays and viewing environments (e.g., the peak white luminance of displays increased). Note that, as used herein, the phrase “peak white luminance of a display” and the expressions “display white luminance” and “display peak brightness (PB_D)” are interchangeable, with similar meaning.
To further assist with the comprehension of material disclosed herein, the following brief discussion is included. Until a couple of years ago, all video was encoded according to the so-called low dynamic range (LDR) philosophy, also called standard dynamic range (SDR). That meant, whatever the captured scene was, that the maximum of the code (typically 8 bit luma Y′=255; or 100% voltage for analog display driving) should by standardized definition correspond to, i.e., be rendered on, a display with a peak brightness (PB) (i.e., the brightest white color it can render) being by standard agreement 100 nit. If people bought an actual display which was a little darker or brighter, it was assumed that the viewer's visual system would adapt so that the image would still look appropriate and even the same as on the reference 100 nit display, rather than, e.g., annoyingly too bright (in case one has, e.g., a night scene in a horror movie which should have a dark look).
Of course, for practical program making this typically meant maintaining tight control of the scene lighting setup, since even in perfectly uniform lighting the diffuse reflection percentage of various objects can already give a contrast ratio of 100:1. The black of such an SDR display may typically be 0.1 nit in good circumstances, yet 1 nit or even several nits in the worst circumstances, so the SDR display dynamic range (the brightest white divided by the darkest viewable black) would be 1000:1 at best, or worse. This corresponds nicely to such uniformly illuminated scenes, and to an 8 bit coding of all the pixel grey values or brightnesses that need to be rendered, having a gamma of approximately 2.0, or an encoding inverse gamma of 0.5. Rec. 709 was the typically used SDR video coding.
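The dynamic range figures above follow from simple arithmetic, sketched here in Python. The function name dynamic_range is hypothetical, and expressing the ratio in photographic stops (powers of two) is an addition for illustration only; the text itself uses plain ratios.

```python
import math


def dynamic_range(peak_nit, black_nit):
    """Return (contrast ratio, equivalent number of stops) for a display.

    peak_nit is the brightest white and black_nit the darkest viewable black.
    """
    ratio = peak_nit / black_nit
    return ratio, math.log2(ratio)


if __name__ == "__main__":
    # 100 nit SDR white with 0.1 nit black (good circumstances)
    print(dynamic_range(100.0, 0.1))
    # 100 nit SDR white with 1 nit black (worse circumstances)
    print(dynamic_range(100.0, 1.0))
```

For a 100 nit white, a 0.1 nit black gives a ratio of about 1000:1 (just under 10 stops), whereas a 1 nit black gives only 100:1.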
Cameras also typically had problems capturing both very bright and rather dark regions simultaneously, i.e., a scene as seen outside a window or car window would typically be clipped to white (giving red, green and blue additive color components R=G=B=max., corresponding to their square root coded values R′=G′=B′=255). Note that if in this application a dynamic range (DR) is specified, for starters, by a peak brightness of a coding (PB_C) only (i.e., the brightest rendered or renderable luminance), where PB_C corresponds to the peak brightness (PB_D) of a theoretical reference display for optimally rendering the coded lumas as displayed luminances, we assume that the lowest luminance value is pragmatically zero (whereas in practice it may depend on viewing conditions such as display front plate or cinema screen light reflection, e.g., 0.1 nit), and that those further details are irrelevant for the particular explanation. Note also that there are several ways to define a dynamic range (DR), and that the most natural one, typically used in the explanations below, is a display rendered luminance dynamic range, i.e., the luminance of the brightest color versus the darkest one.
Note also, something which has become clearer during HDR research, and is mentioned here to make sure everybody understands it, that a code system itself does not natively have a dynamic range, unless one associates a reference display with it, which states that, e.g., R′=G′=B′=Y′=255 should correspond with a PB of 100 nit, or alternatively 1000 nit, etc. In particular, contrary to what is usually pre-assumed, the number of bits used for the color components of pixels, like their lumas, is not a good indicator of dynamic range, since, e.g., a 10 bit coding system may encode either an HDR video or an SDR video, determined by the type of encoding, and in particular by the electro-optical transfer function (EOTF) of the reference display associated with the coding, i.e., the function defining the relationship between the luma codes [0, 1023] and the corresponding luminances of the pixels, as they need to be rendered on a display.
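To make this point concrete, the following Python sketch decodes the same 10 bit luma codes with two different EOTFs. Both choices are assumptions made for illustration and are not prescribed by this text: a simple power law with gamma 2.0 and a 100 nit reference peak for SDR, and the perceptual quantizer (PQ) curve of SMPTE ST 2084, which has a 10000 nit reference peak, for HDR. For simplicity the codes are normalized over the full range [0, 1023], whereas practical video systems often use a narrower code range.

```python
def sdr_eotf(code, bits=10, peak_nit=100.0, gamma=2.0):
    """Power-law EOTF: luma code -> luminance in nit (illustrative SDR model)."""
    v = code / (2 ** bits - 1)
    return peak_nit * v ** gamma


def pq_eotf(code, bits=10):
    """SMPTE ST 2084 (PQ) EOTF: luma code -> luminance in nit, full-range codes."""
    m1 = 2610 / 16384
    m2 = 2523 / 4096 * 128
    c1 = 3424 / 4096
    c2 = 2413 / 4096 * 32
    c3 = 2392 / 4096 * 32
    v = code / (2 ** bits - 1)
    p = v ** (1 / m2)
    return 10000.0 * (max(p - c1, 0.0) / (c2 - c3 * p)) ** (1 / m1)


if __name__ == "__main__":
    for code in (0, 256, 512, 768, 1023):
        print(code, round(sdr_eotf(code), 2), round(pq_eotf(code), 2))
```

The same maximum code 1023 decodes to 100 nit under the first function and to 10000 nit under the second, so the bit depth alone says nothing about the dynamic range.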
In this text it is assumed that when an HDR image or video is mentioned, it has a corresponding peak brightness or maximum luminance for the highest luma code (or equivalently the highest R′, G′, B′ values, e.g., if an RGB coding is used instead of a YCbCr encoding) which is higher than the SDR value of 100 nit, typically at least 4× higher, i.e., the to be rendered maximum display luminance for having the HDR image look optimal may be, e.g., 1000 nit, 5000 nit, or 10000 nit (note that this should not be confused with the concept that one can encode such an HDR image or video as an SDR image or video, in which case the image is both renderable on a 100 nit display and, importantly, also contains all information, when having corresponding associated metadata encoding a color transformation for recovering the HDR image, for creating an HDR image with a PB of, e.g., 1000 nit!).
So a high dynamic range coding of a high dynamic range image is capable of encoding images with to be rendered luminances of, e.g., up to 1000 nit, so that good quality HDR can be rendered on a display, with, e.g., bright explosions compared to the surrounding rendered scene, or sparkling shiny metal surfaces, etc.
In practice, there are scenes in the world which can have a very high dynamic range (e.g., an indoor capture with objects as dark as 1 nit, whilst simultaneously seeing through the window sunlit objects outside with luminances above 10,000 nit, giving a 10,000:1 dynamic range, which is 10× larger than a 1000:1 DR, and even 100 times larger than a 100:1 dynamic range, and, e.g., TV viewing may have a DR of less than 30:1 in some typical situations, e.g., daylight viewing). Since displays are becoming better (a PB a couple of times brighter than 100 nit, with 1000 nit currently appearing, and several thousands of nits PB being envisaged), a goal is to be able to render these images beautifully, and although not exactly identical to the original because of such factors as different viewing conditions, at least very natural, or at least pleasing.
The reader should also understand that because a viewer is typically watching the content in a different situation (e.g., sitting in a weakly lit living room at night, or in a dark home or cinema theatre, instead of actually standing in the captured bright African landscape), there is no identity between the luminances in the scene and those finally rendered on the TV (or other display). This can be handled inter alia by having a human color grader manually decide on the optimal colors on the available coding DR, i.e., of the associated reference display, e.g., by prescribing that the sun in the scene should be rendered in the image at 5000 nit (rather than its actual value of 1 billion nit). Alternatively, automatic algorithms may do such a conversion from, e.g., a raw camera capture to what will be generically referred to herein as a master HDR (M_HDR) grading. This means one can then render this master grading on a 5000 nit PB HDR display, at those locations where such a display is available.
At the same time however, there will for the coming years be a large installed base of people having a legacy SDR display of 100 nit PB, or some display which cannot make 5000 nit white, e.g., because it is portable, and those people need to be able to see the HDR movie too. So there needs to be some mechanism to convert from a 5000 nit HDR to a 100 nit SDR look image of the same scene.