The newly emerged field of High Dynamic Range (HDR) imaging contrasts itself with legacy systems, which nowadays by comparison we can call low dynamic range (LDR) imaging (and which comprise image or video encoding systems such as PAL or MPEG2, AVC, HEVC or another member of the MPEG family, or similar video standards like e.g. VC1, VC2, etc., or JPEG for still pictures, etc.).
When talking about HDR, we need to look at the various components of the chain. As this is a very recent area of technology where perhaps not everybody is on the same page, we want to quickly set a reference mindset with a couple of important definitions, to avoid misunderstandings. Ultimately there is the rendering dynamic range, which the display medium can generate. Rendering dynamic range is usually defined as RDR = brightest_white_luminance / darkest_black_luminance of all pixels in an image (intra-picture RDR) or of pixels in successive images (inter-picture RDR, e.g. when the display is (nearly) switched off, and one only sees the reflection of the surrounding environment on the front glass). However, it is more meaningful when also coupled to a peak_white value (i.e. the brightest_white_luminance value). LDR renderers usually lie in or around a range defined by a peak_white of 100 nit and a dynamic range of around 100:1. That is what a CRT display might have produced, where of course the darkest_black_luminance strongly depends on the viewing environment illumination, so one may go for 40:1 to be on the safe side, and even 2:1 can be a practical dynamic range when one views images on a display under the sun. The viewing environment, which conditions the brightness adaptation of the human viewer, is related to that, e.g. typically 20% of peak_white. Several standards of EBU, SMPTE etc. specify how one should grade a video signal so that it can be used in a standard way, e.g. so that it is optimal if shown in the prescribed viewing environment. By grading we mean producing an image with changed pixel colors, which are changed/specified according to some preference. E.g., a camera can automatically grade a RAW camera picture (which is just dependent on the camera specifics as a linear luminance measuring instrument) given a rendering intent into a directly usable display-referred encoding, with which one can steer e.g.
such a CRT display under reference conditions so that it will show a neat picture to the viewer.
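The RDR definition above can be made concrete with a small numerical sketch. The function name and the example luminances below are illustrative assumptions, as is the simplification that light reflected from the front glass simply adds to both the brightest white and the darkest black.

```python
def rendering_dynamic_range(peak_white_nit, display_black_nit, reflected_ambient_nit=0.0):
    """Ratio brightest_white_luminance / darkest_black_luminance as seen by the viewer.

    Assumption: ambient light reflected on the front glass adds equally
    to the white and to the black luminance.
    """
    white = peak_white_nit + reflected_ambient_nit
    black = display_black_nit + reflected_ambient_nit
    if black <= 0:
        raise ValueError("darkest black luminance must be positive")
    return white / black


# Hypothetical 100 nit display with a 1 nit native black:
dark_room = rendering_dynamic_range(100.0, 1.0)          # 100:1 without reflections
living_room = rendering_dynamic_range(100.0, 1.0, 1.5)   # roughly 40:1
under_sun = rendering_dynamic_range(100.0, 1.0, 98.0)    # 2:1
```

With these assumed values the same display moves from about 100:1 to about 40:1 and finally 2:1 purely through the viewing environment, in line with the ranges quoted above.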
Oftentimes grading by a human involves more artistic choices. E.g. the grader wants to make the color of a plant a nice purplish color, but this needs to be specified under reference conditions (both of the display technology and the viewing environment, and in theory also other conditions affecting the state of the viewer like medication intake, but of course one typically largely ignores those), because a particular display may make this color more bluish, in which case the desired artistic effect (of creating a beautiful picture) may be gone. It is not typical that a camera automatically creates the optimal kind of purple, which is why the grader does that with image-processing software. Such a grader can be a photographer, a visual artist working on a movie, or even somebody working on a (potentially even live) television program. Of course various applications will have various degrees of grading complexity linked to the desired technical and/or artistic quality for those applications. Typically the above standards prescribe that a grading shall be done on a reference monitor of around 100 nit in a reference environment. The question is then how a color will be rendered and perceived in practice. Graphics artists for printed press publications also generate their work under reference conditions to have some common ground, and avoid needless sources of error e.g. at the printer's. However, that of course doesn't mean that each reader of the book or magazine will read it under a calibrated D50 lamp; rather, he may perceive duller colors when reading in his bed under bad illumination. The same happens when a movie or television program, or a consumer photo, is shown on a non-reference display from among the many different displays that are available nowadays. E.g., the image (grading) may be shown on a 500 nit peak_white display.
What happens then is that one brightens all pixel colors by at least linear stretching, which occurs by driving the display with the grading, i.e. mapping maximum white (e.g. value R=G=B=255) to the peak_white of the display (of course there may be further brightness deformation for the various image pixel colors if the display has a special native electro-optical transfer function EOTF, but usually that is handled internally to make the display behave like a brighter version of a reference CRT, i.e. with a display gamma of around 2.5).
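As a minimal sketch of this linear stretching, the snippet below models a display as a pure power law with the gamma of around 2.5 mentioned above, scaled by its peak_white. The function name and the pure power-law model (without black offset or ambient term) are simplifying assumptions made for illustration.

```python
def rendered_luminance(code, peak_white_nit, gamma=2.5, max_code=255):
    """Luminance in nit produced for an 8 bit code value on an idealized
    display that behaves like a (brighter) reference CRT."""
    return peak_white_nit * (code / max_code) ** gamma


# The same grading shown on the 100 nit reference display and on a 500 nit display:
for code in (64, 128, 255):
    reference = rendered_luminance(code, 100.0)
    brighter = rendered_luminance(code, 500.0)
    print(code, round(reference, 2), round(brighter, 2), round(brighter / reference, 2))
```

Under this model every pixel luminance is multiplied by the same factor (here 5), which is exactly the linear stretch: relative contrasts are preserved, but all absolute luminances rise.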
Now such standardized LDR gradings (produced in a reference environment, inter alia on a 100 nit reference display) can be used (i.e. look reasonably good, i.e. still reasonably similar to how they would look under reference conditions) on a range of display and/or environment conditions around the reference display system (i.e. 100 nit peak_white etc.). This is because most humans are not overly critical about the exact (absolute) look of colors since the brain works relatively (e.g. depending on the criteria for allowability, face colors, which are among the more critical colors, may vary from palish almost white to quite orangeish, etc., before the less critical larger part of the population starts to object), but also because for many objects nobody knows what the original colors in the scene were. Partially this is also so because LDR scenes are made with an “around the average” object color strategy (which is realized inter alia with well controlled studio lighting, maybe not always so anymore with the various on-the-fly content we have now), which means all colors are vivid, one may even brighten the image somewhat to above the 18% level, with some shadows but not too deep or important etc., and that reproduces both physically and psychologically rather well on various systems. It is e.g. how naïve painters work before they discover such complex issues as chiaroscuro etc. So depending on the quality criterion defining acceptable similarity, the LDR_100 nit grading may be used e.g. on displays from 30 nit up to 600 nit, and viewing environments from 3× less bright to 5× more bright. The latitude for using a grade can be increased by modifying it with a so-called display transform.
The brightness of a display and surround (related to the Stevens effect and the Bartleson-Breneman effect) can be corrected to a reasonable degree far more easily than issues related to display gamut constraints, and one typically can process the picture with gamma functions or similar. E.g. when moving a display from a dim surround to a dark surround (or in fact switching off the cozy living room viewing lights), one changes the extra gamma from 1.25 to 1.5, i.e. one uses the residual gamma to increase the contrast of the rendered images, because human vision is more sensitive in the dark and hence perceives the blacks of the rendered image as more grayish, which amounts to a reduction in perceived contrast that has to be compensated. A similar LDR technology is printing. There of course one does not have a priori control over the surround illuminance determining the peak white of the print, but at least, just as with all reflective objects, the white-black RDR is about 100:1 (depending on paper quality, e.g. glossy vs. matte, inks, etc.).
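The surround compensation described above can be sketched as an extra power function applied to relative luminances (0 = black, 1 = peak white). The table of gamma values uses the two figures from the text; the function names and the choice of a pure power law are illustrative assumptions.

```python
# Extra (residual) gamma per surround condition, values as quoted in the text.
SURROUND_GAMMA = {"dim": 1.25, "dark": 1.5}


def apply_surround_gamma(relative_luminance, surround):
    """Apply the extra gamma for the given surround to a relative luminance in [0, 1]."""
    return relative_luminance ** SURROUND_GAMMA[surround]


def rendered_contrast(relative_black, surround):
    """White-to-black ratio after the surround gamma, for a given relative black level."""
    return 1.0 / apply_surround_gamma(relative_black, surround)


# A black at 1% of peak white is pushed further down in a dark surround:
dim_contrast = rendered_contrast(0.01, "dim")    # about 316:1
dark_contrast = rendered_contrast(0.01, "dark")  # about 1000:1
```

The larger exponent stretches the rendered contrast, which offsets the lower contrast perceived by a viewer adapted to a dark surround.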
A complication arises when one needs to reproduce an image of a scene with huge dynamic range, and typically also scene conditions very unlike the rendering conditions. E.g. in a night scene the eye may be looking at a scene dynamic range SDR between car lights of 100000 nit (or e.g. even more for a high pressure sodium or mercury lamp in the scene) versus dark regions in shadows of fractions of a nit. Even in daylight, where it may be more difficult to create dark shadows from the all-pervasive illumination, indoors it may typically be 100× darker than outdoors, and also dark clouds, forest cover, etc. may influence needed luminances (whether captured or to be rendered), if not in intra-scene, then at least in inter-picture i.e. temporally successive reproduction. Quotes for the “native dynamic range” of human vision vary between 10000:1 and 100000:1 and even 1000000:1, because this depends of course on the conditions (e.g. whether one needs to see a darker small region in the brights, or vice versa whether one can see some bright small object in the dark, be it perhaps partially rhodopsin-bleaching; whether one considers an amount of glare discomforting, etc.; and then there is of course also a psychological factor [taking into account such things as importance of certain objects, their perfect or sufficient visibility, emotional impact on the viewer, etc.], leading to the question how much of that should be rendered on a display [e.g. a viewer may quickly discard an area as “just black” without caring which black exactly], given that the viewer is in a totally different situation anyway [not really on holiday, or not really interrogated by a police officer shining a light in his face], but one wants a certain amount of realism which may further be a trade-off with other factors like e.g. power consumption, so one could pragmatically in fact define several human vision dynamic ranges, e.g.
one for a certain type of real scene viewing, and one for television viewing). E.g. if one is adapted to the dark night sky, but sees the moon in the corner of the eye, that has less influence on how the rods in other places of the retina can see the faint stars, i.e. “simultaneous” viewable dynamic range will be high. Conversely when the eye is bathed in strong daylight (over a large area of its field of view) it is more difficult to discriminate the darker colors in a darker interior seen and illuminated through a small hole or window, especially if a bright source is adjacent to that dark area. Optical systems will then show several glare phenomena. Actually the brain usually may not even care about that dark interior, and just call all those colors psychological blacks. As another example of how the leakage of light influences and determines scene dynamic range from the perspective of a human viewer, consider a badly illuminated dark bush in the night behind a light pole. The lamp on the light pole creates a light scattering profile on the scratches of the glasses of the viewer (or if he doesn't wear glasses the irregularities in his eye lens, e.g. submicron particles, water between cells, . . . ), in particular as a halo around the lamp which reduces the discrimination possibility of the dark colors of the bush behind it. But when the viewer walks a couple of seconds the lamp moves behind him outside the capturing zone of the eye lens, and the eye can quickly adjust to find the predator lurking in the dark.
So however one defines the useful dynamic range of a scene for encoding and rendering for human consumption (one may even consider encoding not only the intra-picture luminances with a global lightness scaling factor, but the actually occurring luminances from a sunny tropical environment to the darkest overcast night), it is clear that far more than 100:1 is needed for faithful or at least plausible rendering of these environments. E.g. we desire our brightest object on a display for dim surround to be around 10000 nit, and our darkest 0.01 nit (or at least 0.1 nit), at least if we could e.g. dim the lights in case we have fully or mostly dark scenes in the movie or image(s).
This is where HDR comes in. Also, when one captures such a scene, a very complex mathematical mapping is needed to approximate it (or even be able to render it) on an LDR display (this in fact oftentimes not really being possible). E.g. some HDR-to-LDR mapping algorithms use local adaptation to more or less equalize out the illumination field, leaving in the LDR rendering mostly an impression of the object reflections, i.e. colors. In view of the leakage (multiple reflection, scattering, etc.) of light from brighter to darker parts of a scene it is not easy to create extremely high dynamic range scenes, but an illumination difference of 100:1 can easily be achieved in many practical situations. E.g. an indoors scene may have (of course dependent on depth of the room, size and position of the windows, reflectivity of the walls, etc.) a fraction or multiple of about 1/100th of the outdoors (il)luminance (which is also how the daylight factor of building lighting is defined). Higher SDRs can be obtained when watching a sunny outdoors from within a cave through a small crack, etc. Also on the display rendering side, a HDR range starts where one starts seeing new appearance concepts. E.g., on bright displays like a 5000 nit SIM2 display, one can, given the right (rightly graded) input pictures, realistically render the impression of real switched-on lamps, or real sunny landscapes. In distinction with the above LDR range, we may typically say that HDR starts for normal television living room viewing conditions from around a 1000 nit peak_white and above, but more precisely this also depends on the exact viewing conditions (e.g. cinema rendering, although with a peak_white of 50 nit, already shows quite some HDR appearances). To be even more precise, in view of eye and brain adaptation the HDR-ish look in numerical detail would also depend somewhat not just on the physical luminances but also on the image content, i.e. the chosen grading.
But in any case there is a clear distinction between LDR rendering, which mainly shows a dull, lightless version of the scene, as if it were illuminated nearly homogeneously and just showing the object reflectances, and HDR, in which a full lighting field appearance is superimposed. If you can then render reasonable blacks, e.g. 1 nit or below, you can indeed get above an LDR contrast range of k×100:1, where k is typically 2-3 (which under a particular paradigm of near-similar, i.e. with only perhaps a small contrast stretch, relative rendering of the displayed luminances compared to the scene luminances would correspond to a similar DR in the scene). On the high end of brightnesses it is partly a matter of taste where the brightness should end, in particular where further brightness only becomes annoying. We found that to grade several kinds of HDR scene 5000 nit is still somewhat on the low end, in particular when having to deal with further display limitations like backlight resolution. In experiments we found that one can definitely go to 10000 nit in dark viewing without the brightness getting superfluous or irritating (at least to some viewers). Going above 20000 nit peak_white, it may be a practical technical design consideration what to render true-to-life luminance-wise, and what to approximate, giving at least a brightness appearance. Note that one typically should not always drive such a bright display at maximum brightness; rather, to create an optimal HDR experience one should only use the brightest rendering at certain places and times, conservatively, and also well-chosen as to their temporal evolution. One should not only focus on intra-picture DR, but also on how different brightness environments are to be rendered in succession, taking human visual adaptation into account.
Another dynamic range is the camera dynamic range CDR, which is just (given the exposure settings) determined by the full well of the pixel's photodiode, and the noise on the dark side. When using tricks like multiple exposure or differently exposable pixel arrays (e.g. in 3 chip cameras), the CDR becomes limited by the optics (e.g. lens scattering, reflection on the lens or camera body, etc.), but also this can be improved by suitable computational imaging techniques which try to separate the real illumination from dark scene regions from erroneous irradiation due to stray light. Of course when the source of the image is a computer graphics routine (like e.g. in special effects or a gaming application) one can easily create HDR far beyond those limitations. We will ignore the CDR, and just assume it is either very high, or perhaps a limiting factor but in a system which is supposed to handle situations of very high originals. In particular, when we introduce clipping we will assume it is not due to a low quality camera capturing, but due to a practical handling of some other limitations in the entire imaging chain, like the inability of a display to render very bright colors.
Now apart from the display environment RDR, which does actually generate the right photon distribution to stimulate the viewer into the right sensation (be that also dependent on the adaptation state of that viewer), when talking about handling or coding HDR, there is another interesting aspect, which can also be summarized in a dynamic range, which we shall call coding dynamic range CODR. A couple of thought experiments should clarify this important concept. Suppose we were to draw on a bright back-illuminated white panel with a highly absorbing black marker, so that we would get a transmission of 1/16000th of the surrounding white of the panel (and assuming the surrounding room and viewer are perfectly absorbing objects). In the linear bits world (by which we mean that we linearly represent all values between say 0 and 2^B, where ^ is the power operation and B the number of bits) of e.g. the camera capturing (its ADC) we would hence need 14 bits for representing this signal. However, as this codec would waste a lot of codes for values which don't occur anyway, we can say that to faithfully represent that particular signal, we theoretically only need a 1-bit encoding. We would give black the code 0, and white the code 1, and then convert them to whatever actual luminance they correspond to. Also note that a display need not in fact render those values with exactly the same luminances as in the scene. In fact, since this signal may look no better (psychologically and semantically) than a lower DR equivalent thereof (actually such a high contrast black and white drawing may even look weird), we might as well render it on a display with values 1 nit and 2000 nit. We see here for the first time an interesting distinction which is important when talking about HDR encoding: the difference between physiological and psychological (or semantic) dynamic range. Human vision consists of two parts, the eye and the brain.
The eye may need as a precursor the appropriate physiological dynamic range PDR to appropriately stimulate cones and/or rods (and thereby ganglion cells etc.), but it is ultimately the brain that determines the final look of the image or scene (psychological dynamic range PSDR). Although it doesn't quite give the exact impression of a very luminous region, painters like Petrus Van Schendel can play on the PSDR psychological principles to emulate in an LDR medium high dynamic range scenes like e.g. a fire in a dark night cityscape. This is also what complex gamut mapping algorithms try to do when preconditioning a HDR image for rendering on an LDR display. But the other side of this principle is that some scenes will look more HDR-ish even on a HDR display than others (e.g. a sunny winter landscape with pale dried shrubs and some trees in the back may look high brightness but not so HDR). For HDR actions, like e.g. turning a bright lamp towards the viewer, psychological emulations are usually not so convincing as the real bright rendering of the regions.
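The bit counts in the marker-on-panel thought experiment can be checked with a short calculation. The helper names below are hypothetical; the first counts the bits a linear code needs to span a given contrast ratio, the second counts the bits needed merely to label the distinct values that actually occur.

```python
import math


def linear_bits_needed(contrast_ratio):
    """Bits B such that a linear code covers the ratio between the darkest
    non-zero value (1 unit) and the brightest value (contrast_ratio units)."""
    return math.ceil(math.log2(contrast_ratio))


def bits_for_levels(n_levels):
    """Bits needed to give every actually occurring level its own code."""
    return max(1, math.ceil(math.log2(n_levels)))


print(linear_bits_needed(16000))  # 14, since 2^13 = 8192 < 16000 <= 2^14 = 16384
print(bits_for_levels(2))         # 1, only "black marker" and "white panel" occur
```

The gap between 14 and 1 is the point of the example: the physical contrast of a signal and the amount of code space needed to represent its distinguishable appearances are two different quantities.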
Consider along the same lines now a second example: we have an indoors scene with luminances of say between 5 nit and 200 nit, and an outdoors scene with luminances of say between 1500 and 20000 nit. This means that again we have two luminance histograms separated by non-existing codes. We may natively encode them on a range of say 16 linear bits (the maximum code e.g. corresponding to 32768 nit), although it would be preferable to use some non-linearity to have enough accuracy in the blacks if there's not too much capturing noise. But we could also encode this in a different way. E.g. we could sacrifice 1 bit of precision, and divide an 8 bit nonlinear JPEG luma range in two adjacently touching parts, the lower one for the darker part of the above scene, and the upper one for the lighter (one may not want to cut exactly in the middle in view of the non-linear JND allocation). If one is concerned about loss of precise detail when having fewer bits, one may consider that it may often be better to use available bits instead for HDR effects. Such an allocation would typically correspond to a shifting and (non-linear) stretching of the luminance (L) values of the input RAW capturing to the 8 bit luma (Y) values. Now one can again ask oneself the question of what the dynamic range of such a scene is, if it can be “arbitrarily” compressed together or stretched apart (making the brighter outside even brighter, at least until this becomes e.g. unrealistic), at least in post-processing for rendering. Here the concept of different appearances can help out. We have in both subhistograms a number of different luminance values for different pixels or regions, which presumably are mostly or all relevant (if not, we don't need to encode them, and can e.g. drop one or more bits of precision). Also the separation (e.g. measured as a difference in average luminance) of the two histograms when ultimately rendered on a display has some appearance meaning.
It is known that human vision to some extent discounts the illumination, but not entirely (especially if there are two brightness regions), so one needs to render/generate those eye inputs to at least a certain extent. So working with meaningful different color (or at least brightness or lightness) appearances of pixels or objects in a renderable scene (e.g. when rendered in the best possible display scenario) gives us an insight about the coding dynamic range CODR, and how we hence need to encode HDR images. If the image has many different appearances, it is HDR, and those need to be present somehow in any reasonably faithful encoding.
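The two-part allocation of the 8 bit luma range from the indoors/outdoors example can be sketched as follows. The cut exactly in the middle (codes 0-127 and 128-255) and the logarithmic allocation inside each part are assumptions chosen for simplicity; as noted above, a practical allocation need not cut in the middle, and the function names are hypothetical.

```python
import math

# (lowest nit, highest nit, first code, last code) for each sub-histogram.
INDOOR = (5.0, 200.0, 0, 127)
OUTDOOR = (1500.0, 20000.0, 128, 255)


def encode_luma(luminance_nit):
    """Map a scene luminance to an 8 bit luma code, skipping the unused gap."""
    for lo, hi, code_lo, code_hi in (INDOOR, OUTDOOR):
        if lo <= luminance_nit <= hi:
            t = math.log(luminance_nit / lo) / math.log(hi / lo)
            return code_lo + round(t * (code_hi - code_lo))
    raise ValueError("luminance lies outside both encoded sub-ranges")


def decode_luma(code):
    """Map an 8 bit luma code back to a luminance in nit."""
    lo, hi, code_lo, code_hi = INDOOR if code <= 127 else OUTDOOR
    t = (code - code_lo) / (code_hi - code_lo)
    return lo * (hi / lo) ** t


for nit in (5.0, 50.0, 200.0, 1500.0, 10000.0, 20000.0):
    code = encode_luma(nit)
    print(nit, code, round(decode_luma(code), 1))
```

No codes are spent on the 200 to 1500 nit gap, so each sub-histogram gets 7 bits of (relative) precision. How far apart the two decoded ranges end up on a display is then a rendering decision, separate from the encoding itself.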
Since classical image or video encoding technologies (e.g. PAL, JPEG, etc.) were primarily concerned with rendering mostly the object (reflection) lightnesses in a range of 100:1 under originally relatively fixed viewing conditions (a CRT in a home environment, and not an OLED in the train, or the same consumer having in his attic a dedicated dark cinema room with on-the-fly dynamically controllable lighting, which can adjust to the video content), those systems encoded the video in a rather fixed way, in particular with a fixed universal master encoding gamma which mimics the brightness sensitivity of the eye, like e.g. V_709 = 1.099 L^0.45 - 0.099, which is approximately a square root function. However, such systems are not well-adapted to handle a vast range of CODRs. In the last couple of years there have been attempts to encode HDR, either in a native way by scene-referred linear encoding of all possible input luminances, like in the OpenEXR system (F. Kainz and R. Bogart: http://www.openexr.com/TechnicalIntroduction.pdf), or with 2-layer systems based on the classical scalability philosophy. The latter need at least two images: a base image which will typically be a legacy-usable LDR image, and an image to reconstruct the master HDR image(s). An example of such is US2012/0314944, which needs the LDR image, a logarithmic boost or ratio image (obtained by dividing the HDR luminances by the LDR luminances obtained after suitably grading an LDR image for LDR rendering systems), and a color clipping correction image per HDR to-be-encoded image. With a boost image one can boost all regions (depending on subsampling) from their limited range to whatever luminance-position they should occupy on the HDR range. Note that for simplicity we describe all such operations in a luminance view, since the skilled person can imagine how those should be formulated in a luma view of a particular encoding definition.
Such multi-image systems will, at least in the coming years, be somewhat cumbersome, since they need seriously upgraded (de)coding ICs in existing apparatuses, as the handling of further images in addition to the LDR image is required.
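The boost (ratio) image idea can be illustrated with a generic sketch, which is not the specific method of the cited application. It works on luminances stored as nested lists; the function names, the base-2 logarithm, the tiny example images and the `eps` guard against division by zero are all assumptions, and subsampling and the color clipping correction image are left out.

```python
import math


def make_boost_image(hdr, ldr, eps=1e-6):
    """Per-pixel logarithmic ratio of HDR to LDR luminance."""
    return [[math.log2(max(h, eps) / max(l, eps)) for h, l in zip(hdr_row, ldr_row)]
            for hdr_row, ldr_row in zip(hdr, ldr)]


def reconstruct_hdr(ldr, boost):
    """Boost every LDR pixel back to its luminance position on the HDR range."""
    return [[l * 2.0 ** b for l, b in zip(ldr_row, boost_row)]
            for ldr_row, boost_row in zip(ldr, boost)]


# Hypothetical 2x2 luminance images in nit (master HDR and its LDR grading).
hdr = [[0.5, 50.0], [200.0, 4000.0]]
ldr = [[0.5, 20.0], [60.0, 100.0]]

boost = make_boost_image(hdr, ldr)
restored = reconstruct_hdr(ldr, boost)
```

The sketch makes the cost visible: besides the LDR image, a decoder has to receive and process a second full image per picture, which is the additional handling referred to above.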
Recently, and as described in WO2013/046095, we have developed a way to improve the classical video encoding so as to be able to encode high dynamic range images (preferably with minor modifications, preferably with mostly metadata to apply transformations relating two gradings of the same scene for two very different rendering conditions, such as e.g. allowing to transform an encoded LDR grading into a HDR grading or vice versa, and perhaps with some variants having room to store in the metadata a couple of additional small pictures to do a final tuning if such a further modification is desired, e.g. an additive or multiplicative correction on a small region containing an object like e.g. a very brightly illuminated face in one shot or scene of the movie, in which the corrective factors per pixel may then be encoded e.g. in 200 images of 120×60 pixels to be mapped onto the pixel positions of the current HDR reconstruction by color transformation, or even some subsampled representation of those small corrective images, to be applied as coarse finetuning mappings, described as images). In this system typically a human grader can determine an optimal mapping function from the input HDR image (master HDR grading) to an LDR encoding of e.g. 8 or 10 bit (or 12 or in principle another value for at least the luma codes, but this value being typically what is reserved for “classical” LDR image encoding), which can be encoded through classical video compression (DCT etc.), the optimal mapping function (e.g. a gamma function or similar with optimal gamma coefficient, linear part etc., or a multi-segment function like e.g. an S-curve etc.) typically depending on what the content in the master HDR was (e.g. a dark background, with a very brightly lit region), and how it will be rendered in LDR conditions. We call this simultaneous encoding of an LDR and HDR grading, by mapping the HDR grading into a legacy-usable LDR image, an LDR-container encoding of HDR.
We wanted to make sure that this technology was backwards compatible, in that the so-generated LDR image gives reasonable results when rendered on e.g. a legacy LDR system (i.e. the picture looks reasonably nice, and if not perfect, typically not such that too many people will consider the colors of some objects all wrong). If one accepts somewhat of a diminution of precision, our system can even encode HDR scenes or effects on legacy 8 bit systems. With reasonable results we mean that the LDR rendered images, although perhaps not the best one theoretically could achieve artistic look-wise, will be acceptable to a content creator and/or viewer, this depending of course on the application (e.g. for a cheaper internet-based or mobile service quality constraints may be less critical). At least the LDR grading will give good visibility of most or all objects (at least the objects of main importance for the story of the image or video) in the imaged scene when rendered in an LDR system of properties not deviating much from standardized LDR rendering. On the other hand, for HDR displays, the original master HDR can be closely approximated by mapping from the LDR image to the reconstructed HDR image with the inverse of the co-encoded (invertible) mapping function. One can define such an approximation with a mathematical tolerance, e.g. in terms of just noticeable differences (JNDs) between the original master HDR that was input and its reconstruction. Typically one will design any such system by testing for a number of typical HDR scenes, actions, and further situations how different the reconstructed HDR looks (and whether that is still acceptable for certain classes of users, like e.g. television or movie content creators) and validating therefrom a class of operations like particular gamma mappings within certain parameter ranges. This guarantees that a certain quality of the approximation can always be achieved.
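A minimal sketch of such an invertible mapping and its reconstruction is given below, assuming a single gamma function as the mapping, with the gamma value and the HDR peak luminance carried as metadata. The function names, the 5000 nit peak and the gamma of 3.0 are hypothetical choices; a real grading could equally use a multi-segment curve.

```python
def hdr_to_ldr_code(hdr_nit, hdr_peak_nit=5000.0, gamma=3.0, bits=8):
    """Map a master HDR luminance to an integer LDR luma code."""
    max_code = 2 ** bits - 1
    normalized = min(max(hdr_nit / hdr_peak_nit, 0.0), 1.0)
    return round(max_code * normalized ** (1.0 / gamma))


def ldr_code_to_hdr(code, hdr_peak_nit=5000.0, gamma=3.0, bits=8):
    """Reconstruct the HDR luminance with the inverse of the mapping above."""
    max_code = 2 ** bits - 1
    return hdr_peak_nit * (code / max_code) ** gamma


def relative_error(hdr_nit, bits):
    """Relative reconstruction error caused by quantizing to the given bit depth."""
    restored = ldr_code_to_hdr(hdr_to_ldr_code(hdr_nit, bits=bits), bits=bits)
    return abs(restored - hdr_nit) / hdr_nit


for bits in (8, 10):
    print(bits, relative_error(1000.0, bits))
```

A tolerance check of the kind mentioned above would compare such reconstruction errors against a visibility threshold (e.g. a JND model, not included here) over a set of test scenes and over the allowed range of gamma values.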
It is an object of the below presented technologies to give the grader even more versatility in defining at least two gradings, LDR and HDR.