Recently image capturing, displaying and in particular encoding have improved from so-called low dynamic range (LDR) imaging (such as classical systems like PAL or MPEG2) to so-called high dynamic range (HDR) imaging. Sensors nowadays either have a higher native signal voltage range (between the scene luminance that saturates, or at least gives the maximally allowed pixel voltage, and the minimum, or alternatively typical noise level), or they have techniques for stretching that sensor range by composing resultant images from multiple images, e.g. from spatial systems with different sensitivity, or successive pictures with different exposure settings. The difference with LDR camera capturing is that an LDR camera typically clips and/or soft clips some areas, like the bright luminances outside becoming white (the luma Y of those stored encoded LDR image pixels being 255), whereas a HDR capturing system can reasonably faithfully capture all luminances in the scene. It is then still a question what to do with them, i.e. how to encode them for e.g. transmission over a television network system, and how to (faithfully, or in a preferred way or at least acceptably) render them on e.g. a HDR display, which has a higher peak brightness than the typical peak brightnesses of LDR displays (e.g. 3000 nit, instead of 100 or 500 nit).
Since the look of a rendering of a picture depends on many variables such as i.a. the contents of the picture, the kind of display rendered on (such as its peak brightness), and the viewing environment, typically the captured raw sensor picture (which may be tightly related to the original scene, but has absolutely no relation to the final rendering environment, so contains no information regarding how a human will see these two scenarios) undergoes a transformation of its pixel colors, which is called a grading. Typically this is done by a human grader. E.g. in a movie production it may be difficult to accurately light a house interior (also given timing and pricing constraints), let alone create thunder clouds of a particular greyness pattern. The scene lighting crew may then go for an approximately correct lighting, which at least creates “enough” or “the right amount of” light everywhere, and may position the practicals (e.g. atmosphere lighting like candles on a table (or something simulating that), a neon billboard, etc.). But a grader then improves upon that in image processing software, e.g. he may draw sunbeams as if they fell through the window in the actual scene.
LDR encoding had another property characterizing it. Naively one may think LDR is just an encoding in which the lumas have an 8 bit code word per pixel (or similar embodiments of course), and vice versa that 8 bit means LDR. But in theory one could encode anything in those image arrays of 8 bit codes, so very complex patterns could be encoded, at least in theory, so why not HDR images?
The issue was, and that is partially the legacy of a long historical tradition, that the sensor voltages (i.e. linear representations of the scene luminances) were encoded into the 8 bit code words according to a particular code mapping function. This was a simple, not too non-linear, monotonic and continuous function, namely a gamma 2.2. The idea was that this tight linking of capturing, coding and rendering through such a direct connection system would amount to doing the grading correctly almost automatically. The signal was directly applied to the cathodes of a CRT display, and it was due to this CRT physics that the gamma 2.2 was chosen (which incidentally also gave a reasonably uniform psychovisual brightness scale to work with). If there was only a single type of display, it would correctly render the driving values into output luminances, provided it was driven by the LDR signal. And that signal was automatically pregraded with a compensating gamma, namely approximately 1/2.2, straight from the camera. But also, should any grading artist on the creation side want to fine-tune or improve the pixel colors, he would do so while watching the signal on exactly the same kind of CRT on the creation side, so the consumer home t.v. would give approximately the same rendering (apart from surround effects on the viewer), because it was driven by that same corrected image.
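By way of illustration only, the closed LDR chain described above may be sketched as follows. The sketch assumes a pure power law with exponent 2.2 and relative (0 to 1) luminances; the function names are hypothetical, and real camera and CRT transfer curves deviate from this idealization.

```python
GAMMA = 2.2  # display gamma named in the text; a pure power law is assumed here


def encode_luma(linear, gamma=GAMMA, max_code=255):
    """Map a relative linear luminance to an 8 bit code with a 1/gamma pre-correction."""
    clipped = min(max(linear, 0.0), 1.0)  # an LDR camera clips out-of-range values
    return round(max_code * clipped ** (1.0 / gamma))


def display_luminance(code, gamma=GAMMA, max_code=255):
    """Idealized CRT response: relative output luminance for a given 8 bit code."""
    return (code / max_code) ** gamma


# A mid-grey of 18 % relative luminance survives the round trip up to quantization.
mid_grey_code = encode_luma(0.18)
round_trip = display_luminance(mid_grey_code)
```

Because the camera-side exponent 1/2.2 and the display-side exponent 2.2 cancel, the rendered luminance is proportional to the captured one, which is why encoding and rendering coincided in this chain.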
In any case, this LDR encoding chain functioned as a closed specification, in which rendering and encoding (or grading) amounted to the same thing. Nowadays, having very different displays, like an LCD at home, an iPad for watching image content on the train, a home projector, and recently very high brightness HDR displays, necessitates that rendering or gamut mapping should be a phase totally separate from image encoding, since given the same input images, these displays will show quite a variation among their output looks, which may be more severe than desirable.
But in any case, on the content creation side, e.g. between camera and encoding, this tight link was still followed in LDR systems. Although modern consumer cameras (especially now that they have started incorporating HDR functionality) may use a more sophisticated code mapping function than a gamma 2.2, they still have relatively similar functions, which are not highly non-linear, i.e. not so different that we cannot approximate many aspects of their mathematical behavior with a linear analysis.
In particular this is seen when a scene of higher luminance range has to be captured, such as e.g. a person sitting in a car. A combination of factors such as the exposure of the person's face and the code mapping function (e.g. an S-curve) typically leads to the fact that, if one exposes well for the interior of the car, the outside can only be represented with pastellish colors near the upper boundary of the code gamut, i.e. with lumas near 255. That is because the camera or cameraman e.g. chooses to have the face color code mapped near average grey, let's say for simplicity value 128. If we approximate the relation between code value and scene luminance around this value by a square function, then value 255 can only represent outside luminances up to about 4× higher than that of the face. Of course the actual values will depend on how smartly the camera system (including the human operator's choices) handles such bright regions, and an appropriate shoulder in the code mapping may still at least allocate different code values to scene luminances higher than 4× the luminance of the face (although it must also be said that in reality quite some of the content quickly produced when shooting on location without much preparation clips a significant part of the image to 255, and it is questionable whether that is so desirable).
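The factor of about 4 can be checked with a short calculation. The sketch below assumes, as in the text, that luminance is proportional to a power of the code value (exponent 2 for the square approximation); the function name and the alternative exponent 2.2 are illustrative assumptions only.

```python
def luminance_headroom(anchor_code, max_code=255, exponent=2.0):
    """Ratio of the luminance coded at max_code to that coded at anchor_code.

    Assumes luminance is proportional to code ** exponent (no shoulder or S-curve).
    """
    return (max_code / anchor_code) ** exponent


# Face mapped to code 128: how much brighter can the outside be before clipping?
headroom_square = luminance_headroom(128)                 # square approximation
headroom_gamma = luminance_headroom(128, exponent=2.2)    # pure gamma 2.2
```

Under the square approximation the headroom is (255/128)², i.e. slightly under 4; with a pure gamma 2.2 it is only marginally larger, which is why bright exteriors end up crowded against code 255.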
In any case, as a rough measure one can say that above luminance ratios of 500:1 (or at least 1000:1), LDR encoding becomes problematic, and we enter the HDR encoding technology field, at least if we want to encode the scene right. This already happens with geometric form factors which create an illumination unevenness of about 5-10 to 1, highlight to shadow, since reflectances of objects typically range between 1% and 100% (a 100:1 reflectance range multiplied by a 5-10:1 illumination range giving 500:1 to 1000:1). Such an illumination reduction can already happen in a room a couple of meters away from the window.
An example of a high dynamic range scene, which also clearly manifests a distinct color scheme to the human viewer, is a dusk cityscape. The whites have become light greys to human vision, and white seems to be missing in the scene, as the lights already jump to a brightness level above that (“light”). I.e., one would like to be able to show these on a HDR display as light objects, and also code them in such a way that they can clearly be recognized as lights (especially by renderers which don't directly apply the input signal as driving signal, but do some gamut mapping optimization). Note that because of the decoupling of camera capturing, coding, and display, one should carefully discriminate which dynamic ranges one specifies (and they need not always be luminance contrasts), since a particular e.g. 100000:1 dynamic range scene may not necessarily need the same contrast when rendered (e.g. the sun on the display need not actually be able to hurt your eyes), the actual relevant factor being a psychovisually reasonably similar appearance. Let alone that, in a generic, highly non-linear encoding, this would say anything about the dynamic range of a codec, since factors like the particular mapping or the coding/rendering precision may all have an influence on that. As to display rendering, one knows one has a HDR display system if it can render in particular light effects which could not be rendered faithfully on an LDR display, such as real shining lamps, or real-looking sunlighting of outdoors scenes. And in particular the lightnesses of other scene objects (e.g. indoors furniture) are coordinated with that, i.e. given such lumas that a good appearance results for both the light and the normal/darker objects (human vision being relative).
The (native) solution first envisioned for HDR image encoding, was i.a. conceived by people working in the computer graphics arena, since in a computer any kind of signal can be made (without capturing lens limitations, in a computer the universe next to a supernova can really have a zero luminance, also without any captured photon noise). In that framework being able to totally abandon any previous television technology constraint, a logical solution would be just to encode the scene luminances linearly. This would mean that a higher amount of code bits were needed for the pixel lumas, e.g. 16 or 32. Apart from the higher amount of data, which may for video sometimes be an issue, as said above, such native encoding has absolutely no link (or embedded technological knowledge, like additional values, measurements, or knowledge included in equations, which could be co-encoded as metadata together with or separate but linkable to the encoded pixel image) with the rest of the imaging chain, i.e. the rendering system.
An alternative second way of encoding was inspired by, or at least conceptually relatable to, dual display systems, like dual LCD panel displays, or single panel LCDs with a 2D modulatable backlight. In these systems, the final output is a multiplication of the light pattern produced by the back layer display and the transmission of the front LCD. The question is then how to drive both signals, given that e.g. we have as above a native 16 bit (at least luma) HDR encoding, and standard driver electronics and a physical modulation capability of the LCD of say 8 bit (which means that on a linear transmission the LCD can make a black of 1/255 of its full transmission, and potentially somewhat different values for non-linear behavior; and say e.g. the backlight is also modulatable by 8 linear bits). A simple solution would then be to take the square root of the pixel lumas, and send this same square root to each of the two drivers. In principle any multiplicative decomposition would (theoretically) do. E.g., if the LCD could only vary the transmission in 4 steps (2 bit linear), one could still make the exact HDR system, if only one drives the backlight with a signal giving the quotient of a division: Y_backlight = Y_HDR/Y_LCD, in which the Y_LCD would in this example more brightly or darkly modulate what light is behind in 4 different ways (e.g. maximally block, which may be e.g. transmit 1/80th of the light behind, vs. transmit 100%, and 2 equidistant transmissions in between).
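The two decompositions just described may be sketched as follows, purely as an illustration. Linear driving, ideal optical multiplication and the constant and function names are assumptions of the sketch, not properties of any particular display.

```python
import math

MAX_HDR = 65535    # largest 16 bit linear luma code
MAX_DRIVE = 255    # largest 8 bit driver code


def sqrt_split(y_hdr):
    """Return (backlight, lcd) 8 bit codes: the same square-root signal for both layers."""
    drive = round(MAX_DRIVE * math.sqrt(y_hdr / MAX_HDR))
    return drive, drive


def reconstruct(backlight, lcd):
    """Optical product of the two layers, rescaled to the 16 bit luma range."""
    return MAX_HDR * (backlight / MAX_DRIVE) * (lcd / MAX_DRIVE)


# Four LCD transmissions: 1/80 (maximal blocking), 100 %, and two equidistant values between.
LCD_LEVELS = [1 / 80 + k * (1 - 1 / 80) / 3 for k in range(4)]


def backlight_for(y_hdr, lcd_transmission):
    """Backlight value completing the product: Y_backlight = Y_HDR / Y_LCD."""
    return y_hdr / lcd_transmission
```

For instance, `reconstruct(*sqrt_split(20000))` returns a value within about 1 % of 20000, the deviation being due to the rounding of the shared 8 bit drive code.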
The Y_HDR would be the 16 bit signal, in which the maximum value would signify some very bright scene luminance, approximately renderable by switching the backlight of the display (locally) to its maximum value (taking into account heating, aging, etc.). So, again using a linear coding because that is how the rendering works physically, the backlight would need to cover ¼ of the 16 bit range (65536 linear steps to be made), i.e. 16384 steps, the 2 bit LCD providing the remaining factor of 4, which (again if we suppose we need a linear coding and equidistant driving) means the backlight will be driven by a 14 bit signal (if such precision is needed). The backlight can hence change the local value into the LCD valve by any factor needed to render the HDR image. In fact, since these displays contained a far smaller number of LED backlight elements than pixels, some approximation of the image was rendered, by driving the backlight according to some average illumination. So e.g. as in claim 2 of U.S. Pat. No. 7,172,297 of the University of British Columbia, one first calculated the average luma of the local image pixels, which resulted in a backlight value approximating the needed rendering, and then one set the LCD pixels as the division of the Y_HDR by this approximation. So the interesting property of this multiplication is that it corresponds to a reduction in the linear bits needed to encode one of the images, which can be mathematically seen as some kind of range compression, or gamut mapping.
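The averaging scheme described above may be illustrated by the following sketch, which works on a one-dimensional row of pixels for brevity. It is not the method of the cited patent; the block size, the function name and the use of a plain arithmetic mean are assumptions made for the example.

```python
def split_for_local_dimming(y_hdr, block):
    """Return (backlight, lcd) for a row of linear lumas.

    backlight holds one value per block of `block` pixels (their average luma);
    lcd holds one factor per pixel, namely wanted luma divided by the light behind it.
    """
    backlight = []
    lcd = []
    for start in range(0, len(y_hdr), block):
        pixels = y_hdr[start:start + block]
        level = sum(pixels) / len(pixels)  # average luma drives this backlight element
        backlight.append(level)
        # Guard against a fully dark block, where the division would be undefined.
        lcd.extend(p / level if level > 0 else 0.0 for p in pixels)
    return backlight, lcd


row = [100.0, 300.0, 1000.0, 3000.0]
backlight, lcd = split_for_local_dimming(row, block=2)
rebuilt = [backlight[i // 2] * factor for i, factor in enumerate(lcd)]
```

Note that with a plain average some LCD factors exceed 1, which a real valve (transmission of at most 100%) cannot deliver; a practical driver would have to raise the backlight level or clip, which is one reason why only an approximation of the image is rendered.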
So one elaborated further on this, namely, to encode any HDR picture based on such a multiplicative scheme (not necessarily for a real two-layer display). I.e. one could form a first picture by doing some generic tone mapping, and create a standard JPEG picture (Y_JPEG) from the resulting mapped 8 bit image. One then stores a second picture, which is the ratio image Y_HDR/Y_JPEG. At the decoder side, one can then use the normal LDR JPEG picture, or recreate a HDR picture by multiplying the two LDR pictures (assuming the original was 16 bit, yielding two 8 bit pictures, which is in general sufficient for most if not all HDR scenes or scenarios). A first disadvantage of this method is that, although any HDR image can be encoded in this way (by correcting whatever is in the JPEG picture in the ratio picture, or at least coming to a reasonable approximation should the JPEG be so badly encoded that the resulting correction goes over the possible range, which could happen e.g. if two adjacent pixels are chosen to be 1 in the JPEG, but should be 230 resp. 350 in the HDR, again assuming linearity), this comes at the price of needing to encode 2 pictures. There being no savings from any mathematical correlation, and apart from needing the surrounding semantics to format those two pictures, one would prima facie seem to need the same amount of bits as when storing a single 16 bit image (at least if one doesn't spatially subsample etc.). Secondly, this “blind” decomposition has nothing to do with the physics of the actual renderer, or with physical or psychovisual semantic laws present in the rendered scene (such as which object is merely a bright lamp); rather, it merely results from a multiplicative correction of whatever one has chosen to become the JPEG base image. But it is a nice backwards compatible strategy to encode images.
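The ratio-image scheme, including the out-of-range case mentioned above, may be sketched as follows. The sketch assumes linear lumas and an 8 bit linear ratio picture, as the text does for simplicity; the function names are hypothetical, and the compression of the two pictures themselves is left out.

```python
def encode_ratio(y_hdr, y_base, max_ratio=255):
    """Ratio picture Y_HDR / Y_base per pixel, rounded and clipped to an 8 bit range."""
    return [
        min(max_ratio, round(h / b)) if b > 0 else 0
        for h, b in zip(y_hdr, y_base)
    ]


def decode_hdr(y_base, ratio):
    """Recreate the HDR lumas by multiplying the base picture with the ratio picture."""
    return [b * r for b, r in zip(y_base, ratio)]


# Example from the text: both base pixels are 1, the HDR lumas should be 230 and 350.
base = [1, 1]
ratio = encode_ratio([230, 350], base)
rebuilt = decode_hdr(base, ratio)
```

The first pixel is recovered exactly (ratio 230), whereas the second needs a ratio of 350, which exceeds the 8 bit range and is clipped to 255, so that only an approximation of the HDR luma results.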
A third way of coding could be traced from a history of prediction-correction scalable codings, in which a prediction is corrected by an additive correction image. Originally this happened in inter alia SNR scalability, and the first image was an approximation, which may contain rounded or quantized versions of the pixel lumas. Onto that was added a picture which added further precision (note that other variants could contain e.g. a spatial approximation, which could also be corrected by adding a correction signal, which then would also restore high frequencies, e.g. at boundaries). So if e.g. the original (LDR) signal to be encoded had spatially adjacent pixels 127, 144, one could e.g. encode an approximation of 6 bits with precision steps of 4, giving pixel values 128 and 144. One could then correct this with an image of higher precision containing the values −1 and 0. Since the approximation was already largely good, the range of the correction signal should be lower, which could result in bit savings.
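The numerical example above can be reproduced with the following sketch, which assumes simple rounding to the nearest multiple of the step size and ignores clipping at the top of the code range; the function name is hypothetical.

```python
def split_base_and_correction(pixels, step=4):
    """Split lumas into a coarse base layer and an additive correction layer."""
    # Python's round() sends exact ties to the even multiple; other tie rules work too.
    base = [step * round(p / step) for p in pixels]
    correction = [p - b for p, b in zip(pixels, base)]
    return base, correction


base, correction = split_base_and_correction([127, 144])
restored = [b + c for b, c in zip(base, correction)]
```

With a step of 4 the corrections never exceed 2 in magnitude, so the correction layer needs far fewer bits per pixel than the original signal, which is the source of the potential bit savings.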
Since range and precision within a range can in principle be interchanged, one could also envisage using such a technique for encoding HDR images. In fact, one could define the maximum of any coding range (also an 8 bit encoding) to correspond with whatever scene luminance. But this was seen to be probably only reasonable for larger than 8 bit encodings, given the amount of brightness steps in HDR scenes. Also, mere scalability does not imply any change in tone mapping, i.e. by definition just handles the precision of lumas question, but does not state anything as to how a particular LDR encoding would relate to any HDR encoding, or how any encoded image would need to be optimally rendered on any display (without e.g. being rendered too dark in general on a display of lower peak brightness).
Further building on this concept, a two-layer HDR encoding method was developed as in WO2007/082562 (see FIG. 1). In such an encoder, one recognizes there is a relationship between HDR and LDR, as it may be captured, encoded (e.g. by means of gamut mapping), or typically graded (typically by an artist grader, working for the content producer). E.g., since an LDR gamut (as defined by what a typical LDR display of say 400 nit would render) may not be able to contain bright regions faithfully, such as a sunny outdoors, one may map such a region to the LDR space by lowering its lumas (and potentially also decreasing color saturation). Making a HDR image from such an LDR encoding of the original scene would involve mapping pixel lumas/colors of those bright outdoors regions of the image to higher brightnesses (or in other words predicting what a HDR graded image could be like), e.g. by offsetting those LDR lumas by adding a fixed or LDR-luma-dependent brightness, or in general applying a mapping function to at least the lumas: Y_HDR=f(Y_LDR). One would at least get a more HDR-ish look, but how close this prediction would be to the original HDR grade would strongly depend i.a. on the correctness (and complexity) of the mapping/prediction function. Because of the high complexity of an image (making people normally opt for a simpler prediction, e.g. a global tone mapping which maps each pixel luma solely on the basis of the luma value and no other factors like the spatial position of the pixel in the image, rather than a more complex one which doesn't fully accurately predict the original HDR image anyway), there will be a difference, and this will be a difference image. So these two-layer methods will also encode this difference image.
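The prediction-plus-difference principle may be illustrated as follows. The piecewise-linear mapping f used here, its knee position and its gains are purely hypothetical choices for the example and are not taken from WO2007/082562.

```python
def predict_hdr(y_ldr, knee=200, low_gain=4, high_gain=40):
    """Hypothetical global tone mapping f: depends on the LDR luma only, not on position."""
    if y_ldr <= knee:
        return low_gain * y_ldr
    # Lumas above the knee (e.g. sunny outdoors) are boosted more strongly.
    return low_gain * knee + high_gain * (y_ldr - knee)


def encode_two_layer(y_hdr, y_ldr):
    """Difference image: original HDR lumas minus their prediction from the LDR lumas."""
    return [h - predict_hdr(l) for h, l in zip(y_hdr, y_ldr)]


def decode_two_layer(y_ldr, difference):
    """Rebuild the HDR lumas from the LDR layer plus the difference layer."""
    return [predict_hdr(l) + d for l, d in zip(y_ldr, difference)]


y_ldr = [50, 200, 250]
y_hdr = [210, 790, 3000]
difference = encode_two_layer(y_hdr, y_ldr)
rebuilt = decode_two_layer(y_ldr, difference)
```

In this example the prediction errs in both directions (differences of 10, −10 and 200), which shows that the difference image depends entirely on how well f matches the actual HDR grade.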
Because the difference between an LDR grade (which in principle doesn't even have to be close or similar to the HDR grade, but could be anything) and an HDR grade is entirely different from a difference between an X bit accurate and an X+Y bit accurate representation of a signal, these difference images need not have a restricted range of values. They could in principle be anything, even up to a 16 bit image like the original HDR instead of an 8 bit difference image, e.g. if the prediction was so bad as to predict successive zeroes for the pixel lumas, whereas the HDR pixel lumas would e.g. be 65000, 65004 etc. (although such a worst case scenario is so unlikely that one could constrain the codec to just make mistakes in that case). In any case, testing some of those predictive codecs with a correction picture, we found that they may require a large amount of encoded data, and that in particular this data may encode image information which is not really so relevant to the HDR experience, such as e.g. a correction of prediction model errors which mapped the HDR lumas in the wrong direction, or noise or image structures which are not so relevant psychovisually, or at least not the most important image structures contributing to the HDR impact (in a hierarchy of HDR relevance, e.g. a flame may be important, and that look may already be encoded by few, well-chosen data words).
So it is an object of the below presented technologies to provide HDR encoding techniques (i.e. any encoding techniques of a higher quality of image regions along a luma range than classical LDR) which give a better control over the encoding of at least some if not all HDR aspects in a scene (i.e. lights, lighting of objects such as sunlighting of certain image regions, improved rendering of certain aspects such as local contrast, etc.), leading to such potential advantages as e.g. a lower bit rate, or at least more significant information in the hierarchy of encoded bits.