Recently, new developments have occurred regarding the encoding of images/video (whether of captured scenes or computer graphics): it is desirable to better capture the entire range of object luminances and colors occurring in nature, up to large luminance values like e.g. 25000 nit (e.g. sunlit clouds) and often also low values like 0.01 nit, which is called HDR (high dynamic range) encoding. Until now, classical image capturing systems (i.e. the chain starting at the camera, and even appropriate scene lighting which was typically relatively uniform, followed by encoding for e.g. image storage or transmission, up to the display of the image) have handled high dynamic range scenes in a severely distorting way. High dynamic range scenes are scenes in which there are simultaneously important dark regions with low luminances and significant objects therein, and bright regions with high luminances, in particular if there are also various important regions of intermediate luminances (various greys), and in particular if several of those scene luminances may not easily map to what is usable by a component in the chain, such as e.g. a linear mapping based rendering on a display. E.g. if the action was happening inside an enclosed volume of a first light level (illuminance), such as a car or room, regions of brighter lighting, such as the environment seen through the window, may have been captured, or at least represented in the signal, with very low quality (namely pastel, washed out or clipped colors). This is especially so for cheaper CMOS based cameras, compared to the more forgiving behavior of e.g. celluloid film. In particular, only a few hardly representative code values may have been associated with the objects in these bright regions, which may result in bad representation of the object textures, or even blunt clipping to the maximum value of the color space used for encoding.
Having so little data in these regions of the luminance axis of the captured image also means that processing functions, e.g. those optimizing displayed image contrast, may have difficulty producing good final pixel data. With ever better displays becoming available nowadays and in the near future (e.g. with peak brightness of several 1000s of nits), or at least smarter image processing technologies, one may desire to improve upon that situation, to be able to create rendered images of higher quality.
For several reasons, at least for a number of years into the future, one may desire some form of backwards compatibility, which means that data of a so-called low dynamic range (LDR) encoding must be available or at least easily determinable from the available encoding, so that e.g. a novel upgraded video processing box can deliver an LDR signal to a lower dynamic range display (e.g. a mobile display). Also from a point of view of storage, it may be very useful to store an image signal in as versatile a manner as possible, i.e. not just with the maximum amount of useful data about the scene, but also in a manner such that this data will serve many potential future applications, especially if in a simple way. Typically the shooting of a movie e.g. takes so much effort that the raw signal is highly valuable, and one had better encode it in the best possible way a technology allows. One should not fall into the trap that even the master encoding of a program is, for a later generation of better quality display systems, below what could have been achieved if the data had been encoded differently. That avoids not only having to do an expensive stunt all over again, but the reader can also imagine that some to-be-recorded events, like the marriage of a royal couple or a wedding video of a normal couple, won't be done over. And trying to remaster such a video for a new generation of display technology is, if not very difficult, at least cumbersome. It is preferable that the encoding allows capturing the scene optimally in the first place, and even easily allows for later improvements, by its very encoding structure. Independent from how it is rendered on a particular display plus viewing environment, the information present in current LDR encodings such as JPEG (depending inter alia on the particular captured scene and used camera system) is currently seen as (limited to) approximately 11 linear bits or stops. Of course, if the encoding is to be used directly for rendering (e.g. display-referred), some of the information bits may not be visible. On the other hand, a codec may contain information from the original scene or graphics composition (scene-referred), which can become relevant e.g. when a display is changing its human-visible gamut by means of image processing. So it is important to have at least the more important image objects well-represented in the coded image.
A HDR capturing chain is more than just pointing a camera at a scene with a large luminance contrast ratio between the darkest and the brightest object and linearly recording what there is in the scene. It has to do with what exactly the intermediate grey values for all the objects are, since that conveys e.g. the mood of a movie (darkening already some of the objects in the scene may convey a dark mood). And this is a complex psychological process. One can e.g. imagine that psychologically it isn't that important whether a bright light is rendered on a display exactly in such proportion to the rest of the rendered grey values as the scene luminance of that light was to the rest of the scene object luminances. Rather, one will have a faithful impression of a real lamp, if the pixels are rendered with “some” high display output luminance, as long as that is sufficiently higher than the rest of the picture. But that distribution between self-luminous and reflecting objects (in the various illumination regions of the scene) is also a task depending on the display gamut and typical viewing conditions. Also one may imagine that the encoding of the darker regions is preferably done so that they can be easily used in different rendering scenarios such as different average surround lighting levels, having different levels of visibility for the darker image content. In general because this is a difficult psychological task, artists will be involved in creating optimal images, which is called color grading. In particular, it is very handy when the artists make a separate LDR grading, even if that is done in a “pure HDR encoding strategy”. In other words in such a scenario when encoding a sole HDR camera RAW signal, we will still also generate an LDR image, not necessarily because it is to be used for a large LDR fraction of the video consumption market in the future, but because it conveys important information about the scene. 
Namely there will always be more important regions and objects in the scene, and putting these in an LDR substructure (which can conceptually be seen as an artistic counterpart of an automatic exposure algorithm, yet after the full capturing, and in relation to captured luminances outside that range) makes it easier to do all kinds of conversions to intermediate range representations (MDR), suitable for driving displays with particular rendering and viewing characteristics. By using such a technical framework, we can even with a single encoding image at the same time tailor for e.g. LDR displays like a mobile display with a peak brightness of 50 nit (indoors, or a higher brightness but competing against high outdoors illuminance), a mid range peak brightness MDR display of say 1200 nit, and a HDR display of say 8000 nit peak brightness. In particular one may tune this LDR part according to several criteria, e.g. that it renders with good quality on a standard reference LDR display (the colors look as similar as possible to those on the HDR display), or conveys a certain percentage of the total captured information (e.g. a certain amount of the image is visible), etc. In our below proposed codec we will provide that such a receiving display can, from that single all-encompassing scene encoding (or grading), easily identify what are e.g. the dark regions, so that it can optimally tailor the incorporated visibility thereof given the known characteristics of the displaying system.
There are not so many ways to encode a HDR signal. Usually in prior art one just natively codes the HDR signal, i.e. one (linearly) maps the pixels to e.g. 16 bit float words, and then the maximum captured luminance value is the HDR white in a similar philosophy to LDR encoding (although psychovisually this usually is not some reflective white in the scene, but rather a bright color of a lamp). This is a native scene-referred encoding of the original scene object luminances as captured by the camera. One could also map a full range HDR signal to the 8 bit LDR range via some “optimal” luma transformation function, which would typically be a gamma function or similar. This may involve losing color precision (in view of the trade-off between range and precision for such encodings) with corresponding rendering quality issues, especially if image processing such as local brightening is to be expected at the receiving side. However, the dominant grey value grading of the image objects (e.g. the average over an object) is roughly preserved (i.e. their relative/percentage luma relationships).
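The range versus precision trade-off of such a single-function mapping can be illustrated with a minimal sketch. The function names, the power-law form, the exponent 2.4 and the 5000 nit peak are assumptions chosen purely for illustration; the text above only states that the mapping would typically be a gamma function or similar.

```python
def encode_luma(luminance_nit, peak_nit=5000.0, gamma=2.4, bits=8):
    """Map a linear luminance to an integer luma code via a power law.

    peak_nit, gamma and bits are illustrative assumptions, not values
    prescribed by the text.
    """
    max_code = (1 << bits) - 1
    # Clip to the representable range, then normalize to [0, 1].
    normalized = min(max(luminance_nit / peak_nit, 0.0), 1.0)
    return round(max_code * normalized ** (1.0 / gamma))


def decode_luma(code, peak_nit=5000.0, gamma=2.4, bits=8):
    """Invert encode_luma up to quantization error."""
    max_code = (1 << bits) - 1
    return peak_nit * (code / max_code) ** gamma


if __name__ == "__main__":
    for nit in (0.01, 1.0, 100.0, 1000.0, 5000.0):
        code = encode_luma(nit)
        print(f"{nit:8.2f} nit -> code {code:3d} -> {decode_luma(code):9.3f} nit")
    # Luminance gap between the two brightest codes: the price paid for
    # squeezing a wide range into only 256 levels.
    top_step = decode_luma(255) - decode_luma(254)
    print(f"step between codes 254 and 255: {top_step:.1f} nit")
```

Under these assumed parameters a single code step near the top of the range spans tens of nits, which is the kind of precision loss that becomes visible once a receiver applies e.g. local brightening.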
Prior art has also taught some HDR encoding techniques using two picture data sets for each HDR image, typically based on a kind of scalable coding concept, in which by some prediction function the precision of a “LDR” encoded local texture is refined or, stated more accurately, projected to a HDR version of that texture, typically by scaling the LDR luminances (the LDR in those technologies is normally not a good looking LDR grade already suitable for optimal rendering on a typical (reference) LDR display, but typically a simple processing on the HDR input). Then the second picture is a correction picture for bringing the predicted HDR image close to the original HDR image to be encoded. There is some similarity to the single HDR image encodings, through the prediction functions serving as some range/precision definition criterion, only in these technologies the encoding is performed with two pictures.
Scaling the lumas of a base-band image involves applying a transformation, and this predicting transformation is often defined per block, to reduce the amount of data to be encoded. This may be already wasteful data-wise, since many blocks will contain the same object, and hence need the same transformation.
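The redundancy of a per-block prediction can be made concrete with a small sketch. It assumes, purely for illustration, a one-dimensional row of pixels, a single multiplicative gain per block, and made-up pixel values; the helper names `block_gains` and `predict_hdr` are hypothetical and not taken from any cited prior art.

```python
def block_gains(ldr, hdr, block_size):
    """Estimate one multiplicative gain per block (ratio of block sums)."""
    gains = []
    for start in range(0, len(ldr), block_size):
        ldr_sum = sum(ldr[start:start + block_size])
        hdr_sum = sum(hdr[start:start + block_size])
        # Fall back to unity gain for an all-black block.
        gains.append(hdr_sum / ldr_sum if ldr_sum else 1.0)
    return gains


def predict_hdr(ldr, gains, block_size):
    """Predict HDR values by scaling each LDR pixel with its block's gain."""
    return [value * gains[i // block_size] for i, value in enumerate(ldr)]


# Illustrative data: the first two blocks cover one dim object (gain 4),
# the last block covers a bright region (gain 20).
ldr_row = [10, 12, 11, 13, 10, 11, 12, 13, 200, 210, 205, 215]
hdr_row = [v * 4 for v in ldr_row[:8]] + [v * 20 for v in ldr_row[8:]]

gains = block_gains(ldr_row, hdr_row, block_size=4)
predicted = predict_hdr(ldr_row, gains, block_size=4)

if __name__ == "__main__":
    print(gains)
    print(predicted == hdr_row)
```

In this toy row the two blocks covering the same object yield the same gain, so transmitting it once per block encodes the same transformation twice.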
As said the difference of the original HDR image with the prediction may be co-encoded as an enhancement picture to the degree desired, yet as far as possible given the range and definition of the enhancement image. E.g., one may represent a HDR gray value of 1168 with a division by 8 to a value 146. This HDR value could be recreated by multiplying by 8 again, but since a value 1169 would quantize to the same base layer value 146, one would need an enhancement value equal to 1 to be able to recreate a high quality HDR signal. An example of such a technology is described in patent EP2009921 [Liu Shan et al. Mitsubishi Electric: Method for inverse tone mapping (by scaling and offset)]. An interesting question about such methods is always what the enhancement method actually brings as visual information improvement. It is normally applied blindly, and may e.g. for textured regions sometimes not contribute relevant additional information, especially for fast changing video.
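The arithmetic of the example above can be written out as a minimal sketch. The function names and the plain integer division are illustrative assumptions; this is not the method of EP2009921, which also involves an offset.

```python
def split_layers(hdr_value, scale=8):
    """Split an HDR value into a base-layer value and an enhancement residual."""
    base = hdr_value // scale                 # quantized base-layer value
    enhancement = hdr_value - base * scale    # what the base layer cannot represent
    return base, enhancement


def reconstruct(base, enhancement=0, scale=8):
    """Rebuild the HDR value from the base layer plus optional enhancement."""
    return base * scale + enhancement


if __name__ == "__main__":
    for hdr in (1168, 1169):
        base, enh = split_layers(hdr)
        print(hdr, "-> base", base, "enhancement", enh,
              "| base only:", reconstruct(base),
              "| with enhancement:", reconstruct(base, enh))
```

Both 1168 and 1169 map to the base value 146; only the enhancement value (0 versus 1) distinguishes them on reconstruction.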
Another two-picture encoding is described in the currently not yet published application U.S. 61/557,461 of which all teachings are hereby incorporated by reference.
Now there are problems with all the existing HDR encodings. Just applying global transformations may be much too coarse according to what the content creator desires after having invested so much in e.g. a movie (special effects). Other applications, like the making of a television program, may be less critical, but good control over the final look is still desirable. Such control would, at the least, come at the cost of needing many encoded data bits. On the other hand, specifying intricate transformations per pixel also involves a large amount of data to be encoded. This applies to e.g. needing to encode a second image being a lightness boost map, for object texture reflections being encoded in a first image. Also, herewith one blindly encodes anything possibly occurring on the input, not knowing much about what is actually in the image (i.e. not allowing versatile use), not even realizing there may be a large amount of redundancy in the boost image. Let alone that this blind data is easy to use for smart algorithms like e.g. picture improvement or optimization algorithms at the display side.
Working on a block basis reduces the amount of data, but still is not optimal. In particular this block structure also being rather blind to the actual image content, and more annoyingly, imposing a new geometric structure being the block grid, which has nothing to do with the underlying image, and may hence match more or less conveniently with the image characteristics (in particular the image geometry), means that several block-coding related artifacts may occur. In fact a block is not much more than just a large pixel, and not really a smart content-related structure (neither as regards the color-geometrical structure of that object or region, nor its semantic meaning, such as it e.g. being an object which should be mostly hidden in the dark).
The below embodiments aim at providing easy technical measures to mitigate at least some of those artifacts.