HDR Imaging
The human eye can see very large differences between brightness levels. Within a single scene viewed at one time, the ratio between the darkest parts of the scene that do not dissolve into pure black and the brightest parts of the scene that do not dissolve into pure white can exceed 1:10,000. One example would be looking at a person who stands with his back to the sun. The person's face will reflect far less light toward the viewer than the sky around the sun sends, and even though the sun itself is too bright for the human eye to see details on it, one will have no difficulty seeing, approximately simultaneously, the person's face in the shadow and very bright clouds near the sun.
This large dynamic range has presented difficulties for photographers and filmmakers ever since the invention of these technologies, because the dynamic range that can be captured by photographic film and, more recently, by electronic image sensors tends to be much less than the eye's dynamic range. As a consequence, if a scene with high dynamic range is captured on photo or video, it often happens that either parts of the scene are underexposed or parts of the scene are overexposed. In the example with the person standing with his back to the sun, a photographer or videographer could choose between having the person's face correctly exposed, but the sky overexposed and appearing as pure white, or having the sky correctly exposed, but the person's face appearing very dark or black with little detail visible. In the days before digital imaging, photographers and filmmakers would use techniques such as dodging and burning to correct these problems partially, but the process was burdensome and the results often imperfect.
In the field of still images we have seen a partial improvement of the situation with the emergence of technologies often grouped as ‘HDR photography.’ A typical workflow goes like this: A photographer wishing to capture a scene with a dynamic range greater than can be captured with sufficient detail with his camera sets his camera up on a tripod to keep it still over several exposures. Then he selects an exposure bracketing function on his camera that causes the camera to vary exposure by a predefined amount over successive images. He presses the shutter button repeatedly, or has an electronic control unit do so for him, and in this way obtains images of the same scene with different exposures; each part of the scene should be properly exposed in at least one of the images taken. Then the photographer can use a software package such as Photomatix® Pro, made by HDRsoft LTD of Brighton, United Kingdom, to combine these multiple exposures into a single tonemapped image. This workflow is inconvenient, but it is routinely practiced by photographers now.
The abbreviation HDR stands for ‘high dynamic range,’ and the abbreviation LDR stands for ‘low dynamic range.’ Whether a given dynamic range is considered high or low depends somewhat on circumstances and the environment one is talking about. Certainly, today the dynamic range of 8-bit color channels as they are commonly used on computer monitors and more affordable consumer cameras would be considered low, and a dynamic range corresponding to more than 16-bit resolution would be considered high. In between, one typically considers a dynamic range high if it cannot be attained by the ‘on-board’ means of the technology in question. For example, most file formats to store motion pictures for playback on consumer devices are limited to 8 bits today, and so in this context any dynamic range higher than that would be considered high. Many digital single-lens reflex cameras can produce still images with a dynamic range corresponding to 10-bit resolution, and so this dynamic range might not be considered high in the context of recording still images with such cameras, but as of the time of this writing it would still be considered high for taking still images on smartphones and cheap consumer cameras. The term ‘high-dynamic-range’ or ‘HDR’ is also often used as a shorthand for ‘low-dynamic-range image obtained by tonemapping a high-dynamic-range image.’
The exact algorithms used by software packages such as Photomatix® Pro typically are secret, but they function approximately like this: First, the several pictures of different exposure values are merged into an intermediate picture of high dynamic range. This is achieved by inferring the camera's image transfer function (the camera's mapping from amounts of light received on a pixel during the exposure to the luminance value reported by the camera for that pixel), applying the inverse of this function to the images, and then calculating for each pixel a weighted average of the luminance seen by the camera, giving more weight to the exposures that are correctly exposed for that pixel and less weight to the exposures that are not correctly exposed. In this way, one arrives at a luminance map, a good representation of the amount of light that actually hit the camera at the time the exposures were taken. This part of the process is understood fairly well in the art.
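By way of illustration only, the merging step described above might be sketched as follows. The sketch does not reproduce the algorithm of any particular software package. It assumes, for simplicity, that the camera's transfer function is already known to be a pure power law with exponent 1/2.2 rather than inferred from the images, that pixel values are normalized to the range 0 to 1, and that a simple hat-shaped weight expresses how well a pixel is exposed. All function names, including the hypothetical `simulate_camera` helper that merely produces input for the example, are illustrative.

```python
def inverse_response(value, gamma=2.2):
    """Assumed inverse transfer function: undoes a pure power-law encoding."""
    return value ** gamma


def exposure_weight(value):
    """Hat-shaped weight: 1 at mid-gray, 0 at pure black and pure white."""
    return 1.0 - abs(2.0 * value - 1.0)


def merge_to_luminance_map(exposures, times, gamma=2.2):
    """Merge equally sized images (lists of values in [0, 1]) into a luminance map."""
    luminance_map = []
    for pixel_values in zip(*exposures):
        # Per-exposure estimate of scene luminance: linearize, then divide by exposure time.
        estimates = [inverse_response(v, gamma) / t for v, t in zip(pixel_values, times)]
        weights = [exposure_weight(v) for v in pixel_values]
        total = sum(weights)
        if total > 0.0:
            luminance_map.append(sum(w * e for w, e in zip(weights, estimates)) / total)
        else:
            # Every exposure is clipped at this pixel; fall back to a plain average.
            luminance_map.append(sum(estimates) / len(estimates))
    return luminance_map


def simulate_camera(scene, time, gamma=2.2):
    """Hypothetical camera used only to produce example input: clip, then gamma-encode."""
    return [min(1.0, v * time) ** (1.0 / gamma) for v in scene]


# Three scene luminances spanning a ratio of 1:200, bracketed with three exposure times.
scene = [0.01, 0.5, 2.0]
times = [0.25, 1.0, 4.0]
exposures = [simulate_camera(scene, t) for t in times]
recovered = merge_to_luminance_map(exposures, times)
```

In this example the brightest pixel is clipped in the two longer exposures and therefore receives zero weight there, so its luminance is recovered from the shortest exposure alone.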
Once the HDR processing software has arrived at the luminance map, the work is not done, however. Since the luminance map is a fairly accurate representation of the actual scene's light thrown in the direction of the camera, it, too, has a high dynamic range. Unfortunately, most existing display technology can only display a fairly low dynamic range, often just eight bit corresponding to a range of 1:255 (remember that the eye can see more than 1:10,000). Thus a way is needed to compress the large dynamic range of the luminance map into a much smaller dynamic range that can be shown on an electronic screen or printed on photo paper. This process is called ‘tonemapping.’
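The size of the gap between these two ranges can be made concrete by expressing each contrast ratio as a base-2 logarithm, i.e., as a number of doublings of brightness. The following short calculation is offered only as an illustration; the function name is chosen for this example.

```python
import math


def doublings(contrast_ratio):
    """Number of brightness doublings spanned by a contrast ratio of 1:contrast_ratio."""
    return math.log2(contrast_ratio)


display_range = doublings(255)     # just under 8 doublings
eye_range = doublings(10_000)      # a little over 13 doublings
gap = eye_range - display_range    # a little over 5 doublings must be compressed away
```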
Tonemapping is as much an art as it is a science, with many approaches documented in the literature and more approaches being contained in proprietary software products without having been documented for the public. Broadly, however, there are two approaches one can take: either spatially uniform tonemapping or spatially varying tonemapping. In spatially uniform tonemapping, the value of a given pixel in the mapped image depends solely on the value of that pixel in the luminance map or on pixels in its immediate vicinity; the tonemapping is spatially uniform in that it considers only one or a few pixels at a time and uses the same uniform rule for all of the pixels so considered. This brings obvious advantages in processing speed. Some techniques for spatially uniform tonemapping, well-known in the art, are application of a power-law function (often called ‘gamma’), application of a logarithm function, and histogram equalization, perhaps followed by improving local contrast using a 3×3 or 5×5 kernel.
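Two of the spatially uniform techniques named above, the power-law function and the logarithm function, might be sketched as follows. This is a minimal illustration rather than any product's implementation; it assumes that the luminance map is a list of non-negative values with a positive maximum, that the output is an 8-bit value, and that scaling to the brightest pixel is acceptable. The function names and the default exponent are choices made for this example.

```python
import math


def tonemap_gamma(luminance_map, gamma=2.2):
    """Power-law ('gamma') operator: the same rule is applied to every pixel."""
    peak = max(luminance_map)
    return [round(255 * (v / peak) ** (1.0 / gamma)) for v in luminance_map]


def tonemap_log(luminance_map):
    """Logarithmic operator: compresses large luminances more than small ones."""
    peak = max(luminance_map)
    return [round(255 * math.log1p(v) / math.log1p(peak)) for v in luminance_map]


# A luminance map spanning a ratio of 1:10,000.
luminance_map = [1.0, 10.0, 100.0, 1000.0, 10000.0]
mapped_gamma = tonemap_gamma(luminance_map)
mapped_log = tonemap_log(luminance_map)
```

In both operators the mapped value of a pixel depends only on that pixel's own luminance and on one global constant, so two pixels of equal luminance always receive the same output value wherever they lie in the image.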
Spatially uniform tonemapping has a serious deficiency, however. Consider the example we started this discussion with, a person standing with his back to the sun. Now a human would perceive the white of that person's eyes or his teeth as white and dark parts of a cloud near the sun as dark, even though objectively the dark parts of the cloud are still sending much more light into the viewer's eye than the subject's eyes or teeth. This is because humans view brightness not in absolute terms but relative to other parts of the same object. So we see the person's eyes or teeth as white because they are much brighter than the other parts of the person's face, and we see the dark parts of the cloud as dark because they are much darker than the cloud's bright parts or the sun next to the cloud. This is taken into account by spatially varying tonemapping operators that calculate the mapped brightness of a pixel not just from the pixel itself or pixels in its direct neighborhood, but also from other pixels in the target pixel's wider vicinity or even all of the image. Many different approaches are known and practiced in the art. A very simple example would be to apply an unsharp mask filter with a radius of, for example, 150 pixels to the image. An important property of this spatially varying tonemapping is that it can reduce global contrast in the image, but preserve or even enhance local contrast, which is also what the human visual system does. This approach can lead to the halo artifacts on images that one often sees on tonemapped HDR images. Better algorithms can reduce the halo effects, but to some extent they are a price one has to pay for compressing high-dynamic-range images for low-dynamic-range display.
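The idea of reducing global contrast while preserving local contrast can be illustrated with a sketch related to the unsharp mask mentioned above. The sketch makes several simplifying assumptions that are not taken from any particular product: it operates on a single scanline rather than a two-dimensional image, it works on the base-10 logarithm of luminance, it uses a plain box blur as the wide-radius filter, and it leaves the result in logarithmic units rather than scaling it to a display range. The names and the compression factor of 0.3 are illustrative.

```python
import math


def box_blur(values, radius):
    """Mean over a window of +/- radius samples, shortened at the ends of the line."""
    n = len(values)
    blurred = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        blurred.append(sum(values[lo:hi]) / (hi - lo))
    return blurred


def tonemap_local(scanline, radius=150, compression=0.3):
    """Compress the wide-radius average (global contrast), keep deviations from it."""
    log_lum = [math.log10(v) for v in scanline]
    base = box_blur(log_lum, radius)
    return [compression * b + (l - b) for l, b in zip(log_lum, base)]


# A dark region (face in shadow) next to a region 1000 times brighter (clouds),
# each with fine texture alternating by a factor of two.
scanline = ([1.0 if i % 2 == 0 else 2.0 for i in range(400)]
            + [1000.0 if i % 2 == 0 else 2000.0 for i in range(400)])
mapped = tonemap_local(scanline, radius=50)
```

Far from the boundary between the two regions, the step between the regions shrinks from 3 to 0.9 logarithmic units, while the factor-of-two texture inside each region is almost unchanged. Near the boundary the wide-radius average changes gradually, which is the origin of the halo artifacts discussed above.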
For the purposes of this patent application we will call a tonemapping operator ‘spatially varying’ if the tonemapping operation performed on each pixel can be different for pixels in different parts of the image and depends upon the values of pixels not in the immediate vicinity of the pixel to be tonemapped. A common property of spatially varying tonemapping operators is that they preserve, enhance, or reduce contrast differently at different scales, or frequencies. For example, a spatially varying tonemapping operator might preserve contrast at the pixel scale, enhance contrast at a scale of 50 pixels, and reduce global contrast. By way of example, we would not consider a tonemapping operator using a single pass of a 3×3 or 5×5 kernel spatially varying, but we would consider a tonemapping operator spatially varying if that operator applies a 3×3 kernel iteratively on an image pyramid so as to obtain the effect of a sequence of kernels from the pixel scale to as large as the image itself. We would also consider a tonemapping operator that uses the Fourier transform to process different frequencies in different ways spatially varying. We would not consider an operator spatially varying if not at least some of the spatial variation is derived from the image itself; for example we would not consider an operator spatially varying that applies a different rule to pixels in the top half of the image than to the bottom half under the fixed assumption that the top half is the sky. Similarly, we would not consider a tonemapping operator spatially varying if the spatial variation is based on a human marking up certain sections of an image as opposed to computation from the image itself.
Another technique that has gotten some recognition in the art over the past few years, and that is offered by the Photomatix® Pro software package and others as an alternative option, is ‘image fusion.’ In image fusion, one skips the step of calculating a luminance map and obtains a tonemapped low-dynamic range image directly as a weighted average of the several original exposures, where for each pixel the original exposures are weighted according to how properly exposed they are; the better-exposed images get more weight and the worse-exposed images less. This is equivalent to first computing a luminance map and then applying a particular spatially uniform tonemapping operator. After this image fusion one may or may not apply a spatially varying tonemapping operator so as to enhance details on the image.
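A minimal sketch of image fusion as just described follows. It assumes pixel values normalized to the range 0 to 1 and uses a hat-shaped measure of how well a pixel is exposed. Both the measure and the small constant `epsilon`, which keeps the weights from summing to zero when every exposure is clipped, are assumptions made for this example and are not taken from any particular product.

```python
def well_exposedness(value):
    """1 at mid-gray, falling linearly to 0 at pure black and pure white."""
    return 1.0 - abs(2.0 * value - 1.0)


def fuse_exposures(exposures, epsilon=1e-6):
    """Weighted per-pixel average of the original exposures; no luminance map is built."""
    fused = []
    for pixel_values in zip(*exposures):
        weights = [well_exposedness(v) + epsilon for v in pixel_values]
        total = sum(weights)
        fused.append(sum(w * v for w, v in zip(weights, pixel_values)) / total)
    return fused


# Pixel 0 is black in the short exposure; pixel 1 is white in the long exposure.
short_exposure = [0.0, 0.5]
long_exposure = [0.5, 1.0]
fused = fuse_exposures([short_exposure, long_exposure])
```

Because each output value is a weighted average of values that already lie in the displayable range, the result is displayable without a separate range-compression step.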
HDR Video
While the process to capture HDR still photographs is burdensome but well-understood in the art, there is no such well-understood process to obtain motion pictures that capture scenes with high dynamic range and generate video footage suitable for display on electronic screens or projectors with low dynamic range.
Substantial progress has been made in prior art on one part of the equation, on the question of capturing details of a scene in high dynamic range. Perhaps the most advanced cameras on the market today that can do this are the ones made by Red.com, Inc. of Irvine, Calif. One technique employed by some of this company's cameras is taught in U.S. Pat. No. 8,159,579, which shows how to capture two images of different exposure levels near-simultaneously from the same sensor and write them to two separate video tracks. U.S. Pat. No. 6,593,970 teaches taking several exposures separately for the red, green, and blue image channels, and U.S. Pat. No. 5,247,366 teaches taking several exposures and combining them into one video frame component-wise by means of neighborhood, i. e., spatially uniform, processing.
While there has been progress on recording information from a scene with high dynamic range, there has been much less progress in turning these recordings, delivered as separate video tracks at least in the case of U.S. Pat. No. 8,159,579, into one video track that can be shown in satisfactory quality on an electronic screen of limited dynamic range. U.S. Pat. No. 5,247,366 teaches channel-wise neighborhood processing using a three-by-three kernel. U.S. Pat. Nos. 5,420,635, 5,517,242, 5,929,908, 6,418,245, 6,496,226, 6,593,970, 6,670,993, 6,707,492, 6,720,993, 6,952,234, 7,061,524, 7,106,913, 7,133,069, and 8,072,507 also teach various permutations of, or equivalent to, spatially uniform tonemapping. In U.S. Pat. Nos. 6,204,881 and 6,985,185 the user has to select whether to show dark or bright parts of the picture properly exposed, which is contrary to the normal purpose of having high-dynamic-range video, and U.S. Pat. No. 7,193,652 proposes displaying different exposures side by side, which again is not how people normally want to experience a motion picture. In U.S. Pat. No. 6,677,992 even the patent's title “Imaging apparatus offering dynamic range that is expandable by weighting two image signals produced during different exposure times with two coefficients whose sum is 1 and adding them up” clearly advertises the purely spatially uniform nature of the tonemapping taught by this patent. U.S. Pat. No. 8,014,445 teaches a method of encoding a high-dynamic-range signal in such a way that it can be played back on a low-dynamic-range display with reduced detail, but that does not give us a display with proper detail either.
Perhaps the most symptomatic expression for the malaise of recording video captured from scenes of high dynamic range for playback on screens with low dynamic range can be found in U.S. Pat. No. 7,239,757. This patent actually discusses the problem of tonemapping explicitly. For a tonemapping algorithm, it refers the reader to the paper by Mitsunaga and Nayar (2000), and in this paper the authors write that it is “hard to print/display the entire dynamic range of the computed image.” Thus the spatially uniform tonemapping algorithm proposed in this patent will, according to the algorithm's inventors, produce images that cannot be displayed on an ordinary screen. This might not be a problem if the video is being recorded for further postprocessing or for machine vision applications. But to this day there is no satisfactory solution that would let a user take a reasonably compact camera, or even a cellular phone he is carrying anyhow, record a scene with high dynamic range on video, and share that video without further processing on a social network such as Facebook®, run by Facebook, Inc., of Menlo Park, Calif., a very common use of video recordings today, or even just play the recording back in satisfactory quality on the device on which it was recorded.
So far our discussion of the difficulties with existing tonemapping operators has focused on spatially varying versus spatially uniform tonemapping and found that prior art regarding video processing teaches the application of spatially uniform tonemapping.
In tonemapping video, we have a spatial dimension that we may deal with, as in still images, in a spatially uniform or spatially varying manner. Video, however, has an additional dimension that still images do not have, the dimension of time. U.S. Pat. No. 7,239,757 makes an interesting contribution with its teaching of “temporal tone mapping,” which works “by essentially carrying over the statistics from frame to frame.” It is important to note here that the statistics being carried over in the process taught in this patent are “global parameters” applying to the entire image and used in a spatially uniform tonemapping process.
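The general idea of carrying a global statistic over from frame to frame can be illustrated with the following sketch. It is not the method of U.S. Pat. No. 7,239,757 or of any other cited reference. As assumptions for this example, the global statistic is the geometric mean luminance of the frame, the carry-over is a simple exponential moving average, and the per-pixel rule that uses the statistic is a simple ratio. All names and the carry-over factor of 0.9 are illustrative.

```python
import math


def log_average(frame):
    """Geometric mean luminance of one frame: a single global statistic."""
    return math.exp(sum(math.log(v) for v in frame) / len(frame))


def smoothed_statistics(frames, carry_over=0.9):
    """Blend each frame's statistic with the value carried over from earlier frames."""
    smoothed = []
    state = None
    for frame in frames:
        current = log_average(frame)
        state = current if state is None else carry_over * state + (1.0 - carry_over) * current
        smoothed.append(state)
    return smoothed


def tonemap_frame(frame, statistic):
    """Spatially uniform rule: every pixel is mapped using the same global statistic."""
    return [v / (v + statistic) for v in frame]


# Three dark frames followed by three frames that are 100 times brighter.
frames = [[1.0, 1.0]] * 3 + [[100.0, 100.0]] * 3
statistics = smoothed_statistics(frames)
mapped = [tonemap_frame(f, s) for f, s in zip(frames, statistics)]
```

The smoothed statistic rises gradually after the scene brightens instead of jumping at once, so the mapping adapts over several frames. Because one number governs the whole frame, the temporal adaptation is the same everywhere in the image.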
This problem of temporal variation in tonemapping has not been treated much in the patent literature, and we will thus discuss prior art in the scholarly literature. Pattanaik et al. (2000), Kang et al. (2004), Ramsey et al. (2004), Irawan et al. (2005), Youm et al. (2005), and Van Hateren (2006) deal with temporal adaptation in the context of rendering HDR videos. Their temporal adaptation mechanisms are all spatially uniform, i.e., they react to global changes in luminance.
Wang et al. (2005) make the interesting proposal of viewing a video as a three-dimensional cube (two dimensions for the image and one for time) and applying a gradient-domain tonemapping technique extended to three dimensions to this cube. This approach, if implemented, could not be trivially done live while video was being recorded since it needs to look into the future as well as into the past in the same way that tonemapping a still image needs to look left of a pixel as well as right. However, the authors “choose to attenuate only spatial gradients,” thus leaving, in effect, the temporal aspect of tonemapping spatially uniform. Their method is also far too computationally intensive to apply while recording video. For a video with a resolution of 256×256 pixels, they report that their method needs 25 seconds per frame on a desktop computer. Today computers are faster than in 2005, but we have also come to expect resolutions far higher than used in this paper. Even the camera in a cellular telephone often captures video in resolutions of 1920×1080, which is more than 30 times the number of pixels used in their paper. For real-time video, we need to be able to deal with about 25 frames per second, not 25 seconds per frame.
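The size of this gap can be estimated from the figures quoted above. The following calculation is a rough illustration only; it assumes, as a simplification not stated in the cited paper, that processing time grows in proportion to the number of pixels, and it ignores improvements in computer speed since 2005.

```python
# Figures quoted above.
paper_pixels = 256 * 256                # 65,536 pixels per frame in the paper
hd_pixels = 1920 * 1080                 # 2,073,600 pixels per frame
reported_seconds_per_frame = 25.0
target_frames_per_second = 25.0

# Ratio of pixel counts.
pixel_ratio = hd_pixels / paper_pixels  # 31.640625

# Speedup needed for real time at the paper's own resolution.
speedup_same_resolution = reported_seconds_per_frame * target_frames_per_second  # 625.0

# Speedup needed at 1920x1080, ASSUMING cost proportional to pixel count.
speedup_hd = speedup_same_resolution * pixel_ratio  # 19775.390625
```

Under these assumptions the method would have to run roughly 20,000 times faster than reported to tonemap 1920×1080 video in real time.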
Bennett et al. (2005) have an algorithm that filters over both space and time, but their paper deals with the problem opposite of the one that we deal with—their approach accepts that the video captured by a video camera will be in low dynamic range and have quality problems, and they seek to reconstruct a more pleasant video from whatever information is left in that low-quality video stream. This is a possible approach to deal with the problem, but it would indubitably be more desirable to have a high-quality video to begin with.
The limited progress we have seen with regard to tonemapping high-dynamic-range video signals very likely has to do with many proposed tonemapping operations being computationally expensive, so that a sufficient speed for processing video cannot be attained easily. The algorithms, I suspect, tend to be so computationally intensive in part because the academic literature developed to a large part out of attempts to implement sophisticated biological models of the human visual system in software, not out of attempts to make a system that is practical, even if it sometimes sacrifices fidelity to what the human visual system is doing. (Note, however, that it is not obvious that a tonemapping algorithm even should attempt to replicate the human visual system precisely; what matters is the subjective quality of the tonemapped images, and it is not obvious that replicating the visual system optimizes that.) In part for ease of implementation, in part because these sophisticated algorithms are not always easy to parallelize, many of the methods for tonemapping video proposed in the research literature execute on a computer's CPU, often only on a single thread.
Goodnight et al. (2003) propose doing the tonemapping on a graphics processing unit (GPU). Their algorithm uses a spatially varying tonemapping operator, but temporal variation is spatially uniform, and their method depends on a powerful graphics card in a desktop computer. Chiu et al. (2009) developed a special processor for hardware-accelerated tonemapping of still images that one could also adapt to perform the same tonemapping on video, again with spatially uniform temporal variation.
In short, all video capturing equipment known heretofore suffers from several disadvantages. Most video capturing equipment, and in particular that aimed at consumers and easily portable, does not attempt to solve the problem of high-dynamic-range recording at all. Some specialized solutions such as the cameras made by Red.com, Inc. of Irvine, Calif. are capable of capturing illumination detail from a scene in high dynamic range, but they produce several video tracks of an enormous data volume and leave it to the user somehow to turn these video tracks into one video track playable on a normal screen. This is acceptable where extensive postproduction work (‘grading’) is planned anyhow in professional recording settings, but it is not suitable for home use, wedding videography, and other activities where there is no desire, time, skill, or budget for substantial postproduction work. Several patents teach different methods of combining multiple exposures per video frame into one frame, but they all rely on spatially uniform tonemapping and consequently either produce video streams that cannot be displayed on normal computer or TV screens or produce video streams with very unsatisfactory local contrast. The result is, in the words of Bennett et al. (2005), that “people have long been accidentally capturing poorly exposed video with camcorders and motion-picture cameras (countless home videos of school plays and dance recitals lay testimony to this phenomenon)” with no practical solution for this problem discovered heretofore.