An image of one or more objects in a scene can be captured from the viewpoint of a camera. For example, the image may be a visual image representing the visual appearance of the objects in the scene, e.g. in a format using Red, Green and Blue (RGB) values for the pixels of the image, or in a format using luma and chrominance values (e.g. YUV). In some cases there may be more than one camera, each capturing a different image of the scene. Each image of the scene represents a view of the scene from the viewpoint of the respective camera. The images may represent frames of a video sequence.
As well as capturing the visual input images, depth images may be captured representing the distances from the camera to points in the scene as a function of pixel position. Depth cameras for capturing depth images are known in the art, and may for example work by projecting a pattern of infrared light into a scene and inferring depth from the disparity introduced by the separation between projector and sensor (this is known as a structured light approach). Alternatively, depth cameras may use a time-of-flight approach, in which interference is used to measure the time taken for rays of infrared light to reflect back to the sensor, and the depths of points in the scene are inferred from this time. As another alternative, depth images can be acquired from a scene reconstruction which is registered to the scene, given knowledge of the camera calibration, for example by rendering the distance to points in the scene by means of a depth buffer.
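By way of illustration only, the two depth-sensing relations described above can be sketched as follows. The sketch assumes an idealised, rectified projector-sensor pair for the structured light case and, for the time-of-flight case, either a directly measured round-trip time or a continuous-wave signal whose phase shift is measured. The function names, parameter names and units are hypothetical and are not taken from any particular depth camera.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Structured light: triangulated depth Z = f * B / d.

    Assumes a rectified projector-sensor pair (an idealisation).
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def depth_from_round_trip_time(round_trip_s):
    """Time of flight: light travels to the surface and back, so halve the path."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def depth_from_phase_shift(phase_rad, modulation_hz):
    """Time of flight from the phase shift of an assumed continuous-wave signal.

    The round-trip time is phase / (2 * pi * f), giving d = c * phase / (4 * pi * f).
    The result is only unambiguous for phase shifts below 2 * pi.
    """
    return SPEED_OF_LIGHT_M_PER_S * phase_rad / (4.0 * math.pi * modulation_hz)
```

Under these assumptions, a larger disparity indicates a nearer point, whereas a longer round-trip time (or larger phase shift) indicates a more distant point.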
Images are produced by the interaction of light with the surfaces of objects in a scene. If the surface properties that produce an image, or set of images, can be found, then the image of the scene can be manipulated (e.g. relit under arbitrary lighting conditions) using conventional computer rendering techniques. Albedo (which may be referred to as “intrinsic colour”), shading, surface normals and specularity are examples of intrinsic surface properties, and techniques that estimate these from one or more images are known in the art as “intrinsic image methods”. Similarly, the extension to video is known as “intrinsic video”. The problem of estimating the intrinsic surface properties can be simplified by assuming that the objects are non-specular and that the scene lighting is diffuse.
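As a minimal sketch of how such properties may be used for relighting, the following example renders a single pixel from its albedo and surface normal under a new light. It assumes a non-specular (Lambertian) surface and, purely for illustration, a single directional light; the function names and the light model are assumptions and are not prescribed by the description above.

```python
import math


def normalise(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def lambertian_shading(normal, light_dir):
    """Shading for a non-specular surface: clamped cosine between normal and light."""
    n = normalise(normal)
    l = normalise(light_dir)
    return max(0.0, sum(a * b for a, b in zip(n, l)))


def relight_pixel(albedo_rgb, normal, light_dir, light_rgb=(1.0, 1.0, 1.0)):
    """Relit pixel value = albedo * light colour * shading, per colour channel."""
    s = lambertian_shading(normal, light_dir)
    return tuple(a * c * s for a, c in zip(albedo_rgb, light_rgb))
```

Because the albedo does not depend on the lighting, the same albedo and normal can be reused with any `light_dir` and `light_rgb` to render the surface under different lighting conditions.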
Input images captured by cameras have implicit real-world lighting information, such that lighting artefacts are present (i.e. “baked-in”) in the images. In order to relight the objects shown in an input image, an image processing system can attempt to split the image values (i.e. pixel values) of an input image into a shading component and an intrinsic colour component of the objects in the image. The intrinsic colour component can be used for rendering the objects under different lighting conditions. The splitting of the image values into shading components and intrinsic colour components is not a simple task. Therefore, such image processing is typically performed “off-line” in a post-processing step after the images have been captured, because the amount of time and/or processing resources required is usually large. Furthermore, this image processing is normally limited to static scenes, rather than being performed on video sequences of moving objects.
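The splitting described above is commonly modelled as a per-pixel product, in which each image value equals an intrinsic colour value multiplied by a shading value. The sketch below assumes this multiplicative model and assumes that a scalar shading estimate per pixel is already available; obtaining that estimate is the difficult part and is not shown. The function names and the `eps` guard are hypothetical.

```python
def split_intrinsic_colour(image, shading, eps=1e-6):
    """Divide each RGB pixel by its scalar shading estimate.

    image   : rows of (r, g, b) tuples
    shading : rows of scalar shading estimates (assumed given)
    eps     : guard against division by zero in fully shadowed pixels
    """
    return [
        [tuple(c / max(s, eps) for c in px) for px, s in zip(row_px, row_sh)]
        for row_px, row_sh in zip(image, shading)
    ]


def recombine(intrinsic_colour, shading):
    """Multiply intrinsic colour by shading to reproduce (or relight) an image."""
    return [
        [tuple(c * s for c in px) for px, s in zip(row_px, row_sh)]
        for row_px, row_sh in zip(intrinsic_colour, shading)
    ]
```

Passing a different shading image to `recombine` yields the same objects under different lighting, which is why the intrinsic colour component is the quantity of interest for relighting.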