Video coding and decoding using inter-picture prediction with motion compensation has been known for decades. Uncompressed digital video can consist of a series of pictures, each picture having a spatial dimension of, for example, 1920×1080 luminance samples and associated chrominance samples. The series of pictures can have a fixed or variable picture rate (informally also known as frame rate) of, for example, 60 pictures per second or 60 Hz. Uncompressed video has significant bitrate requirements. For example, 1080p60 4:2:0 video at 8 bits per sample (1920×1080 luminance sample resolution at 60 Hz frame rate) requires close to 1.5 Gbit/s of bandwidth. An hour of such video requires more than 600 GByte of storage space.
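The bitrate and storage figures above can be checked with a short calculation. The following is a minimal sketch; the function name and the chroma_factor parameter are illustrative assumptions, where chroma_factor is the number of chrominance samples per luminance sample summed over both chrominance planes (0.5 for 4:2:0 sampling).

```python
def uncompressed_bitrate(width, height, fps, bit_depth=8, chroma_factor=0.5):
    """Return the raw video bitrate in bits per second.

    chroma_factor: chroma samples per luma sample, summed over both
    chroma planes (0.5 for 4:2:0, 1.0 for 4:2:2, 2.0 for 4:4:4).
    """
    samples_per_picture = width * height * (1 + chroma_factor)
    return samples_per_picture * bit_depth * fps


# 1080p60 4:2:0 at 8 bits per sample, as in the example above.
bitrate = uncompressed_bitrate(1920, 1080, 60)
bytes_per_hour = bitrate / 8 * 3600

print(f"{bitrate / 1e9:.2f} Gbit/s")            # 1.49 Gbit/s
print(f"{bytes_per_hour / 1e9:.0f} GByte/hour")  # 672 GByte/hour
```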
One purpose of video coding and decoding can be the reduction of redundancy in the input video signal, through compression. Compression can help reduce the aforementioned bandwidth or storage space requirements, in some cases by two orders of magnitude or more. Both lossless and lossy compression, as well as a combination thereof, can be employed. Lossless compression refers to techniques where an exact copy of the original signal can be reconstructed from the compressed original signal. When using lossy compression, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is small enough to make the reconstructed signal useful for the intended application. In the case of video, lossy compression is widely employed. The amount of distortion tolerated depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television contribution applications. The compression ratio achievable can reflect that: higher allowable/tolerable distortion can yield higher compression ratios.
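The difference between lossless and lossy compression can be illustrated with a small sketch. It is not a video codec: it uses the general-purpose zlib compressor on a hypothetical row of 8-bit samples, and it stands in for the lossy step with a simple uniform quantizer. The sample data, the step size, and the helper names are assumptions made for illustration only.

```python
import zlib

# Hypothetical row of 8-bit samples (a ramp with small variations).
samples = bytes((i // 4 + (i % 3)) % 256 for i in range(1024))

# Lossless: decompression reproduces the input exactly.
packed = zlib.compress(samples)
restored = zlib.decompress(packed)
assert restored == samples


def quantize(data, step):
    """Map each sample to a coarser level, discarding detail."""
    return bytes(v // step for v in data)


def dequantize(levels, step):
    """Reconstruct each sample at the middle of its quantization interval."""
    return bytes(min(255, q * step + step // 2) for q in levels)


# Lossy: quantize before compressing; reconstruction is only approximate.
step = 8
lossy_packed = zlib.compress(quantize(samples, step))
lossy_restored = dequantize(zlib.decompress(lossy_packed), step)

# The distortion is bounded by half the quantization step.
max_error = max(abs(a - b) for a, b in zip(samples, lossy_restored))
print("lossless bytes:", len(packed))
print("lossy bytes:", len(lossy_packed), "max error:", max_error)
```

A larger step in this sketch allows more distortion and generally leaves less information to code, which mirrors the trade-off between tolerable distortion and compression ratio described above.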
A video encoder and decoder can utilize techniques from several broad categories, including, for example, motion compensation, transform, quantization, and entropy coding, some of which will be introduced below.
Video coding according to the above technologies, historically, has often assumed input content captured from a single camera. Other content that has attracted attention is known as stereoscopic content: two camera signals from cameras spatially aligned such that their axes of capture are approximately parallel, when combined in a suitable renderer, can provide the illusion of a three-dimensional picture when viewed under certain conditions. As the camera signals are highly correlated, certain video coding technologies have been devised that correlate the two signals to obtain a coding efficiency higher than could be achieved if both signals were coded individually. One such technology is known as multiview coding, as available in the form of profiles in both H.264 and H.265. In some cases, such multiview coding can be extended to the combined coding of more than two camera signals, while still leveraging the similarity, if any, of the multiple camera signals. However, multiview coding in the aforementioned sense still operates on planar camera images.
Recently, input devices have become available that include potentially many cameras at capture angles that are not parallel. To the extent possible based on the physical layout, those input devices allow the capture of a spherical volume of space. Such cameras may be marketed, and are referred to herein, as “360 cameras,” as they may capture a 360 degree field of view in all dimensions. Still image 360 cameras may operate by using a pan-tilt camera head which mounts a single camera with a lens that may capture a comparatively wide angle. By rotating both axes of the pan-tilt head to certain positions before taking a shot, a sequence of still images can be captured by the camera in such a way that the individual still images overlap to some extent. Using geometric information consistent with the control information used to control the pan-tilt camera head, these images can be geometrically corrected and stitched together to form a planar image that can be input into traditional image processing technologies, for example for the purpose of compression and transmission. The geo-correction and stitching process is referred to herein as “projection.” Rendering a 360 image can involve the selection of a viewpoint or viewing direction pertaining to the 360 captured scene, reverse geometric correction, de-stitching, etc., to create a planar image suitable for viewing. The reverse geometric correction and de-stitching is referred to herein as “de-projection” or “inverse projection.” Ideally, the scene depicted in that image would be the same as if a planar image had been captured in the viewing direction or from the selected viewpoint.
The above concept can be extended to the capture of video, as video can be represented by a series of still images captured and rendered in sufficiently short time intervals. 360 video capable cameras are commercially available in two basic variants. A first variant uses a rapidly rotating camera head with one or more cameras and appropriate lenses arranged such that, over the course of one rotation, a 360 degree scene (in one dimension) can be shot. The one or more cameras and lenses may be arranged such that the other dimension is covered. In order to obtain a frame rate of, for example, 60 frames per second, the camera head has to rotate at, for example, a minimum of 3600 revolutions per minute. In order to avoid camera blur, the capture time of the cameras may have to be very short, which may limit the number of photons the camera sensors are exposed to, leading to noisy images, a need for high illumination of the scene, or both. A second variant can omit the mechanically critical rotating head through the use of many cameras and appropriate lenses that are arranged such that the overlapping view of all cameras and lenses captures the whole 360 degree sphere, avoiding the aforementioned problems at the additional cost of requiring many more cameras and lenses. Mixed forms of the two concepts are also possible. Due to the decreasing cost of electro-optical components relative to mechanical components, there appears to be a trend away from mechanical 360 cameras towards multi-lens cameras. Further, some designs omit the capture in certain, often relatively narrow, capture angles based on the understanding that the 360 camera, being a physical device, necessarily needs to be mounted somewhere, and that the mounting hardware is likely of limited interest to the viewers.
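The rotation speed quoted for the rotating-head variant follows directly from the frame rate. A minimal sketch, assuming that one full revolution of the head yields exactly one 360 degree picture (the function name is illustrative):

```python
def min_head_rpm(frames_per_second):
    """Minimum rotation speed in revolutions per minute, assuming one
    complete 360 degree picture is captured per revolution."""
    seconds_per_minute = 60
    return frames_per_second * seconds_per_minute


print(min_head_rpm(60))  # 3600
```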
As with the still camera above, many 360 capable cameras geometrically project the images (captured in the same instant in time, or nearly so in case of a rotating head) together so as to form a series of projected images representing a 360 degree view of the camera.
The projection of an image representing a spherical capture scene onto a planar surface has been a known and well-studied problem for centuries. One well-known projection is, for example, the Mercator projection, introduced in 1569, which is a cylindrical projection and still in use in many maps of the world. Since then, many other projections have been devised, including, for example, equirectangular projection, conic projection, Aitoff projection, Hammer projection, Plate Carrée projection, and so forth. Referring to FIG. 1, shown are a few (of many) projections that may be suitable for the mapping of a spherical capture scene onto a planar surface, and have been studied in the context of 360 degree video compression. Shown is a globe (101), with three projections to a planar map of the globe. The first projection is known as equirectangular projection (102). The second projection is a cubical projection, wherein the surface of the globe is projected on six flat, square surfaces that represent the six directions at 90 degree displacement in each dimension. The six squares can be arranged on a single planar surface, resulting in a cube map (103). The arrangement of the surfaces of the cube in the planar surface presented here is one of several options. Finally, an icosahedral projection projects the globe's surface on the surface of an icosahedron (104) (a three-dimensional symmetric geometric figure composed of 20 triangular flat surfaces), and those 20 triangular surfaces can be arranged on a single planar surface (105). Again, many sensible options exist for the spatial allocation of the 20 triangular surfaces on the single planar surface (105).
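As an illustration of how a point on the sphere can be assigned a position in two of the projections mentioned above, the following sketch maps a three-dimensional viewing direction to equirectangular picture coordinates and to a cube face. The axis convention (+y up, longitude measured from +z towards +x), the function names, and the face labels are assumptions chosen for this example; they are not taken from FIG. 1 or from any particular standard, and the arrangement of the cube faces in the planar picture is deliberately left open.

```python
import math


def direction_to_equirect(x, y, z, width, height):
    """Map a non-zero direction vector to (u, v) sample coordinates in an
    equirectangular picture of the given size.

    Assumed convention: +y is up; longitude 0 points along +z.
    """
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0:
        raise ValueError("direction must be non-zero")
    lon = math.atan2(x, z)        # range -pi .. pi
    lat = math.asin(y / length)   # range -pi/2 .. pi/2
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


def direction_to_cube_face(x, y, z):
    """Return (face, a, b): the cube face hit by the direction and the two
    remaining coordinates scaled to the range -1 .. 1 on that face."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == 0 and ay == 0 and az == 0:
        raise ValueError("direction must be non-zero")
    if ax >= ay and ax >= az:     # dominant x component
        return ("+x" if x > 0 else "-x"), y / ax, z / ax
    if ay >= az:                  # dominant y component
        return ("+y" if y > 0 else "-y"), x / ay, z / ay
    return ("+z" if z > 0 else "-z"), x / az, y / az


# Looking straight ahead lands in the centre of the equirectangular picture.
print(direction_to_equirect(0.0, 0.0, 1.0, 3840, 1920))  # (1920.0, 960.0)
print(direction_to_cube_face(0.2, -0.1, 1.0))            # ('+z', 0.2, -0.1)
```

Rendering a viewport would evaluate such a mapping once per output sample, for the direction that sample looks in, and fetch the projected picture at the resulting position.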
These and other suitable projection formats attempt to map a spherical surface to a planar surface. The planar representation necessarily cannot be a mathematically correct representation of the geometric features of the sphere, but rather an approximation which has a certain amount of error. Where, spatially, that error is located and how big it can become depends on the nature of the projection. For example, it is well known that the equirectangular (equidistant cylindrical) projection significantly overstates longitudinal distances at latitudes far away from the equator. For example, in an equidistant projected map of the world, the island of Greenland is depicted larger than the continent of Australia, although in reality it has only about one third of the surface area.
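The size of this error can be made concrete for the equirectangular (equidistant cylindrical) case. Every row of such a picture has the same width, while the circle of latitude it represents shrinks in proportion to the cosine of the latitude, so horizontal distances are stretched by a factor of 1/cos(latitude). The sketch below evaluates that factor; the function name and the chosen latitudes are illustrative.

```python
import math


def equirect_horizontal_stretch(latitude_deg):
    """Factor by which an equirectangular picture stretches east-west
    distances at the given latitude, relative to the equator."""
    return 1.0 / math.cos(math.radians(latitude_deg))


for latitude in (0, 30, 60, 75):
    factor = equirect_horizontal_stretch(latitude)
    print(f"latitude {latitude:2d} deg: stretch {factor:.2f}x")
```

The factor is 1 at the equator, about 2 at 60 degrees, and grows without bound towards the poles, which is why regions at high latitudes occupy a disproportionate share of the projected picture.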