Effectively capturing an environment or scene by a set of cameras and rendering that environment to simulate views that differ from the actually-captured locations of the cameras is a challenging exercise. These cameras may be grouped together in a rig to provide various views of the environment to permit capture and creation of panoramic images and video that may be referred to as “omnidirectional,” “360-degree” or “spherical” content. The capture and recreation of views is particularly challenging when generating a system to provide simulated stereoscopic views of the environment. For example, for each eye, a view of the environment may be generated as an equirectangular projection mapping views to horizontal and vertical panoramic space. In the equirectangular projection, horizontal space represents horizontal rotation (e.g., from 0 to 2π) and vertical space represents vertical rotation (e.g., from 0 to π, representing a view directly downward to a view directly upward) space for display to a user. To view these images, a user may wear a head-mounted display on which a portion of the equirectangular projection for each eye is displayed.
Correctly synthesizing these views from physical cameras to simulate what would be viewed by an eye is a difficult problem because of the physical limitations of the cameras, difference in inter pupillary distance in users, fixed perspective of the cameras in the rig, and many other challenges.
The positioning and orientation of cameras is difficult to effectively design, particularly because of various physical differences in camera lenses and to ensure effective coverage of the various directions of view from the center of the set of cameras. After manufacture of a rig intended to position and orient cameras according to a design, these cameras may nonetheless be affected by variations in manufacturing and installation that cause the actual positioning and orientation of cameras to differ. The calibration of these cameras with respect to the designed positioning and orientation is challenging to solve because of the difficulties in determining effective calibration given various imperfections and variations in the environment in which the calibration is performed.
When generating render views, each captured camera image may also proceed through a pipeline to generate a depth map for the image to effectively permit generation of synthetic views. These depth maps should generate depth in a way that is consistent across overlapping views of the various cameras and that effectively provides a depth estimate for pixels in the image accurately and efficiently and account for changing depth across frames and between objects and backgrounds that may share similar colors or color schemes. In generating the depth maps, a large amount of inter-frame and inter-camera data may be processed, requiring extensive computational resources.
Finally, in render views, the various overlapping camera views can create artifacts when combined, and in some systems create unusual interactions when two or more cameras depict different colors or objects in an overlapping area. Resolving this problem in many systems may create popping, warping, or other problems in a render view. In addition, systems which use a single camera or stitch images together may not realistically simulate views for different eyes or at different locations.