Immersive video, also known as 360-degree, spherical, or panoramic video, is a video recording of a real-world scene in which the view in every direction is captured. During playback, the viewer controls the viewing direction via a mouse, a keyboard, or head-movement tracking sensors on a Head-Mounted Display (HMD), for example Virtual Reality (VR) goggles.
In an example of immersive video production, multiple cameras with overlapping fields of view capture all possible viewing angles. The video streams are aligned and undistorted in a compositing server. Each video stream is processed frame by frame. Each frame, typically referred to as a texture or image texture, is mapped onto polygon meshes. A polygon mesh is a projection of image textures onto polygons whose geometric vertices are arranged in a 2D or 3D coordinate system. If the server specifies a spherical view, the polygons are arranged in 3D coordinates. If a panoramic or dome projection is sufficient, the polygons are arranged in 2D coordinates.
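As a rough illustration of polygons arranged in 3D coordinates for a spherical view, the following minimal sketch builds a latitude/longitude sphere mesh with texture coordinates. The function name `sphere_mesh`, the axis convention, and the grid layout are assumptions made for illustration and are not taken from any particular compositing server.

```python
import math

def sphere_mesh(rows, cols, radius=1.0):
    """Build a latitude/longitude sphere mesh (hypothetical helper).

    Returns (vertices, uvs, triangles):
      vertices  - (x, y, z) points on the sphere surface
      uvs       - (u, v) texture coordinates in [0, 1] for each vertex
      triangles - index triples into the vertex list
    """
    vertices, uvs, triangles = [], [], []
    for r in range(rows + 1):
        v = r / rows
        polar = v * math.pi              # 0 at the top pole, pi at the bottom pole
        for c in range(cols + 1):
            u = c / cols
            azimuth = u * 2.0 * math.pi  # full turn around the vertical axis
            x = radius * math.sin(polar) * math.cos(azimuth)
            y = radius * math.cos(polar)
            z = radius * math.sin(polar) * math.sin(azimuth)
            vertices.append((x, y, z))
            uvs.append((u, v))

    # Split every grid cell into two triangles.
    stride = cols + 1
    for r in range(rows):
        for c in range(cols):
            a = r * stride + c   # top-left corner of the cell
            b = a + 1            # top-right
            d = a + stride       # bottom-left
            e = d + 1            # bottom-right
            triangles.append((a, d, b))
            triangles.append((b, d, e))
    return vertices, uvs, triangles
```

For a panoramic or dome projection, the same grid of texture coordinates could be kept while the vertices are laid out in a 2D plane instead.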
The coordinates of the polygon meshes are calculated based on calibration data, which specifies how to rectify, rotate, and translate image textures. In the case of panoramic cylindrical projection, each image texture is wrapped and distorted into a flat cylindrical view to create a larger high-resolution video frame, typically referred to as a stitched frame or image. Finally, the stitched frames are assembled into a new video stream. This process results in a stitched video covering a high-resolution video cylinder. The final resolution might be, for instance, 8000×3000 pixels for cylindrical video. After further processing and delivery, the video can eventually be rendered on a 360-degree video player client, which wraps the video within the environment of the video player to allow the user to look around on a display, such as a smartphone, a web site, or an HMD.
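To make the cylindrical case more concrete, the sketch below maps a viewing direction to a pixel position in a stitched cylindrical frame, using the 8000×3000 example resolution as the default. The function name, the yaw/pitch convention, and the assumption that the full frame width spans exactly 360 degrees are illustrative choices, not details given above.

```python
import math

def direction_to_cylinder_pixel(yaw, pitch, width=8000, height=3000):
    """Map a viewing direction (radians) to (x, y) in a cylindrical frame.

    yaw:   rotation around the vertical axis, in [-pi, pi); 0 is the frame centre
    pitch: elevation above the horizon, in (-pi/2, pi/2)
    Returns None when the direction points above or below the cylinder.
    """
    # Assumption: the frame width covers 360 degrees, so the cylinder
    # radius expressed in pixels is width / (2 * pi).
    radius_px = width / (2.0 * math.pi)
    x = (yaw + math.pi) / (2.0 * math.pi) * width
    y = height / 2.0 - radius_px * math.tan(pitch)
    if not 0.0 <= y < height:
        return None
    return x, y
```

Under these assumptions, a direction straight ahead at the horizon lands in the centre of the frame, while steep upward or downward directions fall outside the cylinder, which is why a cylindrical video does not cover the poles.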
Central Processing Unit (CPU) based video stitching is relatively time-consuming. For instance, stitching one minute of 48 frames per second (fps) video can take about 4 to 5 minutes on a modern PC. Using a single Graphics Processing Unit (GPU), the same stitching can be done in about 1.5 minutes. With multiple GPUs working at the same time, the time required goes down to about 20 seconds.
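The timing figures above can be converted into stitching throughput with simple arithmetic. The short sketch below does this; the constant and function names are illustrative, and the only inputs are the approximate durations quoted above.

```python
CLIP_SECONDS = 60                       # one minute of video
FPS = 48                                # frames per second of the source video
TOTAL_FRAMES = CLIP_SECONDS * FPS       # frames that must be stitched

# Approximate stitching times from the text, in seconds.
STITCH_SECONDS = {
    "CPU (4 min)": 240,
    "CPU (5 min)": 300,
    "single GPU (1.5 min)": 90,
    "multiple GPUs (20 s)": 20,
}

def stitching_rate(stitch_seconds):
    """Stitched frames produced per second of processing time."""
    return TOTAL_FRAMES / stitch_seconds

def realtime_factor(stitch_seconds):
    """Processing time divided by clip duration; below 1 is faster than real time."""
    return stitch_seconds / CLIP_SECONDS

for label, seconds in STITCH_SECONDS.items():
    print(f"{label}: {stitching_rate(seconds):.1f} frames/s, "
          f"{realtime_factor(seconds):.2f}x clip duration")
```

With these numbers, only the multi-GPU case stitches faster than the 48 fps at which the video plays back.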
There are two main techniques for capturing immersive video. The first generates spherical or panoramic video with lenses and mirrors using a single camera. The resulting immersive video typically has very low resolution. The second uses multiple cameras to generate separate video streams that need to be stitched together during post-production. This approach produces a higher resolution since multiple high-resolution video streams are stitched together.
The immersive video is created by stitching together video streams produced by multiple separate cameras, or by a single camera using multiple lenses and mirrors, so that the entire 360-degree scene can be covered. The stitched video is encoded and transmitted to user devices or clients using existing codecs and protocols. The user device or client wraps the video around the cylindrical or spherical environment of the video player to allow the user to look around. There are a number of significant shortcomings with this prior art technology.
For instance, the user device or client needs to be relatively powerful in order to carry out the processing associated with wrapping the immersive video and extracting and rendering parts of it on the display of the user device. On mobile devices, this consumes a lot of battery power. Furthermore, the immersive video needs to fit into the resolutions supported by the user devices and the codecs and protocols they use. This may result in a low resolution, since the entire 360-degree scene is sent to the user device or client.
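The resolution penalty can be illustrated with a small calculation. The sketch below estimates how many pixel columns of a 360-degree frame fall inside the viewer's field of view. The 90-degree field of view and the 3840-pixel device limit are hypothetical values chosen for illustration; only the 8000-pixel width comes from the cylindrical example given earlier.

```python
def visible_columns(frame_width_px, horizontal_fov_deg):
    """Pixel columns of a 360-degree frame that fall inside the field of view."""
    return frame_width_px * horizontal_fov_deg / 360.0

# Hypothetical viewer field of view, in degrees.
FOV_DEG = 90

# Full-resolution stitched cylinder (width from the 8000x3000 example).
full_res_visible = visible_columns(8000, FOV_DEG)

# Hypothetical case: the device or codec limits the frame width to 3840 pixels.
limited_visible = visible_columns(3840, FOV_DEG)

# Share of the transmitted columns that is actually on screen at any moment.
visible_share = FOV_DEG / 360.0

print(full_res_visible, limited_visible, visible_share)
```

Under these assumptions, three quarters of the transmitted columns are outside the view at any given moment, and scaling the whole scene down to fit the device reduces the detail of the visible part proportionally.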
There is therefore room for improvement within the field of immersive video.