Field
This disclosure relates to three-dimensional video capture and, more specifically, to a system for fusing live-action stereoscopic video with LIDAR three-dimensional data to create volumetric virtual reality video.
Description of the Related Art
Prior art systems for generating three-dimensional environments for virtual reality (VR) applications fall into two basic categories. The first category is fully-rendered three-dimensional environments. These environments are generally created by developers and artists who use “game engine” software to create three-dimensional objects within a space and to apply art and lighting effects to those objects so that they appear to be physical objects. Thereafter, a user may “enter” the three-dimensional environment created by the developer and artist.
These environments have the benefit of being fully-realized three-dimensional spaces. Typically, avatars of a user (or, in the case of VR, the user themselves) can move freely within such spaces because they are designed to be fully explored. The problem with these spaces is that they only approximate real locations and, more fundamentally, require days or weeks of work by developers and artists to build. If the development time for the game engine itself is considered, far longer is needed to create the software that enables those developers and artists to make the environment at all. Although there are tools that can automate parts of the environment-creation process, much manual work must still be done to make the environments believable and fully navigable by an avatar or user. More complex systems combine the two approaches, performing detailed photogrammetry on locations that will be or have been the subject of two-dimensional filming in order to derive some three-dimensional data. Then, after the fact, the three-dimensional data may be combined with the video to create a somewhat immersive video environment. Because of the time and work involved, none of these systems is suitable for capturing “live-action” video while easily recording the characteristics of the associated three-dimensional space.
The second category is “on-rails” video, or a series of images, created by cameras with overlapping fields of view such that an entire sphere of images may be “stitched” together by software to create a “bubble” around a viewer. In its video incarnations, this category feels a bit like going along for a ride; in its individual-image incarnations, it feels like transitioning from one fixed position to another. While within the “bubble”, a user or avatar may “look around” at the interior of the sphere of images encasing them. These systems provide very high-quality images that accurately reflect the place in which those images were taken (typically an outdoor space). However, these images suffer from parallax issues, and the stitched images are often poorly aligned.
Moreover, the avatar or user may not deviate from the pre-selected path or fixed positions, and the images have no three-dimensional component whatsoever. Because movement is not envisioned, depth information is less important. But for true three-dimensional environments with at least some degree of freedom of movement within the environment, depth information, like that available in the fully-realized three-dimensional environments created using “game engine” style software, is highly desirable.
Stereoscopic photography, which uses two cameras to capture the three-dimensional characteristics of elements visible in the two corresponding images they produce, has been used to estimate the relative depth of objects within images. However, because virtual reality systems preferably use fully-immersive, fully-surrounding spherical spaces, often exteriors, and further because exteriors have depths that are virtually infinite (the sky) and tend to have long fields of view (e.g., a building several blocks away), the applicability of stereoscopic photography is limited. To calculate depth, a visibly perceptible disparity between the two corresponding images must be present. At great distances, the disparity of an object between the two images is minimal, if present at all. As a result, stereoscopic photography, when used to record video in exterior, open spaces, is inaccurate and insufficient to create fully-surrounding three-dimensional spherical spaces in which virtual reality users can move.
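The loss of disparity at great distances can be made concrete with the standard rectified-stereo relation Z = f·B/d, where Z is depth, f is focal length in pixels, B is the baseline between the two cameras, and d is disparity in pixels. The following Python sketch is illustrative only and is not the disclosed system; it assumes an ideal rectified pinhole stereo pair, and the focal length and baseline are hypothetical values. It shows disparity falling below one pixel beyond f·B (about 65 m here) and the first-order depth uncertainty growing with the square of distance.

```python
"""Illustrative sketch: stereo disparity and depth error versus distance.

Assumes an ideal rectified pinhole stereo pair, where depth Z = f * B / d.
The focal length and baseline below are hypothetical example values.
"""

FOCAL_PX = 1000.0   # hypothetical focal length, in pixels
BASELINE_M = 0.065  # hypothetical baseline (roughly human eye spacing), in meters


def disparity_px(depth_m: float, focal_px: float = FOCAL_PX,
                 baseline_m: float = BASELINE_M) -> float:
    """Disparity (in pixels) of a point at depth_m, from d = f * B / Z."""
    return focal_px * baseline_m / depth_m


def depth_error_m(depth_m: float, disparity_err_px: float = 1.0,
                  focal_px: float = FOCAL_PX, baseline_m: float = BASELINE_M) -> float:
    """First-order depth uncertainty for a disparity error: dZ ~= Z**2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


if __name__ == "__main__":
    # Disparity shrinks as 1/Z while depth error grows as Z**2.
    for z in (1.0, 10.0, 100.0, 500.0):
        print(f"Z = {z:6.1f} m  disparity = {disparity_px(z):7.3f} px  "
              f"depth error (1 px) = {depth_error_m(z):9.2f} m")
```

With these assumed values, a one-pixel disparity error corresponds to roughly 1.5 m of depth uncertainty at 10 m but more than 150 m at 100 m.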
Depth-sensor-based systems such as the Microsoft® Kinect enable similar functionality, but they operate in only one direction (namely, toward a user) and have extremely limited range. Therefore, these types of systems are not suitable for outdoor environments, for 360° spherical video recording, or for three-dimensional reconstruction of filmed environments.
Similarly, LIDAR systems have existed for some time but have been prohibitively expensive for general use. In addition, the depth data generated by LIDAR systems has not been easily combined with other data or easily translated into data that may be used to re-create three-dimensional environments. This is, in part, because LIDAR data, though highly accurate, is very sparse within a given target environment: the LIDAR depth data points are relatively distant from one another. This sparsity makes LIDAR data alone inadequate for recreating accurate three-dimensional renderings of target environments.
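The sparsity of LIDAR returns relative to camera pixels can be estimated from the scanner's angular resolution. The following Python sketch is illustrative only; it assumes a scanning LIDAR with fixed horizontal and vertical angular steps and a pinhole camera co-located with it, and every numeric value (step sizes, focal length) is hypothetical. The gap between adjacent returns grows linearly with range, while the fraction of image pixels that receive a return stays small at every range.

```python
"""Illustrative sketch: how sparse LIDAR returns are relative to camera pixels.

Assumes a scanning LIDAR with fixed angular steps and a co-located pinhole
camera; all numeric values are hypothetical examples.
"""
import math

H_STEP_DEG = 0.2   # hypothetical horizontal angular resolution, degrees
V_STEP_DEG = 2.0   # hypothetical vertical spacing between laser beams, degrees
FOCAL_PX = 1000.0  # hypothetical camera focal length, in pixels


def point_spacing_m(range_m: float, step_deg: float) -> float:
    """Approximate gap between adjacent returns at a given range (arc length)."""
    return range_m * math.radians(step_deg)


def pixel_coverage(focal_px: float = FOCAL_PX, h_step_deg: float = H_STEP_DEG,
                   v_step_deg: float = V_STEP_DEG) -> float:
    """Approximate fraction of pixels near the image center that receive a return."""
    px_h = focal_px * math.tan(math.radians(h_step_deg))  # pixels between returns horizontally
    px_v = focal_px * math.tan(math.radians(v_step_deg))  # pixels between returns vertically
    return 1.0 / (px_h * px_v)


if __name__ == "__main__":
    for r in (5.0, 20.0, 50.0):
        print(f"range {r:5.1f} m: horizontal gap {point_spacing_m(r, H_STEP_DEG):.3f} m, "
              f"vertical gap {point_spacing_m(r, V_STEP_DEG):.3f} m")
    print(f"pixels with a LIDAR return: {pixel_coverage():.2%}")
```

With these assumed values, fewer than 1% of the pixels near the image center coincide with a LIDAR return, and the vertical gap between returns is about 1.75 m at a range of 50 m.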
Throughout this description, elements appearing in figures are assigned three-digit reference designators, where the most significant digit is the figure number where the element is introduced and the two least significant digits are specific to the element. An element that is not described in conjunction with a figure may be presumed to have the same characteristics and function as a previously-described element having the same reference designator.