1. Technical Field
The invention is related to a system and process for generating a panoramic video of a scene, and more particularly to such a system and process that employs a multi-camera rig to capture individual videos that collectively depict the surrounding scene, and which are then stitched together on a frame by frame basis to form the frames of the panoramic video.
2. Background Art
A panoramic video is a video made up of a sequence of panoramic frames depicting a surrounding scene. Ideally, the panoramic video makes available a seamless, 360 degree, view of this scene. In this way, a person viewing the panoramic video can select different portions of the scene to view on a real-time basis. In other words, a person viewing the panoramic video on the proper viewer can electronically steer his or her way around in the scene as the video is playing.
A number of different systems for generating panoramic videos have been previously developed. For the most part, these systems employ a mirror arrangement to capture the surrounding scene. For example, one existing system, referred to as a catadioptric omnidirectional camera system, incorporates mirrors to enhance the field of view of a single camera. Essentially, this system, which is described in a technical report entitled “Catadioptric Omnidirectional Camera” (Shree K. Nayar, Proc. of IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, June 1997), uses a camera that images a hemispherical mirror to generate a panoramic still image with a 360°×210° field of view. Another similar mirror-based system unwarps a spherically distorted video produced by the mirror-and-camera rig into a rectangular video stream then encodes it using standard streaming authoring tools. The person viewing a video produced via this system sees a sub-region of the scene captured in the panoramic video and can pan within the scene. While these mirror-based single camera systems are capable of producing convincing panoramic stills and video, they suffer from a relatively low resolution and a fairly complex camera rig owing to the mirror arrangements.
Another current panoramic video system, which attempts to overcome the resolution and complexity problems, forgoes the use of a mirror and employs a multiple camera head instead. The head consists of six cameras mounted on the six faces of a 2-inch cube, resulting in a 360°×360° field of view. The system also provides post-processing software to stitch the video streams from the individual cameras into a panorama. This multi-camera system has higher resolution than the catadioptric systems described above, but has the disadvantage of an expensive stitching stage and parallax artifacts due to the cameras not sharing a common center of projection.
One other system of note employs both a mirror arrangement and multiple cameras in an attempt to achieve a higher resolution without the stitching and parallax problems of the non-catadioptric, multi-camera system just described. Essentially, this system uses the mirror arrangement to create a common effective viewpoint for the cameras. While this system improves the resolution and reduces the aforementioned stitching and parallax problems, it still requires the use of a complex mirror-and-camera rig.
The present invention is directed at a non-catadioptric, multi-camera system and process that is capable of providing high resolution panoramic video, minimal stitching and parallax problems and a relatively simple camera rig.
The present invention involves a system and process for creating a panoramic video. Essentially, the creation of a panoramic video in accordance with the present invention first entails acquiring multiple videos of the scene being depicted. Preferably, these videos collectively depict a full 360 degree view of the surrounding scene. The acquisition phase also includes a calibration procedure that provides information about the camera rig used to capture the videos that is used in the next phase for creating the panoramic video. This next phase, which is referred to as the authoring phase, involves mosaicing or stitching the individual videos together to form a single panoramic video. In addition the authoring phase can include an encoding procedure, which may involve compressing the panoramic video. Such a procedure is useful in applications where the panoramic video is to be transferred over a network, such as the Internet.
A specialized camera rig is employed in the acquisition phase to capture a series of videos of the scene. The camera rig preferably consists of multiple digital video cameras that are disposed in a back to back fashion such that their lenses each point in a radially outward direction and view a different portion of the surrounding scene. The cameras are mounted on a surface, which for calibration purposes is capable of being rotated 360 degrees. Ideally, the cameras would be mounted such that their optical axes are coplanar and intersect at a common point coinciding with the axis of rotation of the mounting surface. While it is desired to come as close to these ideal mounting conditions as possible, any misalignment will be identified as part of the calibration procedure and corrected during the generation of the panoramic video.
The number of cameras used will depend on their field of view characteristics. The procedures used in the aforementioned authoring phase will work best if the lateral field of view of each camera overlaps by at least 20 percent. Thus, at least as many cameras as needed to provide a full 360 degree coverage of the scene including the desired overlaps would be employed. It is noted, however, that to minimize the cost of the camera rig and to reduce the processing and memory requirements of the present panoramic video system, it is preferred to use as few cameras as possible without significantly degrading the resolution of the resulting panoramic video. In this regard, if the foregoing field of view overlap cannot be achieved using a reasonable number of cameras with a standard lens arrangement, it is permissible to employ wide angle lenses. For the most part, distortion introduced by the wide angle lenses will be identified by the calibration procedure and can also be corrected during the generation of the panoramic video.
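The relationship between field of view, overlap and camera count can be illustrated with a short calculation. The following is a minimal sketch only. It assumes the cameras are evenly spaced around the rig and that the 20 percent figure is measured as the fraction of one camera's lateral field of view shared with its neighbor; the function names are hypothetical and are not part of the described system.

```python
import math


def min_cameras(fov_deg: float, overlap_fraction: float = 0.20) -> int:
    """Smallest number of evenly spaced cameras covering 360 degrees.

    Assumption: adjacent cameras must share at least `overlap_fraction`
    of one camera's lateral field of view.
    """
    if not 0.0 < fov_deg <= 360.0:
        raise ValueError("fov_deg must be in (0, 360]")
    if not 0.0 <= overlap_fraction < 1.0:
        raise ValueError("overlap_fraction must be in [0, 1)")
    # Each camera may contribute at most this much angular spacing.
    max_spacing_deg = fov_deg * (1.0 - overlap_fraction)
    # The small epsilon guards against floating point round-up.
    return math.ceil(360.0 / max_spacing_deg - 1e-9)


def actual_overlap(fov_deg: float, num_cameras: int) -> float:
    """Fraction of one camera's field of view shared with a neighbor."""
    spacing_deg = 360.0 / num_cameras
    return (fov_deg - spacing_deg) / fov_deg
```

Under these assumptions, cameras with a 90 degree lateral field of view would require five units rather than four, since four would leave no overlap at all.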
While any digital video camera can be employed in the camera rig, it is preferred that the cameras be capable of recording in a progressive scan mode at 30 frames per second (i.e., a mode in which each raster line is sampled to produce each frame of the video, rather than every other line as is the standard recording mode of most video cameras). This mode is preferred as it is the typical mode for display on a conventional PC monitor, and it is envisioned that the panoramic video will primarily be viewed on a PC monitor. Image frames captured in this mode are also easier to stitch together to form a panoramic frame. In addition, each of the cameras should be adjusted to have as near as possible the same settings. For example, it is preferred that each camera be set to the same zoom, focus, exposure and shutter speed, as well as being white balanced in the same way and having any image stabilization feature turned off.
The camera rig is calibrated prior to being used to capture a panoramic video. The first part of the calibration procedure involves producing a calibration video with one of the cameras. This is accomplished by setting the selected camera to record and rotating the camera rig 360 degrees in the same direction. The video output from the camera during the recording sweep is stored as a video file (e.g., an .avi file). In one preferred embodiment of the calibration procedure, the next process action involves holding the camera rig stationary while capturing a single image frame with each of the cameras. These calibration images are also stored in memory. The images could also be obtained in other ways. For example, all the cameras could be set to record during the aforementioned 360 degree sweep and the resulting videos stored. In this case, the required calibration images would be taken from each of the videos by, for example, extracting an image frame from each video that was captured at approximately the same moment, or when the rig was stationary.
The calibration video will contain many more frames than are necessary to produce the panoramic image of the surrounding scene that will be used in the calibration procedure. To this end, it is preferred to extract only those frames that are needed from the calibration video. This can be accomplished by selecting just enough of the video's frames needed to depict every part of the surrounding scene with an overlap between frames of about one-half to three-quarters.
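The frame selection just described can be sketched as follows. This is an illustrative sketch under stated assumptions: the rig is taken to rotate at a constant angular speed, the calibration video is taken to span exactly one 360 degree sweep, and the function name and parameters are hypothetical.

```python
def select_calibration_frames(total_frames: int,
                              fov_deg: float,
                              overlap_fraction: float = 0.5) -> list:
    """Pick frame indices so consecutive picks overlap by at least
    `overlap_fraction` of the camera's lateral field of view.

    Assumes constant rotation speed over exactly one full sweep.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0.0 <= overlap_fraction < 1.0:
        raise ValueError("overlap_fraction must be in [0, 1)")
    deg_per_frame = 360.0 / total_frames
    # Largest rotation allowed between two selected frames.
    step_deg = fov_deg * (1.0 - overlap_fraction)
    # Round down so the overlap is never smaller than requested.
    step = max(1, int(step_deg // deg_per_frame))
    return list(range(0, total_frames, step))
```

For example, a 360 frame sweep recorded with a 40 degree field of view would be reduced to 18 frames at one-half overlap, or 36 frames at three-quarters overlap.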
Inaccuracies in the structure or alignment of the camera lenses can cause the images taken by the respective cameras to have different image centers. This is particularly true if wide angle lenses are employed. Thus, it is preferred that a centering action be performed at this point in the calibration procedure. Any conventional process capable of shifting the selected frames so as to exhibit a common pixel location for the center of each image can be employed for this purpose.
The next action in the calibration procedure involves mosaicing or “stitching together” the individual selected frames into a single panoramic image of the surrounding scene and determining the focal length and radial distortion associated with the video camera used to capture the calibration video. While any existing mosaicing process can be employed for this purpose, it is preferred that it be accomplished using the methods taught in U.S. Pat. No. 6,018,349 entitled “Patch Based Alignment Method and Apparatus For Construction of Image Mosaics”.
The previously captured calibration images from each camera are next stitched into the panoramic image. Here again it is preferred that the process described in U.S. Pat. No. 6,018,349 be used to accomplish this task. A by-product of this stitching process will be the rotation matrices for each camera. These matrices along with the previously computed focal length and radial distortion estimates are associated with each camera and stored for future use.
It is optionally possible to perform a block adjustment procedure at this point in the calibration to refine the estimates for the focal length and rotation matrices. Preferably, the block adjustment procedure described in U.S. Pat. No. 5,987,164 entitled “Block Adjustment Method and Apparatus for Construction of Image Mosaics” is employed for this purpose.
The frames of the calibration video used to create the original panoramic image are next deleted, thereby leaving just the calibration images captured by the cameras. These images form a panoramic image that will be referred to as the calibration panoramic image.
The aforementioned block adjustment procedure is in essence a global alignment of the images making up the panoramic image. However, there can still be localized mis-registrations present in the calibration panoramic image that can appear as double images (ghosting) or blurring. One way such distortions can occur derives from the fact that the mosaicing process assumes an idealized camera model. However, in actuality un-modeled radial distortion (i.e., that component of the radial distortion that cannot be adequately modeled in the mosaicing process), tangential distortion, and non-square pixel distortion, among others, can cause the local mis-registrations. Further, in regard to the aforementioned calibration images making up the calibration panoramic image, these images were captured by the respective cameras from different viewpoints. While this is irrelevant for objects in the scene that are far away from the cameras, objects that are closer in can create a double image in the panorama. This is referred to as parallax distortion. In the context of the calibration panorama, a close-in object depicted in two of the calibration images captured by adjacent cameras will result in a double image in the overlap region of these images in the calibration panoramic image. To compensate for these localized distortions, an estimate of the amount of local mis-registration can be computed and then each image in the panorama can be locally warped to reduce any ghosting or blurring. Preferably, this is accomplished using the procedures described in U.S. Pat. No. 5,986,668 entitled “Deghosting Method and Apparatus for Construction of Image Mosaics”.
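The dependence of parallax distortion on object distance can be illustrated numerically. The sketch below is not the deghosting procedure of the referenced patent. It merely applies the standard pinhole stereo approximation, in which the image shift of a point relative to a very distant point is roughly the focal length in pixels times the viewpoint separation divided by the depth. The function name and the sample values are hypothetical.

```python
def parallax_disparity_px(focal_length_px: float,
                          baseline_m: float,
                          depth_m: float) -> float:
    """Approximate image shift, in pixels, between two viewpoints.

    Pinhole, small-angle approximation: disparity = f * b / Z, measured
    relative to a point at infinite distance.
    """
    if depth_m <= 0.0:
        raise ValueError("depth_m must be positive")
    return focal_length_px * baseline_m / depth_m
```

With an assumed focal length of 500 pixels and viewpoints 0.1 meter apart, an object 1 meter away shifts by about 50 pixels, which would be clearly visible as a double image, whereas an object 50 meters away shifts by about 1 pixel.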
Finally, the “distortion-corrected” calibration panoramic image is optionally saved, along with the previously computed focal length and radial distortion estimates, rotation matrices for each camera, and the deghosting correction field, in a calibration file. It is noted that the inclusion of the calibration panoramic image is optional since it is not actually used in the creation of a panoramic video as are the other items constituting the calibration file. These other items are used in the mosaicing process to correct for distortions in the frames of the panoramic video. However, the calibration panoramic image does provide a visual indication of the accuracy of the calibration process, and so can be included for that purpose. It is noted that any time the camera rig or the camera parameters are adjusted, it is necessary to repeat the calibration procedure to produce a new calibration file before authoring a panoramic video.
The panoramic video is generated by first capturing videos of the surrounding scene using the cameras of the previously-described camera rig. Ideally, the cameras would have a synchronization feature by which each camera can be started at the same instant. In this way, the frame numbers of the frames captured by each camera will correspond. For example, the first frame captured by each camera will have been captured at approximately the same time, the second frame captured by each camera will have been captured at approximately the same time, and so on. It is important to know which frames were captured by the cameras at the same time because these corresponding frames will be stitched together into a panoramic image that will form one of the frames of the panoramic video. However, if the cameras do not possess a synchronization feature, an alternate procedure can be performed to find the frames captured by the cameras at the same time. In essence, the procedure involves recording a synchronizing event at the beginning of the recording process and using it to find the corresponding frames. Specifically, all the cameras in the camera rig are set to the record mode. Preferably, the synchronizing event involves bringing an object simultaneously into the view of each pair of adjacent cameras, in turn, or performing some action in the view of each adjacent pair of cameras, in turn. For example, a clapboard could be used for this purpose. The recording of the scene then continues until enough video is captured to produce the desired length panoramic video. It is noted that the scene need not be static during this recording, unlike the recording phase of the calibration procedure. The video captured by each camera would be stored as a separate file (e.g., as an .avi file).
While the above-described synchronizing method is preferred, other methods can be employed. For example, an object could be brought into view of all the cameras at the same time. This might be accomplished by raising a box or ring up into the field of view of the cameras. However, it is noted that because each camera must “see” the object simultaneously, this could be a difficult task. Another synchronizing method that could be employed would be to suddenly change the illumination in the environment to a noticeable degree. This has the advantage that all the cameras would “see” this event at the same time. Of course, this last method is only appropriate for indoor environments where control of the illumination is feasible.
Once the videos that are to be made into the panoramic video have been captured, the frame number in each video that coincides with the recorded synchronization event (if present) is identified. In other words, the frame number of the frame in which the synchronization object first appears, or when the synchronization action is performed (e.g., the clap of the clapboard), in each video, would be identified. The relative frame number offsets among the cameras are then computed. Once the frame offsets have been computed, the panoramic frames of the video are generated. This is done by selecting a starting frame from one of the videos. The previously computed offsets are then used to identify the corresponding frame in each of the other videos that was captured at approximately the same moment as the selected starting frame. The first frames identified in each video via the foregoing procedure are then stitched together, preferably using the aforementioned mosaicing process of U.S. Pat. No. 6,018,349, and the parameters saved in the calibration file (i.e., the focal length and radial distortion estimates, the rotation matrices for each camera, and the deghosting correction field). This same process is then repeated for the next consecutive frames in each video to produce each subsequent frame of the panoramic video, until the desired length panoramic video is produced, or there are no more frames left in any one of the videos.
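The offset computation and frame pairing described above can be sketched as follows. This is a minimal illustration only. It assumes that a single synchronization frame number has already been identified for each camera, that all cameras record at the same frame rate, and that the first camera serves as the reference; the function names are hypothetical, and the stitching itself is not shown.

```python
def frame_offsets(sync_frames, reference=0):
    """Offset of each camera's frame numbering relative to a reference
    camera, given the frame number at which each saw the sync event."""
    ref = sync_frames[reference]
    return [s - ref for s in sync_frames]


def corresponding_frames(sync_frames, video_lengths, start_frame,
                         max_frames=None):
    """List tuples of frame indices (one per camera) captured at about
    the same moment, starting at `start_frame` of the reference video.

    Stops when the requested length is reached or when any one of the
    videos has no frame left for the next tuple.
    """
    offsets = frame_offsets(sync_frames)
    result = []
    k = start_frame
    while max_frames is None or len(result) < max_frames:
        indices = [k + off for off in offsets]
        if any(i < 0 or i >= n for i, n in zip(indices, video_lengths)):
            break
        result.append(tuple(indices))
        k += 1
    return result
```

Each tuple returned identifies the set of frames that would be stitched into one panoramic frame. For instance, if three cameras recorded the synchronization event at frames 12, 15 and 10, then frame 12 of the first video corresponds to frame 15 of the second and frame 10 of the third.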
It should be noted, however, that while the deghosting correction field provided in the calibration file will correct much of the camera related distortion, there may be objects close in to the cameras in the videos making up the panoramic video that would result in a double image (i.e., parallax distortion) in one or more of the panorama video frames. Granted, if the scene being recorded in the panoramic video is the same as the scene used to create the aforementioned calibration panoramic image, then some close in objects that are stationary in the scene may be compensated for by the deghosting correction field of the calibration file. However, a different scene may be involved in the panoramic video, and even in cases where the same scene is used, new close-in objects may be present that were not there during the calibration recording. Thus, it is advantageous to perform another deghosting procedure at this point in the panoramic video creation process. The preferred deghosting procedure is the same one described earlier and the subject of U.S. Pat. No. 5,986,668.
Further, it was stated earlier that each camera was to be set at the same exposure setting. However, this can be a difficult task, and it is possible that the video captured by one camera may have a somewhat different exposure level than the video captured by another of the cameras. If this is the case, then an optional exposure adjustment procedure could be applied to correct any mismatch between individual video image frames making up each of the panoramic video frames.
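The particular exposure adjustment procedure is not specified above. One simple possibility, offered purely as an assumption for illustration, is to scale one image by a gain chosen so that its mean intensity in the region of overlap matches that of its neighbor. The sketch below operates on flat lists of 8-bit grayscale pixel values, and its function names are hypothetical.

```python
def exposure_gain(ref_overlap, other_overlap):
    """Gain that brings the mean intensity of `other_overlap` up or down
    to the mean intensity of `ref_overlap` (both are overlap-region
    pixel values from two adjacent images)."""
    mean_ref = sum(ref_overlap) / len(ref_overlap)
    mean_other = sum(other_overlap) / len(other_overlap)
    if mean_other == 0:
        return 1.0  # Avoid division by zero for an all-black region.
    return mean_ref / mean_other


def apply_gain(pixels, gain):
    """Scale pixel values by `gain`, clamping to the 8-bit range."""
    return [min(255, max(0, round(p * gain))) for p in pixels]
```

A single global gain of this kind would not correct differences in white balance or non-linear camera response, so it should be regarded only as a starting point.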
It is noted that the preferred mosaicing procedure results in each panoramic frame of the panoramic video being represented as the individual images plus associated transforms for each. This data must be converted into one or more images that can be readily viewed. Preferably, this is accomplished by using the individual images and associated transformations to construct a series of texture maps for each frame of the panoramic video based on a selected environment model. The shape of the environment model is left up to the author of the panoramic video. The preferred approach for constructing the texture maps is described in U.S. Pat. No. 6,009,190 entitled “Texture Map Construction Method and Apparatus for Displaying Panoramic Image Mosaics”. The resulting texture maps are saved as one or more video files.
The foregoing procedure generates the video texture maps that will be used by a panoramic video viewer to play the panoramic video. The viewer will need certain information to play panoramic videos produced in accordance with the present procedures. Specifically, the viewer will first need the video data associated with the texture maps of the panoramic video frames, or at least a pointer as to where the data can be found or obtained. In addition, the viewer will need to know the shape of the environment model used to create the texture maps. Finally, the viewer should be informed of any navigation boundaries present in the data. For example, if just the panoramic video frames created from the videos captured by the previously described camera rig are provided in the files, then there would be no data associated with regions above and below the longitudinal field of view of the cameras. Thus, the navigation boundary information might describe the maximum up and down angle that the viewer can display to a user. Preferably, the aforementioned information is provided in a data file, which will be referred to for the purposes of the present invention as a .vvi file.
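The three kinds of information the viewer requires can be pictured as a small record. The layout of the .vvi file is not specified here, so the following is a hypothetical sketch only; the class name, the field names and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ViewerInfo:
    """Hypothetical in-memory form of the information a viewer needs."""
    video_source: str          # Texture map video data, or a pointer to it.
    environment_model: str     # Shape used to build the texture maps.
    max_tilt_up_deg: float     # Navigation boundary above the horizon.
    max_tilt_down_deg: float   # Navigation boundary below the horizon.

    def clamp_tilt(self, tilt_deg: float) -> float:
        """Restrict a requested up or down viewing angle to the
        navigation boundaries."""
        return max(-self.max_tilt_down_deg,
                   min(self.max_tilt_up_deg, tilt_deg))


info = ViewerInfo("pano_textures.avi", "cylinder", 30.0, 25.0)
```

In this sketch a viewer would call `clamp_tilt` whenever the user attempts to pan up or down, so that no region lacking data is displayed.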
It should be noted in regard to the navigational boundaries, that any gaps in the data, such as the regions above and below the longitudinal field of view of the cameras in the previously-described camera rig, could be filled using static texture maps. This could be accomplished by providing, in addition to the data associated with each frame of the panoramic video, a static texture map (e.g., in the form of a bitmap in the .vvi file or a pointer thereto) that would be used by the viewer as an addendum to all the panoramic video frames. For example, suppose the scene captured in the panoramic video was a room. The ceiling of the room would typically not change throughout the video. Therefore, a static image of the ceiling could be provided and displayed by the viewer, should the user want to pan up past the data provided by the panoramic video frames. Further, it should be noted that the present invention is not limited to the exemplary camera rig described earlier. Additional cameras could be added to capture the part of the scene above and below the longitudinal field of view of the cameras used in the exemplary rig. The videos captured by these cameras would be processed in the same way and included in the texture maps saved in the video data.
It may also be advantageous to compress the data associated with the panoramic video frames. While uncompressed data could be stored on a medium such as a hard drive, CD or DVD, it would be better to compress it if the files are intended to be transferred over a network (e.g., the Internet). It is also noted that an audio track could be added to the panoramic video which could be encoded into the data. The viewer would decode the audio and play it back in conjunction with playing each panoramic video frame. The audio itself could be captured using just one of the video cameras, or it could be a combined audio track composited from sound recorded by some or all of the cameras. Further, each video camera in the camera rig could be used to record audio of the environment. The audio data recorded by each camera would be used to represent the environmental sounds that would be heard by a listener facing the portion of the scene contained within that camera's field of view. Audio data associated with each portion of the scene would be encoded into the texture map data files and assigned to the texture map data representing that portion. In conjunction with this latter embodiment, the panoramic video viewer employed to play the panoramic video would decode it and play back the particular audio assigned to the portion of the scene that is currently being viewed by a user of the system. If the portion of the scene being viewed by the user cuts across the regions captured by adjacent cameras of the camera rig, the viewer could blend the audio assigned to each region and play this composite audio in conjunction with the texture maps associated with that part of the scene.
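The blending of audio across adjacent camera regions could take many forms. The following sketch shows one possibility, a linear crossfade based on viewing direction. It assumes the cameras are evenly spaced in the horizontal plane with the first camera centered at 0 degrees; the function name is hypothetical and the weighting scheme is an assumption rather than a prescribed method.

```python
def audio_blend_weights(view_yaw_deg: float, num_cameras: int) -> list:
    """Per-camera audio weights for a given viewing direction.

    A camera's weight falls linearly from 1 at its own center direction
    to 0 at the center direction of either neighbor, so at most two
    adjacent cameras contribute and their weights sum to 1.
    """
    spacing = 360.0 / num_cameras
    weights = []
    for i in range(num_cameras):
        center = i * spacing
        # Shortest angular distance between the view and this camera.
        diff = abs((view_yaw_deg - center + 180.0) % 360.0 - 180.0)
        weights.append(max(0.0, 1.0 - diff / spacing))
    return weights
```

With four cameras, a user looking midway between the first and second cameras would hear an equal mix of their two audio tracks, while a user looking straight along the first camera's axis would hear only that camera's track.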
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.