1. Technical Field
The invention is related to video techniques, and more particularly to a system and process for generating a 3D video animation of an object referred to as a 3D Video Texture.
2. Background Art
A picture is worth a thousand words. And yet there are many phenomena, both natural and man-made, that are not adequately captured by a single static photo. A waterfall, a flickering flame, a swinging pendulum, a flag flapping in the breeze: each of these phenomena has an inherently dynamic quality that a single image simply cannot portray.
The obvious alternative to static photography is video. But video has its own drawbacks. For example, if it is desired to store video on a computer or some other storage device, it is necessary to use a video clip of finite duration. Hence, the video has a beginning, a middle, and an end. Thus, the video becomes a very specific embodiment of a very specific sequence in time. Although it captures the time-varying behavior of the phenomenon at hand, it lacks the "timeless" quality of the photograph. A much better alternative would be to use the computer to generate new video sequences based on the input video clip.
There are current computer graphics methods employing image-based modeling and rendering techniques, where images captured from a scene or object are used as an integral part of the rendering process. To date, however, image-based rendering techniques have mostly been applied to still scenes such as architecture. These existing methods lack the ability to generate new video from images of the scene as would be needed to realize the aforementioned dynamic quality missing from single images.
The ability to generate a new video sequence from a finite video clip parallels somewhat an effort that occurred in music synthesis a decade ago, when sample-based synthesis replaced more algorithmic approaches like frequency modulation. However, to date such techniques have not been applied to video. It is a purpose of the present invention to fill this void with a technique that has been dubbed "video-based rendering".
It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references will be identified by a pair of brackets containing more than one designator, for example, [1, 2]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention is related to a new type of medium, which is in many ways intermediate between a photograph and a video. This new medium, which is referred to as a video texture, can provide a continuous, infinitely varying stream of video images. The video texture is synthesized from a finite set of images by rearranging (and possibly blending) original frames from a source video. While individual frames of a video texture may be repeated from time to time, the video sequence as a whole should never be repeated exactly. Like a photograph, a video texture has no beginning, middle, or end. But like a video, it portrays motion explicitly. Video textures therefore occupy an interesting niche between the static and the dynamic realm. Whenever a photo is displayed on a computer screen, a video texture might be used instead to infuse the image with dynamic qualities. For example, a web page advertising a scenic destination could use a video texture of palm trees blowing in the wind rather than a static photograph. Or an actor could provide a dynamic "head shot" with continuous movement on his home page. Video textures could also find application as dynamic backdrops for scenes composited from live and synthetic elements.
Further, the basic concept of a video texture can be extended in several different ways to further increase its applicability. For backward compatibility with existing video players and web browsers, finite duration video loops can be created to play back without any visible discontinuities. The original video can be split into independently moving regions and each region can be analyzed and rendered independently. It is also possible to use computer vision techniques to separate objects from the background and represent them as video sprites, which can be rendered in arbitrary image locations. Multiple video sprites or video texture regions can be combined into a complex scene. It is also possible to put video textures under interactive control, that is, to drive them at a high level in real time. For instance, by judiciously choosing the transitions between frames of a source video, a jogger can be made to speed up and slow down according to the position of an interactive slider. Or an existing video clip can be shortened or lengthened by removing or adding to some of the video texture in the middle.
The basic concept of the video textures and the foregoing extensions are the subject of the above-identified parent patent application entitled "Video-Based Rendering". However, the concept of video textures can be extended even further. Particularly, video textures could also be combined with traditional stereo matching and view morphing techniques to produce what will be referred to as "3D Video Textures". These 3D Video Textures are the subject of the present invention.
A 3D Video Texture can be constructed by first simultaneously videotaping an object from two or more different cameras positioned at different locations. Video from one of the cameras is used to extract, analyze and synthesize a video sprite of the object of interest using the previously described methods. In addition, the first, contemporaneous, frames captured by at least two of the cameras are used to estimate a 3D depth map of the scene. The background of the scene contained within the depth map is then masked out using a conventional background subtraction procedure and a clear shot of the scene background taken before filming of the object began, leaving just a 3D surface representation of the object. To generate each new frame in the 3D video animation, the extracted region making up a "frame" of the video sprite is mapped onto the previously generated 3D surface. The resulting image is rendered from a novel viewpoint, and then mapped into an appropriate 3D scene depiction. This depiction could be a flat image of the aforementioned separately filmed background which has been warped to the correct location, or it could be a depiction of a new scene created to act as the background for the 3D video texture. Further, it is noted that more than one 3D video texture could be created as described above, and then each mapped into different locations of the same 3D scene depiction.
It is also noted that in cases where it is anticipated that the subject could move frequently, the foregoing part of the procedure associated with estimating a 3D depth map of the scene and extracting the 3D surface representation of the object from the depth map could be repeated for each subsequent set of contemporaneous frames captured by at least two of the cameras. Then, each new frame in the 3D video animation would be generated by mapping the frame of the video sprite onto the 3D surface representation created, in part, from the video frame used to generate that frame of the video sprite. In this way, any movement by the subject is compensated for in the resulting 3D Video Texture.
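The masking step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the depth map, the camera frame and the clean background shot are same-sized grids of numbers, that a per-pixel absolute difference against a fixed threshold stands in for the "conventional background subtraction procedure", and that masked-out depth samples are marked with None. The function name mask_depth and the threshold value are hypothetical.

```python
def mask_depth(depth, frame, background, threshold=10):
    """Keep a depth sample only where the frame differs from the clean background shot.

    depth, frame and background are equally sized lists of rows (an assumed layout).
    Pixels judged to be background are replaced by None.
    """
    return [
        [d if abs(f - b) > threshold else None
         for d, f, b in zip(depth_row, frame_row, background_row)]
        for depth_row, frame_row, background_row in zip(depth, frame, background)
    ]
```

What remains after masking is the set of depth samples belonging to the object, onto which a frame of the video sprite could then be mapped.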
Before describing the particular embodiments of the present invention, it is useful to understand the basic concepts associated with video textures. The naive approach to the problem of generating video would be to take the input video and loop it, restarting it whenever it has reached the end. Unfortunately since the beginning and the end of the sequence almost never match, a visible motion discontinuity occurs. A simple way to avoid this problem is to search for a frame in the sequence that is similar to the last frame and to loop back to this similar frame to create a repeating single loop video. For certain continually repeating motions, like a swinging pendulum, this approach might be satisfactory. However, for other scenes containing more random motion, the viewer may be able to detect that the motion is being repeated over and over. Accordingly, it would be desirable to generate more variety than just a single loop.
The desired variety can be achieved by producing a more random rearrangement of the frames taken from the input video so that the motion in the scene does not repeat itself over and over in a single loop. Essentially, the video sequence can be thought of as a network of frames linked by transitions. The goal is to find good places to jump from one sequence of frames to another so that the motion appears as smooth as possible to the viewer. One way to accomplish this task is to compute the similarity between each pair of frames of the input video. Preferably, these similarities are characterized by costs that are indicative of how smooth the transition from one frame to another would appear to a person viewing a video containing the frames played in sequence. Further, the cost of transitioning from a particular frame to another frame is computed using the similarity between that other frame and the frame that follows the frame under consideration in the input video. In other words, rather than jumping to a frame that is similar to the current frame under consideration, which would result in a static segment, a jump would be made from the frame under consideration to a frame that is similar to the frame that follows the current frame in the input video. In this way, some of the original dynamics of the input video are maintained.
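As a minimal sketch of this cost computation, assume each frame is a flat list of pixel intensities and that dissimilarity is measured by the L2 distance. Both are illustrative assumptions, since the text above does not fix a metric, and the names frame_distance and transition_costs are hypothetical.

```python
import math

def frame_distance(a, b):
    """L2 distance between two frames given as equal-length pixel lists (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transition_costs(frames):
    """Return cost[i][j], the cost of showing frame j right after frame i.

    The cost compares frame j with frame i + 1, the frame that would naturally
    follow frame i, so the natural continuation j == i + 1 always costs 0.
    The last frame has no successor, so its row is left at infinity.
    """
    n = len(frames)
    cost = [[math.inf] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(n):
            cost[i][j] = frame_distance(frames[i + 1], frames[j])
    return cost
```

Under this convention a low cost[i][j] means that frame j looks like what the viewer expects to see after frame i, which is what keeps the jump from producing a static segment.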
While the foregoing basic approach can produce acceptably "smooth" video for scenes with relatively random motions, such as a candle flame, scenes having more structured, repetitive motions may be problematic. The issue lies in the fact that at the frame level the position of an object moving in a scene in one direction might look very similar to the position of the object moving in the exact opposite direction. For example, consider a swinging pendulum. The images of the pendulum swinging from left to right look very similar to those when the pendulum is swinging from right to left. If a transition is made from a frame depicting the pendulum during its motion from left to right to one depicting the pendulum during its motion from right to left, the resulting video sequence may show the pendulum switching direction in mid-swing. Thus, the transition would not preserve the dynamics of the swinging pendulum.
The previously described process can be improved to avoid this problem and ensure the further preservation of the dynamics of the motion by considering not just the current frame but its neighboring frames as well. For example, it can be required that, for a frame in the sequence to be classified as similar to some other frame, not only the frames themselves but also their neighbors be similar to each other. One way of accomplishing this is to modify the aforementioned computed costs between each pair of frames by adding in a portion of the cost of transitioning between corresponding neighbors surrounding the frames under consideration. For instance, the similarity value assigned to each frame pair might be a combination of the cost computed for the selected pair as well as the cost computed for the pairs of corresponding frames immediately preceding and immediately following the selected frame pair, where the cost associated with the selected pair is weighted more heavily than the neighboring pairs in the combination. In regard to the pendulum example, the neighboring frames both before and after the similar frames under consideration would be very dissimilar because the pendulum would be moving in opposite directions in these frames and so occupy different positions in the scene. Thus, the combined cost assigned to the pair would indicate a much lower similarity due to the dissimilar neighboring frame pairs. The net result is that the undesirable transitions would no longer have a low cost associated with them. Thus, choosing just those transitions associated with a lower cost would ensure the dynamics of the motion are preserved.
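The neighbor weighting can be sketched as a small weighted sum along the diagonal of a frame-to-frame cost matrix, where cost[i][j] is the dissimilarity between frame i and frame j. The three-tap weights (0.25, 0.5, 0.25) are an assumption chosen only so that the selected pair counts more than its neighbors, and the name filter_costs is hypothetical. Pairs too close to the ends of the clip to have a full set of neighbors are simply left at infinity here.

```python
import math

def filter_costs(cost, weights=(0.25, 0.5, 0.25)):
    """Combine each pair's cost with the costs of its corresponding neighbor pairs.

    out[i][j] = sum over k of weights[k] * cost[i + k][j + k], with k running
    from -m to +m, where m is half the (odd) number of weights.
    """
    n = len(cost)
    m = len(weights) // 2
    out = [[math.inf] * n for _ in range(n)]
    for i in range(m, n - m):
        for j in range(m, n - m):
            out[i][j] = sum(
                w * cost[i + k][j + k]
                for k, w in zip(range(-m, m + 1), weights)
            )
    return out
```

For a pendulum whose positions over five frames are 0, 1, 2, 1, 0, frames 1 and 3 look identical (cost 0), but their neighbors differ, so the filtered cost of the pair rises to 1.0 and the direction-reversing transition is no longer attractive.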
So far, the described process involves determining the costs of transition based on the comparison of a current frame in the sequence (via the following frame) with all other frames. Thus, the decision on how to continue the generated sequence is made without planning ahead on how to continue the sequence in the future. This works well with one exception. It must be remembered that the input video upon which the synthesized video is based has a finite length and so there is always a last frame. At some point in the synthesis of the new video, the last frame will be reached. However, unlike all the previous frames there is no "next frame". Accordingly, a jump must be made to some previous frame. But what if there are no previous frames that would continue the sequence smoothly enough that a viewer would not notice the jump? In such a case the process has run into a "dead end", where any available transition might be visually unacceptable.
It is possible to avoid the dead end issue by improving the foregoing process to recognize that a smoother transition might have been possible from an earlier frame. The process as described so far only takes into account the cost incurred by the present transition, and not those of any future transitions. However, if the cost associated with making a particular transition were modified to account for future costs incurred by that decision, no dead end would be reached. This is because the high cost associated with the transition at the dead end would be reflected in the cost of the transition which would ultimately lead to it. If the future costs associated with making a transition are great enough the transition would no longer be attractive and an alternate, less "costly" path would be taken. One way of accomplishing the task of accounting for the future transition costs is to sum the previously described cost values with a cost factor based on the total expected cost of the future sequence generated if a certain transition decision is made. To arrive at a stable expression of costs, the future costs would be discounted.
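One way to make the discounted future cost concrete is the fixed-point update sketched below, in which the cost of a transition i to j becomes its immediate cost plus a discounted share of the cheapest cost available from frame j onward. This particular update rule, the discount factor alpha and the fixed iteration count are assumptions made for illustration, not details stated above; the name future_costs is hypothetical.

```python
def future_costs(cost, alpha=0.9, iterations=200):
    """Fold discounted future costs into each transition cost.

    Repeats future[i][j] = cost[i][j] + alpha * min(future[j]) a fixed number
    of times. With 0 < alpha < 1 and finite costs the values settle
    geometrically, which is why the future term is discounted.
    """
    n = len(cost)
    future = [row[:] for row in cost]
    for _ in range(iterations):
        # Cheapest continuation available from each frame under current estimates.
        best_out = [min(row) for row in future]
        future = [
            [cost[i][j] + alpha * best_out[j] for j in range(n)]
            for i in range(n)
        ]
    return future
```

In a three-frame example where frame 2 is a dead end (every exit costs 10), a transition into frame 2 that looks free in the immediate costs ends up more expensive than a modest alternative once the future term is included.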
It is noted that the transition cost could also include a user specified cost factor that would help to minimize the transition costs between frames of the input video clip that depict motion sequences that the user wants in the generated video sequence. It is further noted that only a selected number of the frames of the input video need be included in the analysis. For example, the number of computations required to compute the cost factors could be minimized by eliminating some less useful frames in the input video from consideration. This would reduce the number of transition costs that have to be computed. Finally, it is noted that the synthesizing process, which will be discussed shortly, could be simplified if the transition costs could be limited to those that are more likely to produce acceptable transitions between frames of the newly generated video sequence. This could be accomplished by computing a coarse indication of the similarity of two frames first, and computing transition costs for only those frames that are similar enough to produce relatively low transition costs.
The foregoing analysis results in a cost being assigned to potential transitions between frames of the input video. During the synthesis of the desired new video sequence, the basic idea will be to choose only those transitions from frame to frame that are acceptable. Ideally, these acceptable transitions are those that will appear smooth to the viewer. However, even in cases where there is no choice that will produce an unnoticeable transition, it is still desirable to identify the best transitions possible. Certain techniques can be employed to smooth out these rough transitions as will be explained later.
In regard to the synthesis of a continuous, non-looping video sequence, a way of accomplishing the foregoing goals is to map the previously computed transition costs to probabilities through a monotonically decreasing function to characterize the costs via a probability distribution. The probability distribution is employed to identify the potentially acceptable transitions between frames of the input video clip. Prior to actually selecting the order of the frames of the input video that are to be played in a synthesizing process, the number of potentially acceptable transitions that there are to choose from can be pruned to eliminate those that are less desirable and to reduce the processing workload. One possible pruning procedure involves selecting only those transitions associated with local maxima in the probability matrix for a given source and/or destination frame as potentially acceptable transitions. Another pruning strategy involves setting all probabilities below a prescribed minimum probability threshold to zero. It is noted that these two strategies can also be combined by first selecting the transitions associated with the local probability maxima and then setting to zero the probabilities associated with any of the selected transitions that fall below the minimum probability threshold.
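The cost-to-probability mapping and the threshold pruning strategy can be sketched as follows. The exponential exp(-cost / sigma) is only one possible monotonically decreasing function and is an assumption here, as are the parameter names sigma and min_prob and the function name costs_to_probabilities. The local-maxima pruning strategy is not shown.

```python
import math

def costs_to_probabilities(cost, sigma=1.0, min_prob=0.0):
    """Map each row of transition costs to a probability distribution.

    Lower cost gives higher probability. Probabilities below min_prob are set
    to zero and the surviving entries of the row are renormalized.
    """
    probs = []
    for row in cost:
        weights = [math.exp(-c / sigma) for c in row]  # infinite cost gives weight 0.0
        total = sum(weights)
        row_p = [w / total if total > 0 else 0.0 for w in weights]
        row_p = [p if p >= min_prob else 0.0 for p in row_p]
        kept = sum(row_p)
        probs.append([p / kept if kept > 0 else 0.0 for p in row_p])
    return probs
```

A smaller sigma concentrates the probability on the very cheapest transitions, while a larger sigma admits more variety at the price of rougher jumps.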
Once the frames of the input video clip have been analyzed and a set of acceptable transitions identified, these transitions are used to synthesize the aforementioned continuous, non-looping video sequence. Essentially, synthesizing the video sequence entails specifying an order in which the frames of the input video clip are to be played. More particularly, synthesizing a continuous, non-looping video sequence involves first specifying a starting frame. The starting frame can be any frame of the input video sequence that comes before the frame of the sequence associated with the last non-zero-probability transition. The next frame is then chosen by selecting a frame previously identified as having a potentially acceptable transition between the immediately preceding frame (which in this first instance is the starting frame) and the remaining selected frames. If there is more than one qualifying frame, then one of them is selected at random, according to the previously computed probability distribution. This process is then repeated for as long as the video is running.
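The synthesis step amounts to a random walk over the frames, which might be sketched as below. It assumes probs[i][j] is the probability of showing frame j immediately after frame i, with pruned transitions set to zero. The length argument is a stand-in for "as long as the video is running", and the names synthesize, length and seed are hypothetical.

```python
import random

def synthesize(probs, start, length, seed=None):
    """Return a list of frame indices to play, beginning at the start frame.

    Each next frame is drawn at random according to the probability row of
    the current frame. The walk stops early if a frame has no acceptable
    outgoing transition.
    """
    rng = random.Random(seed)
    order = [start]
    current = start
    for _ in range(length - 1):
        row = probs[current]
        if sum(row) <= 0:
            break
        current = rng.choices(range(len(row)), weights=row, k=1)[0]
        order.append(current)
    return order
```

Because only the index order is produced, the walk can run well ahead of playback, and the renderer simply displays the original frames in that order.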
For occasions where it is desirable to produce a loopable video having a prescribed length, the synthesizing process is different from that associated with the continuous, non-looping embodiment. In the foregoing analysis process, a cost was assigned to each potential transition between the frames of the input video. These costs are used to synthesize a loopable, fixed length video sequence by first identifying acceptable primitive loops within the input video frames. These acceptable primitive loops are then used to construct compound loops having the desired fixed length. A primitive loop is a sub-sequence of the original video frames that terminates in a jump backwards to the first frame of the sub-sequence. Thus, a primitive loop is a sub-sequence of frames that would run to its last frame and then jump back to its beginning frame. The primitive loops become the basic building blocks for generating the loopable fixed length video sequences. To identify acceptable primitive loops, all the primitive loops that could be formed from the frames of the input video are identified. Once identified, the transition cost of each primitive loop is computed. In regard to computing these loop costs, the previously discussed future cost computations are not applied when creating the transition cost matrix. Further, in order to reduce the amount of processing required to identify the low cost video loops having the desired length, a transition pruning procedure can be implemented to reduce the number of primitive loops to be considered. Specifically, after pruning all transitions which are not local minima in the difference matrix, the average cost for each transition is computed, and only the best N transitions (and so primitive loops) are considered in the synthesis process.
Another method of reducing the number of primitive loops considered when building video loops would be to eliminate all the primitive loops whose average transition costs exceed a prescribed maximum threshold.
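Enumerating and pruning primitive loops can be sketched as follows. Here cost[end][start] is taken to be the cost of the backward jump from frame end to the earlier frame start, and that single jump cost is used directly as the loop's cost, which is a simplifying assumption standing in for the average cost mentioned above. Local-minima pruning is not shown, and the names primitive_loops, max_cost and best_n are hypothetical.

```python
def primitive_loops(cost, max_cost=None, best_n=None):
    """List primitive loops as (cost, start, end) tuples, cheapest first.

    A loop plays frames start..end and then jumps back from end to start,
    so it contributes end - start + 1 frames per pass.
    """
    n = len(cost)
    loops = [
        (cost[end][start], start, end)
        for end in range(n)
        for start in range(end)  # strictly backward jumps only
    ]
    if max_cost is not None:
        loops = [loop for loop in loops if loop[0] <= max_cost]
    loops.sort()
    return loops if best_n is None else loops[:best_n]
```

Either pruning option (a maximum cost threshold or keeping only the best N) shrinks the set of building blocks handed to the compound loop search.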
The acceptable primitive loops are combined to form the aforementioned compound loops. A compound loop is a loop made up of primitive loops having overlapping ranges. In other words, each subsequent primitive loop in the compound loop has a beginning sequence (of one or more frames) that is identical to the ending sequence of the preceding primitive loop. A compound loop having the desired length can thus be formed from primitive loops to generate a fixed length sequence. It is noted that a fixed length sequence is loopable, which means that it would end in a smooth transition from the last frame back to the first frame, so that it can be played continuously if desired.
A preferred method for finding a suitable set of primitive loops whose ranges overlap and which sum to the desired length of the compound loop begins with the use of a dynamic programming procedure. Essentially, this method involves creating a table that lists, for each primitive loop of interest and for each of a set of given loop lengths, the lowest cost compound loop of that length that contains at least one instance of that primitive loop. The table can be used to find the compound loop exhibiting the lowest total cost among those listed for a particular loop length. The total cost of a compound loop is simply the sum of the average costs associated with the primitive loops that form the compound loop. After finding the lowest cost compound loop using the dynamic programming method, the primitive loops making up the loop are then sequenced into a legally playable order.
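The table-building idea can be sketched roughly as below. The sketch assumes primitive loops are given as (cost, start, end) tuples covering frames start..end, that two ranges overlap when they share at least one frame, and that a compound loop's range is the union of its members' ranges. These representation choices and the name best_compound_loop are assumptions, and the final sequencing of the chosen loops into a playable order is not shown.

```python
def best_compound_loop(loops, target_length):
    """Return (total_cost, sorted member indices) for the cheapest compound loop.

    table[L][p] holds the cheapest compound loop of length L that contains
    primitive loop p, stored as (total_cost, lo, hi, members), or None.
    Returns None if no compound loop of the target length exists.
    """
    lengths = [end - start + 1 for _, start, end in loops]
    table = [[None] * len(loops) for _ in range(target_length + 1)]
    for L in range(1, target_length + 1):
        for p, (cost, start, end) in enumerate(loops):
            candidates = []
            if lengths[p] == L:
                candidates.append((cost, start, end, (p,)))
            rest = L - lengths[p]
            if rest > 0:
                # Extend any shorter compound loop whose range overlaps loop p.
                for entry in table[rest]:
                    if entry is None:
                        continue
                    total, lo, hi, members = entry
                    if start <= hi and lo <= end:
                        candidates.append(
                            (total + cost, min(lo, start), max(hi, end), members + (p,))
                        )
            if candidates:
                table[L][p] = min(candidates)
    finished = [entry for entry in table[target_length] if entry is not None]
    if not finished:
        return None
    total, _, _, members = min(finished)
    return total, sorted(members)
```

For example, with loops of lengths 2, 3 and 2 where only the first two overlap, a target length of 5 is met by combining the first two loops, and the third loop is never paired with them because its range is disjoint.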
The next phase in the generation of a new video sequence from the frames of the input video clip involves rendering the synthesized video. In regards to the continuous, non-looping video sequence, the new video is rendered by playing the frames of the input video clip in the order specified in the synthesizing process. As the generated video is continuous, the synthesizing process can be on-going with the rendering process. This is possible because the synthesizing process can specify frames to be played faster than they can be played in the rendering process. In regard to the loopable, fixed length sequence embodiment, the primitive loops making up the compound loop defining the fixed-length video and their order were identified in the sequencing procedure described previously. Thus, the rendering of a loopable fixed length video sequence simply involves playing the input video frames in the order indicated in the synthesizing process. This can also include repeating the sequence as many times as desired since the last frame of the synthesized video sequence is designed to acceptably transition back to the first frame.
Although the foregoing process is tailored to identify low cost transitions, and so introduce only small, ideally unnoticeable, discontinuities in the motion, as indicated previously there may be cases where such transitions are not available in the frames of the input video clip. In cases where transitions having costs that will produce noticeable jumps in the synthesized video must be employed, techniques can be applied in the rendering process to disguise the transition discontinuities and make them less noticeable to the viewer. One of the smoothing techniques that could be employed is a conventional blending procedure. This would entail blending the images of the sequence before and after the transition to produce a smoother transition. Preferably, the second sequence would be gradually blended into the first, while both sequences are running using a crossfading procedure. Another smoothing technique that could be employed would be to warp the images towards each other. This technique would prevent the ghosting associated with the crossfade procedure as common features of the images are aligned.
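The crossfading procedure can be sketched as a linear blend between the frames that follow the jump source in the input video and the frames that begin at the jump destination. Frames are assumed to be flat lists of pixel intensities, the linear ramp is an assumption (any smooth weighting would serve), and the name crossfade is hypothetical. The warping alternative is not shown.

```python
def crossfade(outgoing, incoming):
    """Blend two equally long frame sequences into one transition sequence.

    outgoing holds the frames of the sequence being left and incoming the
    frames of the sequence being entered. The weight of the incoming
    sequence rises linearly across the blend, starting above 0 and ending
    below 1 so that every blended frame mixes both sequences.
    """
    steps = len(outgoing)
    blended = []
    for k, (a, b) in enumerate(zip(outgoing, incoming)):
        t = (k + 1) / (steps + 1)
        blended.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    return blended
```

Because both sequences keep running while they are mixed, features that do not line up appear doubled during the blend, which is the ghosting that warping the images towards each other is meant to avoid.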
Of particular interest for the present invention is the extension of the basic video textures concept involving the generation of video sprites. While the foregoing description described the analysis of the frames of the input video clip as a single unit, this need not be the case. Rather, the frames of the input video clip could be advantageously segmented prior to analysis where the video includes an object that is of interest, but where the rest of the scene is not. The object of interest could be extracted from each frame and a new video sequence of just the object generated using the previously described processes. A video generated in this way is a video sprite. It is these video sprites that are used as described previously to produce 3D Video Textures.