1. Technical Field
The invention relates to video techniques, and more particularly to a system and process for generating a video animation from the frames of a video sprite.
2. Background Art
A picture is worth a thousand words. And yet there are many phenomena, both natural and man-made, that are not adequately captured by a single static photo. A waterfall, a flickering flame, a swinging pendulum, a flag flapping in the breeze: each of these phenomena has an inherently dynamic quality that a single image simply cannot portray.
The obvious alternative to static photography is video. But video has its own drawbacks. For example, if it is desired to store video on a computer or some other storage device, it is necessary to use a video clip of finite duration. Hence, the video has a beginning, a middle, and an end. Thus, the video becomes a very specific embodiment of a very specific sequence in time. Although it captures the time-varying behavior of the phenomenon at hand, it lacks the "timeless" quality of the photograph. A much better alternative would be to use the computer to generate new video sequences based on the input video clip.
There are current computer graphics methods employing image-based modeling and rendering techniques, where images captured from a scene or object are used as an integral part of the rendering process. To date, however, image-based rendering techniques have mostly been applied to still scenes such as architecture. These existing methods lack the ability to generate new video from images of the scene as would be needed to realize the aforementioned dynamic quality missing from single images.
The ability to generate a new video sequence from a finite video clip parallels somewhat an effort that occurred in music synthesis a decade ago, when sample-based synthesis replaced more algorithmic approaches like frequency modulation. However, to date such techniques have not been applied to video. It is a purpose of the present invention to fill this void with a technique that has been dubbed "video-based rendering".
It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references will be identified by a pair of brackets containing more than one designator, for example, [1, 2]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention is related to a new type of medium, which is in many ways intermediate between a photograph and a video. This new medium, which is referred to as a video texture, can provide a continuous, infinitely varying stream of video images. The video texture is synthesized from a finite set of images by rearranging (and possibly blending) original frames from a source video. While individual frames of a video texture may be repeated from time to time, the video sequence as a whole should never be repeated exactly. Like a photograph, a video texture has no beginning, middle, or end. But like a video, it portrays motion explicitly. Video textures therefore occupy an interesting niche between the static and the dynamic realm. Whenever a photo is displayed on a computer screen, a video texture might be used instead to infuse the image with dynamic qualities. For example, a web page advertising a scenic destination could use a video texture of palm trees blowing in the wind rather than a static photograph. Or an actor could provide a dynamic "head shot" with continuous movement on his home page. Video textures could also find application as dynamic backdrops for scenes composited from live and synthetic elements.
Further, the basic concept of a video texture can be extended in several different ways to further increase its applicability. For backward compatibility with existing video players and web browsers, finite duration video loops can be created to play back without any visible discontinuities. The original video can be split into independently moving regions and each region can be analyzed and rendered independently. It is also possible to use computer vision techniques to separate objects from the background and represent them as video sprites, which can be rendered in arbitrary image locations. Multiple video sprites or video texture regions can be combined into a complex scene. It is also possible to put video textures under interactive control, driving them at a high level in real time. For instance, by judiciously choosing the transitions between frames of a source video, a jogger can be made to speed up and slow down according to the position of an interactive slider. Or an existing video clip can be shortened or lengthened by removing or adding to some of the video texture in the middle.
The basic concept of the video textures and the foregoing extensions are the subject of the above-identified parent patent application entitled "Video-Based Rendering". However, the concept of video textures can be extended even further. For example, another application of the video sprite concept involves objects that move about the scene in the input video clip, such as an animal, a vehicle, or a person. These objects typically exhibit a generally repetitive motion, independent of their position. Thus, the object could be extracted from the frames of the input video and processed to generate a new video sequence or video sprite of that object. This video sprite would depict the object as moving in place. Further, the frames of the video sprite could be inserted into a previously derived background image (or frames of a background video) at a location dictated by a prescribed path of the object in the scene. In this regard, a user of the system could be allowed to specify the path of the object, or alternately cause a path to be generated and input into the system. It is this extension of the basic video textures concept that the present invention is directed toward.
Before describing the particular embodiments of the present invention, it is useful to understand the basic concepts associated with video textures. The naive approach to the problem of generating video would be to take the input video and loop it, restarting it whenever it has reached the end. Unfortunately, since the beginning and the end of the sequence almost never match, a visible motion discontinuity occurs. A simple way to avoid this problem is to search for a frame in the sequence that is similar to the last frame and to loop back to this similar frame to create a repeating single loop video. For certain continually repeating motions, like a swinging pendulum, this approach might be satisfactory. However, for other scenes containing more random motion, the viewer may be able to detect that the motion is being repeated over and over. Accordingly, it would be desirable to generate more variety than just a single loop.
The desired variety can be achieved by producing a more random rearrangement of the frames taken from the input video so that the motion in the scene does not repeat itself over and over in a single loop. Essentially, the video sequence can be thought of as a network of frames linked by transitions. The goal is to find good places to jump from one sequence of frames to another so that the motion appears as smooth as possible to the viewer. One way to accomplish this task is to compute the similarity between each pair of frames of the input video. Preferably, these similarities are characterized by costs that are indicative of how smooth the transition from one frame to another would appear to a person viewing a video containing the frames played in sequence. Further, the cost of transitioning from a particular frame to some other frame is computed using the similarity between that other frame and the frame that follows the frame under consideration in the input video. In other words, rather than jumping to a frame that is similar to the current frame under consideration, which would result in a static segment, a jump would be made from the frame under consideration to a frame that is similar to the frame that follows the current frame in the input video. In this way, some of the original dynamics of the input video are maintained.
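As one illustrative sketch (not a required implementation of the invention), the cost computation just described might look as follows, where the L2 distance between grayscale images is an assumed similarity measure:

```python
import numpy as np

def transition_costs(frames):
    """Compute a cost matrix D where D[i, j] is the cost of jumping
    from frame i to frame j.  Following the scheme described above,
    the cost is the distance between frame i+1 (the natural successor
    of frame i in the input video) and frame j, so a low cost means
    frame j can stand in for that successor.

    `frames` is an array of shape (N, H, W) of grayscale images.  The
    RMS pixel difference used here is an illustrative choice only."""
    n = len(frames)
    cost = np.full((n, n), np.inf)   # last frame has no successor, so its row stays infinite
    for i in range(n - 1):
        for j in range(n):
            diff = frames[i + 1].astype(float) - frames[j].astype(float)
            cost[i, j] = np.sqrt(np.mean(diff ** 2))
    return cost
```

Note that by construction the cost of "continuing normally", i.e. jumping from frame i to frame i+1, is zero.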
While the foregoing basic approach can produce acceptably "smooth" video for scenes with relatively random motions, such as a candle flame, scenes having more structured, repetitive motions may be problematic. The issue lies in the fact that at the frame level the position of an object moving in a scene in one direction might look very similar to the position of the object moving in the exact opposite direction. For example, consider a swinging pendulum. The images of the pendulum swinging from left to right look very similar to those when the pendulum is swinging from right to left. If a transition is made from a frame depicting the pendulum during its motion from left to right to one depicting the pendulum during its motion from right to left, the resulting video sequence may show the pendulum switching direction in mid-swing. Thus, the transition would not preserve the dynamics of the swinging pendulum.
The previously described process can be improved to avoid this problem and ensure the further preservation of the dynamics of the motion by considering not just the current frame but its neighboring frames as well. For example, for a frame in the sequence to be classified as similar to some other frame, not only the frames themselves but also their neighbors should be similar to each other. One way of accomplishing this is to modify the aforementioned computed costs between each pair of frames by adding in a portion of the cost of transitioning between corresponding neighbors surrounding the frames under consideration. For instance, the similarity value assigned to each frame pair might be a combination of the cost computed for the selected pair as well as the costs computed for the pairs of corresponding frames immediately preceding and immediately following the selected frame pair, where the cost associated with the selected pair is weighted more heavily than the neighboring pairs in the combination. In regard to the pendulum example, the neighboring frames both before and after the similar frames under consideration would be very dissimilar because the pendulum would be moving in opposite directions in these frames and so occupy different positions in the scene. Thus, the combined cost assigned to the pair would indicate a much lower similarity due to the dissimilar neighboring frame pairs. The net result is that the undesirable transitions would no longer have a low cost associated with them. Thus, choosing just those transitions associated with a lower cost would ensure the dynamics of the motion are preserved.
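The neighbor-weighted combination described above can be sketched as a diagonal filtering of the cost matrix. The window size and the binomial weights below are illustrative assumptions; the specification only requires that the center pair be weighted more heavily than its neighbors:

```python
import numpy as np

def filter_costs(cost, weights=(1, 4, 6, 4, 1)):
    """Blend each transition cost with the costs of the neighbouring
    frame pairs: D'[i, j] is a weighted sum of D[i+k, j+k] over a
    small window centred on (i, j), with the centre pair weighted
    most heavily.  Entries whose window would fall outside the matrix
    are left at infinity (i.e. disallowed) in this sketch."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    r = len(w) // 2
    n = cost.shape[0]
    filtered = np.full_like(cost, np.inf)
    for i in range(r, n - r):
        for j in range(r, n - r):
            filtered[i, j] = sum(
                w[k + r] * cost[i + k, j + k] for k in range(-r, r + 1))
    return filtered
```

In the pendulum example, a pair of frames that happens to match while the pendulum moves in opposite directions will have very dissimilar neighbors, so the filtered cost for that pair becomes large and the transition is effectively suppressed.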
So far, the described process involves determining the costs of transition based on the comparison of a current frame in the sequence (via the following frame) with all other frames. Thus, the decision on how to continue the generated sequence is made without planning ahead on how to continue the sequence in the future. This works well with one exception. It must be remembered that the input video upon which the synthesized video is based has a finite length and so there is always a last frame. At some point in the synthesis of the new video, the last frame will be reached. However, unlike all the previous frames there is no "next frame". Accordingly, a jump must be made to some previous frame. But what if there are no previous frames that would continue the sequence smoothly enough that a viewer would not notice the jump? In such a case the process has run into a "dead end", where any available transition might be visually unacceptable.
It is possible to avoid the dead end issue by improving the foregoing process to recognize that a smoother transition might have been possible from an earlier frame. The process as described so far only takes into account the cost incurred by the present transition, and not those of any future transitions. However, if the cost associated with making a particular transition were modified to account for future costs incurred by that decision, no dead end would be reached. This is because the high cost associated with the transition at the dead end would be reflected in the cost of the transition which would ultimately lead to it. If the future costs associated with making a transition are great enough the transition would no longer be attractive and an alternate, less "costly" path would be taken. One way of accomplishing the task of accounting for the future transition costs is to sum the previously described cost values with a cost factor based on the total expected cost of the future sequence generated if a certain transition decision is made. To arrive at a stable expression of costs, the future costs would be discounted.
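The discounted future cost described above can be computed by a fixed-point iteration over the cost matrix. The sketch below assumes a finite cost matrix and an illustrative discount factor; it repeatedly adds to each transition cost the discounted cost of the best continuation from the destination frame:

```python
import numpy as np

def add_future_costs(cost, alpha=0.99, tol=1e-6):
    """Fold anticipated future costs into the transition costs by
    iterating
        D''[i, j] = D[i, j] + alpha * min_k D''[j, k]
    until convergence.  The discount factor alpha < 1 keeps the sums
    finite and de-emphasises costs incurred far in the future; 0.99
    is an illustrative choice.  Assumes `cost` is finite."""
    dpp = cost.copy()
    while True:
        m = dpp.min(axis=1)                       # best continuation cost from each frame
        new = cost + alpha * m[np.newaxis, :]
        if np.max(np.abs(new - dpp)) < tol:
            return new
        dpp = new
```

Under this scheme a transition that eventually leads to a dead end inherits a share of the dead end's high cost, so the synthesis is steered away from it well in advance.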
The foregoing analysis results in a cost being assigned to potential transitions between frames of the input video. During the synthesis of the desired new video sequence, the basic idea will be to choose only those transitions from frame to frame that are acceptable. Ideally, these acceptable transitions are those that will appear smooth to the viewer. However, even in cases where there is no choice that will produce an unnoticeable transition, it is still desirable to identify the best transitions possible. Certain techniques can be employed to smooth out these rough transitions as will be explained later.
In regard to the synthesis of a continuous, non-looping video sequence, a way of accomplishing the foregoing goals is to map the previously computed transition costs to probabilities through a monotonically decreasing function to characterize the costs via a probability distribution. The probability distribution is employed to identify the potentially acceptable transitions between frames of the input video clip. Prior to actually selecting the order of the frames of the input video that are to be played in a synthesizing process, the number of potentially acceptable transitions that there are to choose from can be pruned to eliminate those that are less desirable and to reduce the processing workload. One possible pruning procedure involves selecting only those transitions associated with local maxima in the probability matrix for a given source and/or destination frame as potentially acceptable transitions. Another pruning strategy involves setting to zero all probabilities below a prescribed minimum probability threshold. It is noted that these two strategies can also be combined by first selecting the transitions associated with the local probability maxima and then setting the probabilities associated with any of the selected transitions that fall below the minimum probability threshold to zero.
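The cost-to-probability mapping and the threshold-pruning strategy described above might be sketched as follows (the local-maxima pruning strategy is omitted for brevity). The exponential mapping and the parameter values are illustrative choices of a monotonically decreasing function, not mandated by the specification:

```python
import numpy as np

def costs_to_probabilities(cost, sigma=None, min_prob=1e-3):
    """Map transition costs to probabilities with a monotonically
    decreasing function, P ~ exp(-D / sigma), then prune by setting
    to zero all probabilities below a minimum threshold and
    renormalising each row so it sums to one."""
    if sigma is None:
        # scale relative to the average finite cost; a common heuristic
        sigma = np.mean(cost[np.isfinite(cost)])
    prob = np.exp(-cost / sigma)                  # infinite costs map to probability 0
    prob[prob < min_prob] = 0.0                   # threshold pruning
    row_sums = prob.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                 # guard rows with no surviving transition
    return prob / row_sums
```

Row i of the resulting matrix is then the distribution from which the successor of frame i is drawn during synthesis.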
Once the frames of the input video clip have been analyzed and a set of acceptable transitions identified, these transitions are used to synthesize the aforementioned continuous, non-looping video sequence. Essentially, synthesizing the video sequence entails specifying an order in which the frames of the input video clip are to be played. More particularly, synthesizing a continuous, non-looping video sequence involves first specifying a starting frame. The starting frame can be any frame of the input video sequence that comes before the frame of the sequence associated with the last non-zero-probability transition. The next frame is then chosen by selecting a frame previously identified as having a potentially acceptable transition from the immediately preceding frame (which in this first instance is the starting frame). If there is more than one qualifying frame, then one of them is selected at random, according to the previously computed probability distribution. This process is then repeated for as long as the video is running.
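The synthesis loop just described reduces to repeated random sampling from the transition-probability matrix. A minimal sketch, assuming the pruned, row-normalized probability matrix computed earlier:

```python
import numpy as np

def synthesize_order(prob, start, length, rng=None):
    """Generate a frame playback order of the given length: starting
    from `start`, repeatedly choose the next frame at random according
    to the current frame's row of the transition-probability matrix.
    Assumes every reachable row of `prob` has at least one non-zero
    entry (guaranteed by the earlier pruning step in this sketch)."""
    rng = rng or np.random.default_rng(0)
    order = [start]
    current = start
    for _ in range(length - 1):
        current = rng.choice(len(prob), p=prob[current])
        order.append(current)
    return order
```

Because each step needs only a table lookup and one random draw, frame orders can be generated far faster than frames are displayed, which is what allows synthesis to proceed concurrently with rendering.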
The next phase in the generation of a new video sequence from the frames of the input video clip involves rendering the synthesized video. In regard to the continuous, non-looping video sequence, the new video is rendered by playing the frames of the input video clip in the order specified in the synthesizing process. As the generated video is continuous, the synthesizing process can run concurrently with the rendering process. This is possible because the synthesizing process can specify frames to be played faster than they can be played in the rendering process.
Although the foregoing process is tailored to identify low cost transitions, and so introduce only small, ideally unnoticeable, discontinuities in the motion, as indicated previously there may be cases where such transitions are not available in the frames of the input video clip. In cases where transitions having costs that will produce noticeable jumps in the synthesized video must be employed, techniques can be applied in the rendering process to disguise the transition discontinuities and make them less noticeable to the viewer. One of the smoothing techniques that could be employed is a conventional blending procedure. This would entail blending the images of the sequence before and after the transition to produce a smoother transition. Preferably, the second sequence would be gradually blended into the first, while both sequences are running using a crossfading procedure. Another smoothing technique that could be employed would be to warp the images towards each other. This technique would prevent the ghosting associated with the crossfade procedure as common features of the images are aligned.
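The crossfading procedure mentioned above can be sketched as a per-frame linear blend of the outgoing and incoming sequences across the transition. Linear weights are an illustrative choice; the warping variant, which aligns common features before blending, is not shown:

```python
import numpy as np

def crossfade(outgoing, incoming):
    """Blend two equally long frame sequences across a transition:
    the incoming sequence is faded in while the outgoing sequence
    fades out, softening an otherwise visible jump.  Both inputs are
    arrays of shape (K, H, W) covering the K frames around the
    transition."""
    k = len(outgoing)
    alphas = np.linspace(0.0, 1.0, k)   # 0 -> all outgoing, 1 -> all incoming
    return np.array([(1 - a) * f0 + a * f1
                     for a, f0, f1 in zip(alphas, outgoing, incoming)])
```

Because both sequences keep running during the blend, moving objects that are not perfectly aligned can produce ghosting, which is the artifact the warping technique described above is intended to prevent.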
While the foregoing description involves analyzing the frames of the input video clip as a single unit, this need not be the case. For example, the frames of the input video clip could be advantageously segmented prior to analysis where the video includes an object that is of interest, but where the rest of the scene is not. The object of interest could be extracted from each frame and a new video sequence of just the object generated using the previously-described processes. It is noted that a video generated in this way is referred to as a video sprite. One use for a video sprite is to insert it into an existing video. This would be accomplished by inserting the frames of the video sprite into the frames of the existing video in corresponding order. The frames of the video sprite would be inserted into the same location of each frame of the existing video. The result would be a new video that includes the object associated with the video sprite.
As mentioned previously, an object could be extracted from the frames of the input video and processed in accordance with the present invention to generate a new video sequence or video sprite of that object. In addition, the translation velocity of the object for each frame would be computed and associated with each frame of the video sprite. The portion of previously-described analysis involving computing a transition cost between the frames of the input video clip could be modified to add a cost factor based on the difference in velocity of the object between the frames involved. This would tend to influence the selection of acceptable transitions to ensure a smooth translation motion is imparted to the rendered video. The rendering process itself would also be modified to include an additional procedure for inserting the extracted regions depicting the object (i.e. the frames of the video sprite) into a previously derived background image, or a frame of a background video, in the order specified by the synthesis procedure. Each video sprite frame is inserted at a location dictated by a prescribed path of the object in the scene and the velocity associated with the object in the selected video sprite frame. This can be done by making the centroid of the inserted extracted region correspond with a desired path point. Thus, the generated video, which is referred to as a video animation, would show the object moving naturally about the scene along the prescribed path. This path could mimic that of the object in the input video clip, or it could be prescribed by a user.
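The insertion step described above, making the centroid of the extracted region correspond with a desired path point, might be sketched as follows. The binary-mask representation of the extracted object region is an assumption of this sketch:

```python
import numpy as np

def insert_sprite(background, sprite, mask, path_point):
    """Paste a video-sprite frame into a background frame so that the
    centroid of the sprite's masked (object) region lands on the given
    path point, per the insertion scheme described above.  `sprite`
    and `mask` have the same shape; pixels falling outside the
    background are clipped.  A minimal sketch of the insertion step
    only (no blending at the object boundary)."""
    out = background.copy()
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()            # centroid of the object region
    oy = int(round(path_point[0] - cy))      # offset aligning centroid with path point
    ox = int(round(path_point[1] - cx))
    h, w = background.shape
    for y, x in zip(ys, xs):
        ty, tx = y + oy, x + ox
        if 0 <= ty < h and 0 <= tx < w:
            out[ty, tx] = sprite[y, x]
    return out
```

Advancing `path_point` from frame to frame by the velocity associated with each selected sprite frame produces the natural translational motion along the prescribed path.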
With regard to the option of a user prescribing the path, one embodiment of the present video based rendering system and process would involve the user specifying successive points along a desired path through a background scene. For example, the user could select points in a background image, or a frame of a background video, displayed on a computer monitor. This could be done on a point-by-point basis, or the user could move a cursor along a desired path that the object of interest is to take in the new video animation. In the latter case, points along the traced path would preferably be recorded and used to define the path. Frames of the video sprite showing the object of interest would be selected and inserted in a background image or frame along the user-specified path. As with the previous embodiment, the velocity of the object in the selected frames would be taken into consideration.
More specifically, the so-called user-controlled movement embodiment involving a user-specified path can be implemented as follows. First, a video sprite of the object to be featured in the video animation is input into the system, along with a user-specified path. Next, one of the frames of the video sprite is selected as the first frame, and inserted into a frame of an existing video sequence at a point on the user-specified path, to produce the first frame of the video animation. The existing video sequence can simply be multiple copies of the same background image, or the frames of a background video which changes over time. The previously-selected frame of the video sprite is then compared to the other video sprite frames to identify potentially acceptable transitions between the selected frame and the other frames, and a video sprite frame that was identified as corresponding to an acceptable transition from the last-selected frame is selected. This frame is designated as the currently selected video sprite frame in lieu of the last-selected frame. The new currently-selected frame is then inserted into the next consecutive frame of the aforementioned existing video sequence at a point along the user-specified path dictated by the velocity associated with the object in the last-inserted frame. The result of the insertion action is the creation of the next frame of the animated video. The currently-selected frame of the video sprite is next compared to the other video sprite frames to identify potentially acceptable transitions between it and the other frames, just as was done with the first video sprite frame. The foregoing process of selecting, inserting and comparing video sprite frames to create successive frames of the video animation continues for as long as it is desired to produce new frames of the video animation.
The above-described process actions involving comparing a selected video sprite frame with all the other video sprite frames to identify acceptable transitions therebetween is preferably accomplished as follows. First, the translation velocity associated with the object for each of the frames of the video sprite is computed. These velocities are used to compute a velocity cost indicative of the difference in the object's velocity between the currently selected frame (which may be the first frame) and each of the other video sprite frames. In addition, an image similarity cost associated with transitioning from the selected frame to each of the other frames is computed. Next, an error cost related to the user-specified path is computed between the selected video sprite frame and each of the other frames. This error cost is a function of the distance between the next recorded point in the user-specified path and the current position of the object in the path, as well as the velocity of the particular "other" frame under consideration. An anticipated future transition cost representative of the transition costs that would be incurred if the transition between the selected video sprite frame and each of the other frames were implemented is also respectively computed for each of the other frames. The velocity cost, image similarity cost, error cost and future transition cost are added together to produce a directed future cost for the transition between the selected video sprite frame and each of the other video sprite frames. These directed future costs are then mapped to probability values using a monotonically decreasing function to produce a probability distribution for the costs. And finally, those video sprite frames that are associated with a transition having a probability maximum between the selected video sprite frame and the other video sprite frames are designated as corresponding to an acceptable transition.
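The combination of the four cost terms into a directed future cost, and its mapping to a probability distribution, can be sketched as below. The equal weighting of the terms and the exponential mapping are illustrative assumptions; computation of the individual terms is covered by the earlier sketches:

```python
import numpy as np

def directed_future_cost(velocity_cost, similarity_cost,
                         path_error_cost, future_cost, sigma=1.0):
    """Add the velocity cost, image similarity cost, path error cost
    and anticipated future transition cost (one value per candidate
    sprite frame) into a single directed future cost, then map the
    costs to a probability distribution with a monotonically
    decreasing function (an exponential, as one illustrative choice)."""
    total = (np.asarray(velocity_cost) + np.asarray(similarity_cost)
             + np.asarray(path_error_cost) + np.asarray(future_cost))
    prob = np.exp(-total / sigma)
    return total, prob / prob.sum()
```

Candidate frames whose combined cost is lowest receive the highest probability, and in the described embodiment the frames at local probability maxima are the ones designated as acceptable transitions.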
It is noted that the foregoing user-controlled motion video rendering system and process can also be implemented without the path being directly specified by a user. Rather, the path could be generated in other ways and input into the present system.