1. Technical Field
The invention is related to video techniques, and more particularly to a system and process for generating a new video sequence from the frames of a finite-length video clip.
2. Background Art
A picture is worth a thousand words. And yet there are many phenomena, both natural and man-made, that are not adequately captured by a single static photo. A waterfall, a flickering flame, a swinging pendulum, a flag flapping in the breeze: each of these phenomena has an inherently dynamic quality that a single image simply cannot portray.
The obvious alternative to static photography is video. But video has its own drawbacks. For example, if it is desired to store video on a computer or some other storage device, it is necessary to use a video clip of finite duration. Hence, the video has a beginning, a middle, and an end. Thus, the video becomes a very specific embodiment of a very specific sequence in time. Although it captures the time-varying behavior of the phenomenon at hand, it lacks the "timeless" quality of the photograph. A much better alternative would be to use the computer to generate new video sequences based on the input video clip.
There are current computer graphics methods employing image-based modeling and rendering techniques, where images captured from a scene or object are used as an integral part of the rendering process. To date, however, image-based rendering techniques have mostly been applied to still scenes such as architecture. These existing methods lack the ability to generate new video from images of the scene as would be needed to realize the aforementioned dynamic quality missing from single images.
The ability to generate a new video sequence from a finite video clip parallels somewhat an effort that occurred in music synthesis a decade ago, when sample-based synthesis replaced more algorithmic approaches like frequency modulation. However, to date such techniques have not been applied to video. It is a purpose of the present invention to fill this void with a technique that has been dubbed "video-based rendering".
It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references will be identified by a pair of brackets containing more than one designator, for example, [1, 2]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention involves a new type of medium, which is in many ways intermediate between a photograph and a video. This new medium, which is referred to as a video texture, can provide a continuous, infinitely varying stream of video images. The video texture is synthesized from a finite set of images by rearranging (and possibly blending) original frames from a source video. While individual frames of a video texture may be repeated from time to time, the video sequence as a whole should never be repeated exactly. Like a photograph, a video texture has no beginning, middle, or end. But like a video, it portrays motion explicitly.
Video textures therefore occupy an interesting niche between the static and the dynamic realm. Whenever a photo is displayed on a computer screen, a video texture might be used instead to infuse the image with dynamic qualities. For example, a web page advertising a scenic destination could use a video texture of palm trees blowing in the wind rather than a static photograph. Or an actor could provide a dynamic "head shot" with continuous movement on his home page. Video textures could also find application as dynamic backdrops for scenes composited from live and synthetic elements.
The basic concept of a video texture can be extended in several different ways to further increase its applicability. For backward compatibility with existing video players and web browsers, finite duration video loops can be created to play back without any visible discontinuities. The original video can be split into independently moving regions and each region can be analyzed and rendered independently. It is also possible to use computer vision techniques to separate objects from the background and represent them as video sprites, which can be rendered in arbitrary image locations. Multiple video sprites or video texture regions can be combined into a complex scene.
It would also be possible to put video textures under interactive control, that is, to drive them at a high level in real time. For instance, by judiciously choosing the transitions between frames of a source video, a jogger can be made to speed up and slow down according to the position of an interactive slider. Or an existing video clip can be shortened or lengthened by removing or adding to some of the video texture in the middle.
Creating video textures and applying them in all of the foregoing ways requires solving a number of problems. The first difficulty is in locating potential transition points in the video sequences, i.e., places where the video can be looped back on itself in a minimally obtrusive way. A second challenge is in finding a sequence of transitions that respects the global structure of the video. Even though a given transition may, itself, have minimal artifacts, it could lead to a portion of the video from which there is no graceful exit, and therefore be a poor transition to take. A third challenge is in smoothing visual discontinuities at the transitions using morphing techniques. A fourth problem is in factoring video frames into different regions that can be analyzed and synthesized independently. Furthermore, various extensions involve additional challenges: the creation of good, fixed-length cycles; separating video texture elements from their backgrounds so that they can be used as video sprites; applying view morphing to video imagery; and generalizing the transition metrics to incorporate real-time user input.
The naïve approach to the problem of generating video would be to take the input video and loop it, restarting it whenever it has reached the end. Unfortunately since the beginning and the end of the sequence almost never match, a visible motion discontinuity occurs. A simple way to avoid this problem is to search for a frame in the sequence that is similar to the last frame and to loop back to this similar frame to create a repeating single loop video. For certain continually repeating motions, like a swinging pendulum, this approach might be satisfactory. However, for other scenes containing more random motion, the viewer may be able to detect that the motion is being repeated over and over. Accordingly, it would be desirable to generate more variety than just a single loop.
The desired variety can be achieved by producing a more random rearrangement of the frames taken from the input video so that the motion in the scene does not repeat itself over and over in a single loop. Essentially, the video sequence can be thought of as a network of frames linked by transitions. The goal is to find good places to jump from one sequence of frames to another so that the motion appears as smooth as possible to the viewer. One way to accomplish this task is to compute the similarity between each pair of frames of the input video. Preferably, these similarities are characterized by costs that are indicative of how smooth the transition from one frame to another would appear to a person viewing a video containing the frames played in sequence. Further, the cost of transitioning from a particular frame to another frame is computed using the similarity between that other frame and the frame that follows the frame under consideration in the input video. In other words, rather than jumping to a frame that is similar to the current frame under consideration, which would result in a static segment, a jump would be made from the frame under consideration to a frame that is similar to the frame that follows the current frame in the input video. In this way, some of the original dynamics of the input video are maintained.
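By way of illustration only, the cost computation just described can be sketched as follows. The sketch assumes that each frame is available as a flat list of pixel values and that similarity is measured with a Euclidean (L2) distance; both the distance measure and the function names are assumptions made for the example and are not prescribed by the foregoing description.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two frames given as flat lists of pixel values
    (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(frames):
    """D[i][j] is the distance between frame i and frame j."""
    n = len(frames)
    return [[frame_distance(frames[i], frames[j]) for j in range(n)]
            for i in range(n)]

def transition_costs(D):
    """C[i][j] is the cost of jumping from frame i to frame j.

    It is taken from D[i + 1][j]: frame j is compared with the frame that
    follows frame i in the input video. The last frame has no successor,
    so it gets no row.
    """
    n = len(D)
    return [[D[i + 1][j] for j in range(n)] for i in range(n - 1)]

# Tiny hypothetical clip: frame 3 looks exactly like frame 1.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
D = distance_matrix(frames)
C = transition_costs(D)
```

In this toy clip, jumping from frame 2 to frame 1 costs nothing because frame 1 is identical to frame 3, the frame that would naturally follow frame 2.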
While the foregoing basic approach can produce acceptably "smooth" video for scenes with relatively random motions, such as a candle flame, scenes having more structured, repetitive motions may be problematic. The issue lies in the fact that at the frame level the position of an object moving in a scene in one direction might look very similar to the position of the object moving in the exact opposite direction. For example, consider a swinging pendulum. The images of the pendulum swinging from left to right look very similar to those when the pendulum is swinging from right to left. If a transition is made from a frame depicting the pendulum during its motion from left to right to one depicting the pendulum during its motion from right to left, the resulting video sequence may show the pendulum switching direction in mid-swing. Thus, the transition would not preserve the dynamics of the swinging pendulum.
The previously described process can be improved to avoid this problem and ensure the further preservation of the dynamics of the motion by considering not just the current frame but its neighboring frames as well. For example, a frame in the sequence could be classified as similar to some other frame only if, in addition to the frames themselves, their neighbors are also similar to each other. One way of accomplishing this is to modify the aforementioned computed costs between each pair of frames by adding in a portion of the cost of transitioning between corresponding neighbors surrounding the frames under consideration. For instance, the similarity value assigned to each frame pair might be a combination of the cost computed for the selected pair as well as the cost computed for the pairs of corresponding frames immediately preceding and immediately following the selected frame pair, where the cost associated with the selected pair is weighted more heavily than the neighboring pairs in the combination. In regard to the pendulum example, the neighboring frames both before and after the similar frames under consideration would be very dissimilar because the pendulum would be moving in opposite directions in these frames and so occupy different positions in the scene. Thus, the combined cost assigned to the pair would indicate a much lower similarity due to the dissimilar neighboring frame pairs. The net result is that the undesirable transitions would no longer have a low cost associated with them. Thus, choosing just those transitions associated with a lower cost would ensure the dynamics of the motion are preserved.
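The neighbor-weighted combination can be illustrated with a short sketch. It assumes a square matrix of pairwise frame costs and a hypothetical three-tap weighting of (0.25, 0.5, 0.25), in which the selected pair counts twice as much as each neighboring pair; the specific weights and the renormalization at the ends of the clip are assumptions made for the example.

```python
def neighbor_weighted_costs(D, weights=(0.25, 0.5, 0.25)):
    """Combine D[i][j] with the costs of the corresponding neighboring pairs.

    weights[k] applies to the pair (i + k - half, j + k - half), so the middle
    weight belongs to the selected pair itself. Near the ends of the clip,
    missing neighbors are skipped and the remaining weights are renormalized.
    """
    n = len(D)
    half = len(weights) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total, weight_sum = 0.0, 0.0
            for k, w in enumerate(weights):
                ii, jj = i + k - half, j + k - half
                if 0 <= ii < n and 0 <= jj < n:
                    total += w * D[ii][jj]
                    weight_sum += w
            out[i][j] = total / weight_sum
    return out

# Pendulum positions over time: swinging right (0, 1, 2), then left (1, 0), ...
positions = [0, 1, 2, 1, 0, 1, 2, 1]
D = [[float(abs(a - b)) for b in positions] for a in positions]
W = neighbor_weighted_costs(D)
```

In this toy pendulum, frames 1 and 3 show the same position but opposite directions of travel. Their raw cost is zero, yet their combined cost is raised by the dissimilar neighbors, whereas frames 1 and 5 (same position, same direction) keep a combined cost of zero.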
So far, the described process involves determining the costs of transition based on the comparison of a current frame in the sequence (via the following frame) with all other frames. Thus, the decision on how to continue the generated sequence is made without planning ahead on how to continue the sequence in the future. This works well with one exception. It must be remembered that the input video upon which the synthesized video is based has a finite length and so there is always a last frame. At some point in the synthesis of the new video, the last frame will be reached. However, unlike all the previous frames there is no "next frame". Accordingly, a jump must be made to some previous frame. But what if there are no previous frames that would continue the sequence smoothly enough that a viewer would not notice the jump? In such a case the process has run into a "dead end", where any available transition might be visually unacceptable.
It is possible to avoid the dead end issue by improving the foregoing process to recognize that a smoother transition might have been possible from an earlier frame. The process as described so far only takes into account the cost incurred by the present transition, and not those of any future transitions. However, if the cost associated with making a particular transition were modified to account for future costs incurred by that decision, no dead end would be reached. This is because the high cost associated with the transition at the dead end would be reflected in the cost of the transition which would ultimately lead to it. If the future costs associated with making a transition are great enough the transition would no longer be attractive and an alternate, less "costly" path would be taken. One way of accomplishing the task of accounting for the future transition costs is to sum the previously described cost values with a cost factor based on the total expected cost of the future sequence generated if a certain transition decision is made. To arrive at a stable expression of costs, the future costs would be discounted.
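One possible way to fold discounted future costs into each transition cost is sketched below. The sketch assumes a square cost matrix C, a hypothetical discount factor alpha between 0 and 1, and that the future cost of landing on a frame is estimated by the cheapest transition available from that frame; the repeated-update scheme is likewise an assumption made for illustration rather than the specific procedure of the invention.

```python
def anticipated_costs(C, alpha=0.9, iterations=200):
    """Return F where F[i][j] = C[i][j] + alpha * min(F[j]).

    C[i][j] is the immediate cost of the transition from frame i to frame j.
    The update is repeated a fixed number of times; because alpha < 1, the
    values settle toward a stable solution.
    """
    n = len(C)
    F = [row[:] for row in C]
    for _ in range(iterations):
        cheapest_exit = [min(F[j]) for j in range(n)]
        F = [[C[i][j] + alpha * cheapest_exit[j] for j in range(n)]
             for i in range(n)]
    return F

# Hypothetical three-frame example: every exit from frame 2 is expensive,
# so frame 2 behaves like a dead end.
C = [
    [5.0, 1.0, 0.5],
    [1.0, 5.0, 5.0],
    [10.0, 10.0, 10.0],
]
F = anticipated_costs(C)
```

Judged by immediate cost alone, the jump from frame 0 to frame 2 looks cheapest. Once the discounted future costs are included, the jump from frame 0 to frame 1 becomes the less costly choice because it avoids the dead end.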
It is noted that the transition cost could also include a user-specified cost factor that would help to minimize the transition costs between frames of the input video clip that depict motion sequences that the user wants in the generated video sequence. It is further noted that only a selected number of the frames of the input video need be included in the analysis. For example, the number of computations required to compute the cost factors could be minimized by eliminating some less useful frames in the input video from consideration. This would reduce the number of transition costs that have to be computed. Finally, it is noted that the synthesizing process, which will be discussed shortly, could be simplified if the transition costs could be limited to those that are more likely to produce acceptable transitions between frames of the newly generated video sequence. This could be accomplished by computing a coarse indication of the similarity of two frames first, and computing transition costs for only those frames that are similar enough to produce relatively low transition costs.
The foregoing analysis results in a cost being assigned to potential transitions between frames of the input video. During the synthesis of the desired new video sequence, the basic idea will be to choose only those transitions from frame to frame that are acceptable. Ideally, these acceptable transitions are those that will appear smooth to the viewer. However, even in cases where there is no choice that will produce an unnoticeable transition, it is still desirable to identify the best transitions possible. Certain techniques can be employed to smooth out these rough transitions as will be explained later.
In regard to the synthesis of a continuous, non-looping video sequence, a way of accomplishing the foregoing goals is to map the previously computed transition costs to probabilities through a monotonically decreasing function to characterize the costs via a probability distribution. The probability distribution is employed to identify the potentially acceptable transitions between frames of the input video clip. Prior to actually selecting the order of the frames of the input video that are to be played in a synthesizing process, the number of potentially acceptable transitions that there are to choose from can be pruned to eliminate those that are less desirable and to reduce the processing workload. One possible pruning procedure involves selecting only those transitions associated with local maxima in the probability matrix for a given source and/or destination frame as potentially acceptable transitions. Another pruning strategy involves setting all probabilities below a prescribed minimum probability threshold to zero. It is noted that these two strategies can also be combined by first selecting the transitions associated with the local probability maxima and then setting to zero the probabilities associated with any of the selected transitions that fall below the minimum probability threshold.
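The mapping and pruning steps can be sketched as follows. The sketch assumes an exponential function exp(-cost / sigma) as the monotonically decreasing mapping, with sigma a hypothetical tuning parameter, and it looks for local maxima along each row, i.e., for a given source frame. Renormalizing each row after thresholding is an added convenience for the example and is not required by the foregoing description.

```python
import math

def costs_to_probabilities(C, sigma):
    """Map each row of costs to probabilities; lower cost gives higher probability."""
    P = []
    for row in C:
        weights = [math.exp(-c / sigma) for c in row]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def keep_row_local_maxima(P):
    """For each source frame, keep only entries not exceeded by a row neighbor."""
    out = []
    for row in P:
        kept = []
        for j, p in enumerate(row):
            left = row[j - 1] if j > 0 else -1.0
            right = row[j + 1] if j + 1 < len(row) else -1.0
            kept.append(p if p >= left and p >= right else 0.0)
        out.append(kept)
    return out

def threshold_and_renormalize(P, min_prob):
    """Zero out probabilities below min_prob, then rescale each row to sum to 1."""
    out = []
    for row in P:
        kept = [p if p >= min_prob else 0.0 for p in row]
        total = sum(kept)
        out.append([p / total for p in kept] if total > 0 else kept)
    return out

# One hypothetical row of transition costs for a single source frame.
C = [[0.1, 0.2, 3.0, 0.15, 4.0]]
P = costs_to_probabilities(C, sigma=0.5)
pruned = threshold_and_renormalize(keep_row_local_maxima(P), min_prob=0.05)
surviving = [j for j, p in enumerate(pruned[0]) if p > 0]
```

Here the two strategies are combined in the order described: the local maxima are selected first, and the minimum probability threshold is applied to the survivors.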
Once the frames of the input video clip have been analyzed and a set of acceptable transitions identified, these transitions are used to synthesize the aforementioned continuous, non-looping video sequence. Essentially, synthesizing the video sequence entails specifying an order in which the frames of the input video clip are to be played. More particularly, synthesizing a continuous, non-looping video sequence involves first specifying a starting frame. The starting frame can be any frame of the input video sequence that comes before the frame of the sequence associated with the last non-zero-probability transition. The next frame is then chosen by selecting a frame previously identified as having a potentially acceptable transition between the immediately preceding frame (which in this first instance is the starting frame) and the remaining selected frames. If there is more than one qualifying frame, then one of them is selected at random, according to the previously computed probability distribution. This process is then repeated for as long as the video is running.
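The frame selection loop just described can be sketched as follows. It assumes a pruned probability matrix P in which P[i][j] is the probability that frame j is played immediately after frame i and every row has at least one non-zero entry; the matrix values and the function name are hypothetical.

```python
import random

def synthesize_order(P, start, count, rng):
    """Return a list of `count` frame indices, beginning with `start`.

    Each subsequent frame is drawn at random according to the row of P
    belonging to the frame that was just played.
    """
    order = [start]
    current = start
    for _ in range(count - 1):
        row = P[current]
        current = rng.choices(range(len(row)), weights=row, k=1)[0]
        order.append(current)
    return order

# Hypothetical four-frame clip: frame 2 may jump back to frame 0 or continue
# to frame 3, and the last frame must jump back to frame 1.
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
    [0.0, 1.0, 0.0, 0.0],
]
order = synthesize_order(P, start=0, count=20, rng=random.Random(0))
```

In a continuous setting the loop would simply keep running for as long as the video is playing, rather than stopping after a fixed count.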
For occasions where it is desirable to produce a loopable video having a prescribed length, the synthesizing process is different from that associated with the continuous, non-looping embodiment. In the foregoing analysis process, a cost was assigned to each potential transition between the frames of the input video. These costs are used to synthesize a loopable, fixed length video sequence by first identifying acceptable primitive loops within the input video frames. These acceptable primitive loops are then used to construct compound loops having the desired fixed length. A primitive loop is a sub-sequence of the original video frames that terminates in a jump backwards to the first frame of the sub-sequence. Thus, a primitive loop is a sub-sequence of frames that would run to its last frame and then jump back to its beginning frame. The primitive loops become the basic building blocks for generating the loopable fixed length video sequences. To identify acceptable primitive loops, all the primitive loops that could be formed from the frames of the input video are identified. Once identified, the transition cost of each primitive loop is computed. In regard to computing these loop costs, the previously-discussed future cost computations are not applied when creating the transition cost matrix. Further, in order to reduce the amount of processing required to identify the low cost video loops having the desired length, a transition pruning procedure can be implemented to reduce the number of primitive loops to be considered. Specifically, after pruning all transitions which are not local minima in the difference matrix, the average cost for each transition is computed, and only the best N transitions (and thus primitive loops) are considered in the synthesis process.
Another way to reduce the number of primitive loops considered when building video loops is to eliminate all primitive loops whose average transition costs exceed a prescribed maximum threshold.
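The identification and pruning of primitive loops can be sketched as follows. The sketch assumes a square matrix C in which C[i][j] holds the (average) cost of the transition from frame i to frame j, treats every backward transition (j no later than i) as a candidate primitive loop, and omits the local-minima pruning for brevity. The PrimitiveLoop type and the parameter names are hypothetical.

```python
from typing import NamedTuple

class PrimitiveLoop(NamedTuple):
    start: int    # frame jumped back to
    end: int      # frame from which the backward jump is made
    cost: float   # cost of the backward transition

    @property
    def length(self):
        """Number of frames played in one pass through the loop."""
        return self.end - self.start + 1

def find_primitive_loops(C, best_n=None, max_cost=None):
    """List backward transitions as primitive loops, cheapest first.

    max_cost drops loops above a cost threshold; best_n keeps only the
    N cheapest of those that remain.
    """
    loops = [PrimitiveLoop(j, i, C[i][j])
             for i in range(len(C)) for j in range(i + 1)]
    if max_cost is not None:
        loops = [lp for lp in loops if lp.cost <= max_cost]
    loops.sort(key=lambda lp: lp.cost)
    return loops if best_n is None else loops[:best_n]

# Hypothetical costs; only entries on or below the diagonal are used.
C = [
    [9.0, 9.0, 9.0, 9.0],
    [2.0, 9.0, 9.0, 9.0],
    [9.0, 1.0, 9.0, 9.0],
    [4.0, 9.0, 3.0, 9.0],
]
best_three = find_primitive_loops(C, best_n=3)
under_threshold = find_primitive_loops(C, max_cost=2.5)
```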
The acceptable primitive loops are combined to form the aforementioned compound loops. A compound loop is a loop made up of primitive loops having overlapping ranges. In other words, each subsequent primitive loop in the compound loop has a beginning sequence (of one or more frames) that is identical to the ending sequence of the preceding primitive loop. A compound loop having the desired length can thus be formed from primitive loops to generate a fixed length sequence. It is noted that a fixed length sequence is loopable, which means that it would end in a smooth transition from the last frame back to the first frame, so that it can be played continuously if desired.
A preferred method for finding a suitable set of primitive loops whose ranges overlap and which sum to the desired length of the compound loop, begins with the use of a dynamic programming procedure. Essentially, this method involves creating a table listing the lowest cost compound loops for each of a set of given loop lengths that contains at least one instance of a particular primitive loop, for each primitive loop of interest. The table can be used to find the compound loop exhibiting the lowest total cost among those listed for a particular loop length. The total cost of a compound loop is simply the sum of the average costs associated with the primitive loops that form the compound loop. After finding the lowest cost compound loop using the dynamic programming method, the primitive loops making up the loop are then sequenced into a legally playable order.
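A simplified sketch of such a table-based search is given below. It assumes each primitive loop is supplied as a (start, end, cost) tuple, treats two loops as overlapping when their frame ranges share at least one frame, and fills each table cell by joining a shorter entry from the same column with an entry from a column whose primitive loop overlaps it. These details, and the function name, are assumptions made for illustration and may differ from the preferred method. The final step of sequencing the chosen loops into a playable order is not shown.

```python
def lowest_cost_compound_loop(loops, target_length):
    """Return (total_cost, [loop indices]) for the cheapest compound loop of
    exactly target_length frames, or None if no such loop can be formed.

    table[l][k] holds the cheapest compound loop of length l that contains
    at least one instance of primitive loop k.
    """
    n = len(loops)
    length = [end - start + 1 for start, end, _ in loops]
    overlaps = [[loops[a][0] <= loops[b][1] and loops[b][0] <= loops[a][1]
                 for b in range(n)] for a in range(n)]
    table = [[None] * n for _ in range(target_length + 1)]
    for l in range(1, target_length + 1):
        for k in range(n):
            # The primitive loop by itself, if it has exactly this length.
            best = (loops[k][2], [k]) if length[k] == l else None
            # Otherwise join a shorter entry of column k with an entry from
            # a column whose primitive loop overlaps loop k.
            for a in range(1, l):
                left = table[a][k]
                if left is None:
                    continue
                for m in range(n):
                    right = table[l - a][m]
                    if right is None or not overlaps[k][m]:
                        continue
                    cost = left[0] + right[0]
                    if best is None or cost < best[0]:
                        best = (cost, left[1] + right[1])
            table[l][k] = best
    candidates = [cell for cell in table[target_length] if cell is not None]
    return min(candidates, key=lambda cell: cell[0]) if candidates else None

# Hypothetical primitive loops as (start, end, cost).
loops = [(0, 1, 2.0), (1, 2, 1.0), (5, 7, 0.1)]
four_frames = lowest_cost_compound_loop(loops, 4)
five_frames = lowest_cost_compound_loop(loops, 5)
```

In this example a four-frame loop is best formed by playing the second primitive loop twice. No five-frame loop exists, because the only three-frame primitive loop does not overlap either of the two-frame loops.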
The next phase in the generation of a new video sequence from the frames of the input video clip involves rendering the synthesized video. In regard to the continuous, non-looping video sequence, the new video is rendered by playing the frames of the input video clip in the order specified in the synthesizing process. As the generated video is continuous, the synthesizing process can be on-going with the rendering process. This is possible because the synthesizing process can specify frames to be played faster than they can be played in the rendering process. In regard to the loopable, fixed length sequence embodiment, the primitive loops making up the compound loop defining the fixed-length video and their order were identified in the sequencing procedure described previously. Thus, the rendering of a loopable fixed length video sequence simply involves playing the input video frames in the order indicated in the synthesizing process. This can also include repeating the sequence as many times as desired since the last frame of the synthesized video sequence is designed to acceptably transition back to the first frame.
Although the foregoing process is tailored to identify low cost transitions, and so introduce only small, ideally unnoticeable, discontinuities in the motion, as indicated previously there may be cases where such transitions are not available in the frames of the input video clip. In cases where transitions having costs that will produce noticeable jumps in the synthesized video must be employed, techniques can be applied in the rendering process to disguise the transition discontinuities and make them less noticeable to the viewer. One of the smoothing techniques that could be employed is a conventional blending procedure. This would entail blending the images of the sequence before and after the transition to produce a smoother transition. Preferably, the second sequence would be gradually blended into the first, while both sequences are running using a crossfading procedure. Another smoothing technique that could be employed would be to warp the images towards each other. This technique would prevent the ghosting associated with the crossfade procedure as common features of the images are aligned.
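The crossfading procedure can be illustrated with a short sketch. It assumes frames are flat lists of pixel values, that the outgoing and incoming runs of frames have the same length, and that the blend weight rises linearly over the run; the linear ramp and the function name are assumptions made for the example.

```python
def crossfade(outgoing, incoming):
    """Blend two equally long runs of frames while both keep playing.

    The weight given to the incoming sequence rises linearly, so the result
    starts close to the outgoing frames and ends close to the incoming ones.
    """
    steps = len(outgoing)
    blended = []
    for t, (a, b) in enumerate(zip(outgoing, incoming)):
        w = (t + 1) / (steps + 1)  # weight of the incoming frame
        blended.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return blended

# Hypothetical two-pixel frames: a dark outgoing run and a brighter incoming run.
outgoing = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
incoming = [[4.0, 8.0], [4.0, 8.0], [4.0, 8.0]]
blended = crossfade(outgoing, incoming)
```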
While the foregoing description involves analyzing the frames of the input video clip as a single unit, this need not be the case. For example, some scenes are characterized by multiple, independent (i.e., non-overlapping) motions. While there may not be enough repetitiveness in the motion of such a scene to make the process according to the present invention particularly advantageous when considering the frames of such a video as a whole, each of the regions of independent motion may exhibit the degree of repetitiveness needed. In such cases it would be possible to divide each frame of the input video clip into regions of independent motion. The corresponding regions in each frame are then analyzed and videos are synthesized for each independent motion region, using the previously described processes.
The rendering process associated with a video clip that has been analyzed and synthesized on a regional basis via the independent motion technique includes an additional procedure to create new frames from the extracted regions of the original input video. Essentially, each new frame of the rendered video is created by compositing the independent motion regions from the synthesized independent motion video based on the order of the frames specified in those videos. To avoid seams between the independent motion regions, the boundary areas can be blended together in each composite frame to smooth the transition.
Another example of a scenario where the frames of the input video clip could be advantageously segmented prior to analysis is where the video includes an object that is of interest, but where the rest of the scene is not. The object of interest could be extracted from each frame and a new video sequence of just the object generated using the previously-described processes. It is noted that a video generated in this way is referred to as a video sprite. One use for a video sprite is to insert it into an existing video. This would be accomplished by inserting the frames of the video sprite into the frames of the existing video in corresponding order. The frames of the video sprite would be inserted into the same location of each frame of the existing video. The result would be a new video that includes the object associated with the video sprite.
Another application of the video sprite concept involves objects that move about the scene in the input video clip, such as an animal, vehicle, or person. These objects typically exhibit a generally repetitive motion, independent of their position. Thus, the object could be extracted from the frames of the input video and processed in accordance with the present invention to generate a new video sequence or video sprite of that object. In addition, the translation velocity of the object for each frame would be computed and associated with each frame of the object in the newly generated video. The portion of previously-described analysis involving computing a transition cost between the frames of the input video clip could be modified to add a cost factor based on the difference in velocity of the object between the frames involved. This would tend to influence the selection of acceptable transitions to ensure a smooth translation motion is imparted to the rendered video. The rendering process itself would also be modified to include an additional procedure for inserting the extracted regions depicting the object (i.e., the frames of the video sprite) into a previously derived background image in the order specified by the synthesis procedure, and at a location dictated by a prescribed trajectory of the object in the scene. This can be done by making the centroid of the inserted extracted region correspond with a desired trajectory point. Thus, the generated video would show the object moving naturally about the scene along the prescribed trajectory. This trajectory would mimic that of the object in the input video clip.
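The two modifications just described, the velocity-based cost factor and the centroid-based placement, can be sketched as follows. The sketch assumes two-dimensional velocities and positions expressed as (x, y) tuples, and a hypothetical weighting factor beta that scales the velocity difference relative to the image-based cost; the function names are likewise hypothetical.

```python
import math

def velocity_aware_cost(image_cost, velocity_i, velocity_j, beta=1.0):
    """Add a penalty proportional to the difference in object velocity
    between the two frames involved in a transition."""
    dv = math.hypot(velocity_i[0] - velocity_j[0],
                    velocity_i[1] - velocity_j[1])
    return image_cost + beta * dv

def placement_offset(sprite_centroid, trajectory_point):
    """Offset at which to paste a sprite frame into the background so that
    the sprite's centroid lands on the desired trajectory point."""
    return (trajectory_point[0] - sprite_centroid[0],
            trajectory_point[1] - sprite_centroid[1])

cost = velocity_aware_cost(0.5, (3.0, 0.0), (0.0, 4.0), beta=2.0)
offset = placement_offset((10, 20), (100, 50))
```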
Adding sound to video textures is also possible. In essence, sound samples are associated with each frame and played back with the video frames selected to be rendered. To mask any popping effects, the same multi-way cross-fading technique described previously in connection with rendering new video can be employed.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany