Deep generative models have recently received an increasing amount of attention, not only because they provide a means to learn deep feature representations in an unsupervised manner that can potentially leverage all the unlabeled images on the Internet for training, but also because they can be used to generate novel images useful for various vision applications. As steady progress toward better image generation is made, it is also important to study the video generation problem. However, the extension from generating images to generating videos turns out to be a highly challenging task, even though the generated data has just one more dimension: the time dimension.
The video generation problem may be much harder for the following reasons. First, since a video is a spatio-temporal recording of visual information of objects performing various actions, a generative model needs to learn plausible physical motion models of the objects in addition to learning appearance models for them. If the learned object motion model is incorrect, the generated video may contain objects performing physically impossible motion. Second, the time dimension brings in a huge amount of variation. Consider the range of speeds at which a person can perform a squat movement. Each speed pattern results in a different video, although the appearance of the person in the videos is the same. Third, as human beings have evolved to be rather sensitive to motion, motion artifacts are particularly perceptible.
There is a need for addressing these issues and/or other issues associated with the prior art.