Prior to the background of the invention being set forth herein, it may be helpful to provide definitions of certain terms that will be used hereinafter.
The term “video production” as used herein is the process of creating video by capturing moving images (videography) and creating combinations and reductions of parts of this video in live production and post-production (video editing). Video production can be generated from any type of media entity which defines still images as well as video footage of all kinds. In most cases, the captured video will be recorded on electronic media such as video tape, hard disk, or solid-state storage, but it might only be distributed electronically without being recorded. It is the equivalent of filmmaking, but with images recorded electronically instead of on film stock.
The term “background” as used herein is the part of an image that represents the stationary part of the scene and that serves as the farthest part of the scene relative to the camera capturing the scene. When the camera is non-stationary, the only movement of the background is due to the movement of the camera. The background may also include moving objects that were recognized as “non-important” based on some importance criteria.
The term “foreground object” as used herein is one or more parts of an image that represent objects that were indicated as “important”, meaning of significance to understanding the scene. Foreground objects therefore also include stationary objects, such as trees and standing cars, as long as they have semantic significance in understanding the scene.
The term “semantic segmentation” or “semantic image segmentation” as used herein is the process of linking or mapping each pixel of an image or a video into one of a plurality of physical object classes (e.g., human, car, road, tree), thereby providing an understanding of the scene on the pixel level. When semantic segmentation is applied to an image of a scene, the foreground objects are efficiently segmented from the background, allowing various operations to be applied to the various objects within the scene on the pixel level. In image processing terminology, the semantic segmentation is indicative of a support of at least one foreground object, meaning all pixels that belong to the at least one foreground object.
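By way of a non-limiting illustration, the pixel-level mapping described above can be sketched as follows. The sketch assumes a hypothetical array of per-class scores (for example, the output of some segmentation model, which is not specified here) and assigns each pixel to its highest-scoring class. The class list, the function names, and the choice of which classes count as foreground are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical class list; index 0 is treated as background for illustration.
CLASSES = ["background", "human", "car", "tree"]


def label_map_from_scores(scores: np.ndarray) -> np.ndarray:
    """Map each pixel to the index of its highest-scoring class.

    `scores` is assumed to have shape (num_classes, height, width).
    """
    return np.argmax(scores, axis=0)


def foreground_support(labels: np.ndarray, foreground_classes: list) -> np.ndarray:
    """Return a boolean mask of all pixels belonging to the given classes."""
    class_ids = [CLASSES.index(name) for name in foreground_classes]
    return np.isin(labels, class_ids)


# Toy 2x3 image: background everywhere except one "human" and one "car" pixel.
scores = np.zeros((len(CLASSES), 2, 3))
scores[0] = 0.5          # background baseline score
scores[1, 0, 0] = 0.9    # "human" wins at the top-left pixel
scores[2, 1, 2] = 0.8    # "car" wins at the bottom-right pixel

labels = label_map_from_scores(scores)
support = foreground_support(labels, ["human", "car"])
```

Here `support` is the set of pixels belonging to the chosen foreground classes, i.e., the support of the foreground objects in the sense used above.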
More specifically, semantic segmentation is the process of automatically separating between different objects in the scene, and between these objects and the background (which can also be addressed as a background object). The notion ‘semantic’ means that the separation is based on semantic notions, i.e., person, cat, chair, and the like, rather than on low-level visual cues such as edges. The output of the semantic segmentation process is one or more masks, which represent the support of each layer. For example, in the simplest case, there is a single mask, having a value of 1 for the pixels belonging to one of the foreground object classes such as a ‘person’, and 0 for the pixels in the background. These values may also be intermediate values between 0 and 1 (e.g., in the case of soft matting). Usually, there are two types of semantic segmentation: class-based and instance-based. In class-based segmentation, all pixels belonging to the same class are assigned to the same segment (e.g., all people in the scene), while in instance-based segmentation, each different instance (e.g., each person) is assigned a different segment. This work deals with both. Semantic segmentation is an active research topic: some approaches address regular (class-based) semantic segmentation and others address instance-based semantic segmentation. There are also more traditional ways to compute the segmentation, for example based on motion in a video (separating moving objects from the background).
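The difference between class-based and instance-based masks can be illustrated with a minimal sketch. It assumes two hypothetical integer label maps of the same size, one holding a class identifier per pixel and one holding an instance identifier per pixel (with 0 meaning no instance); these inputs, the identifier values, and the function names are assumptions for illustration and do not describe any particular segmentation method.

```python
import numpy as np

PERSON = 1  # hypothetical class identifier for 'person'


def class_mask(class_labels: np.ndarray, class_id: int) -> np.ndarray:
    """Single mask: 1 for every pixel of the class, 0 elsewhere."""
    return (class_labels == class_id).astype(np.float32)


def instance_masks(class_labels: np.ndarray,
                   instance_labels: np.ndarray,
                   class_id: int) -> dict:
    """One mask per instance of the class, keyed by instance identifier."""
    in_class = class_labels == class_id
    instance_ids = np.unique(instance_labels[in_class])
    return {
        int(i): (in_class & (instance_labels == i)).astype(np.float32)
        for i in instance_ids
    }


# Toy 2x4 scene with two people: one on the left, one on the right.
class_labels = np.array([[1, 1, 0, 1],
                         [0, 0, 0, 1]])
instance_labels = np.array([[1, 1, 0, 2],
                            [0, 0, 0, 2]])

people = class_mask(class_labels, PERSON)                          # both people together
per_person = instance_masks(class_labels, instance_labels, PERSON)  # one mask each
```

In this sketch the masks are binary. A soft matte would use the same layout but hold intermediate values between 0 and 1 near object boundaries.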
The term “video transition” or simply “transition” as used herein is a visual operation used during the post-production process of video production in which separate shots are combined in order to present a change in the scene in a manner other than a mere “cut” between the shots. An example is a fade-out of one shot into a fade-in of the consecutive shot.
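For illustration only, a conventional transition of the kind mentioned above can be sketched as a linear cross-fade between the last frame of one shot and the first frame of the next. This is a generic blend that does not use any segmentation; the function name, the 8-bit frame format, and the use of a single frame from each shot are assumptions made to keep the sketch short.

```python
import numpy as np


def cross_fade(frame_a: np.ndarray, frame_b: np.ndarray, num_frames: int) -> list:
    """Blend linearly from frame_a to frame_b over num_frames output frames.

    Both frames are assumed to be uint8 arrays of identical shape.
    """
    a = frame_a.astype(np.float32)
    b = frame_b.astype(np.float32)
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        # t = 0 shows only the outgoing shot, t = 1 only the incoming shot.
        blended = (1.0 - t) * a + t * b
        frames.append(np.clip(np.rint(blended), 0, 255).astype(np.uint8))
    return frames


# Toy example: fade from a black frame to a uniform grey frame in 5 steps.
outgoing = np.zeros((2, 2, 3), dtype=np.uint8)
incoming = np.full((2, 2, 3), 200, dtype=np.uint8)
transition = cross_fade(outgoing, incoming, 5)
```

A real transition would typically blend successive frames from both shots rather than a single frame from each.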
It would, therefore, be advantageous to be able to automatically generate video transitions or visual effects based on a semantic segmentation of the visual media.