Editing video is a time-intensive task. Editing often requires much more time than the actual preparation and filming of a scene. It also often requires expensive software that takes extensive experience or training to use properly. Accordingly, the ability to edit video content is often out of reach for a typical person. This is unfortunate inasmuch as the ability of the typical person to record video has never been greater due to smartphone cameras, mountable action cameras (e.g., GOPRO cameras), and ever smaller and more affordable handheld video cameras.
Automatically analyzing multiple multimedia materials of different types to compose a new video is a very useful but difficult task. The difficulty comes from two aspects. First, it is difficult to define a generic strategy for selecting the appropriate portions from the inputs. A number of reported research works are directed to home video or sports video, because domain-specific knowledge can be used extensively in the video selection process for these types of video inputs. Some commercial systems are also available that allow the user to specify multiple video and image inputs to generate a video highlight. However, a variety of video genres exist, and hence a generic video selection criterion is more appropriate. Second, supporting multiple input materials of different types is difficult. Typical multimedia inputs consist of at least visual and auditory data in image, video, and audio formats. A practical system should be able to analyze all of these input formats to generate good video output.
Automatically composing new video from existing material would be a very useful function, and hence has attracted both research attention and industrial effort for many years. In the research domain, an automatic home video editing system has been reported in which home video content is analyzed to remove low-quality portions and to detect portions with low speech activity. The remaining portions are then concatenated to compose a new video. Similarly, for sports video, most automatic video editing works focus on how to select semantic events from a lengthy sports video to compose a game highlight.
There are also existing commercial products, such as Google Magisto, that accept user-uploaded video and generate a video highlight by selecting portions from the uploaded videos and concatenating them with transition effects. Another such product is Muvee, which automatically aligns the visual content to a pre-defined soundtrack and composes a music video. There are, however, limitations to these approaches. First, there exist no generic video selection rules that work across different video genres. For this reason, a sports highlight generation system will not work with home video inputs. Second, visual content usually refers not only to video but also to image content. Some existing work applies different rules for image and video. However, these approaches require a large number of heuristically tuned parameters.
The systems and methods disclosed herein provide an improved approach for automated editing of video content.