The present disclosure is generally directed to video editing systems and, more particularly, to a system and method for creating a composite video work.
Traditional non-linear digital video editing systems create output clips frame-by-frame, by reading input clips, performing transformations, rendering titles or effects, and then writing individual frames to an output file. This output file must then be streamed to media consumers.
There are several problems with this approach. First, to splice multiple videos together into an edited video, all video files must be stored locally, and must be of sufficiently high quality that recompression for re-streaming will not result in noticeable quality loss. Second, when the edited video is created, it must be stored in addition to the input clips, and it consumes storage space proportional to its length. Creating multiple edits of the same input videos consumes additional storage, which makes mass customization impractical. Third, when the input videos are composited to create the output video, every frame of the output must be rendered at the exact frame size and format of the output video. This requires that input videos using different resolutions, color spaces, and frame rates be upscaled, downscaled, color-space converted, and/or re-timed to match the output media type. Finally, even if the original videos are available via network streams, delivering the edited output video to a consumer requires that the output video be hosted (served on a network) as well.
There is a technology component in Windows XP® software called the Video Mixing Renderer 9 (VMR9), part of the DirectShow® API. In DirectShow®, all streaming media files are played by constructs called “filter graphs,” in which a directed graph is created of several media “filters.” For example, a graph might start with a “file reader filter” (or a “network reader filter,” in a network streaming case) to define an AVI input stream of bits (from disk or network, respectively). This stream then passes through an AVI splitter filter, which converts the AVI format file into a series of raw media streams, followed by a video decoder filter, which converts compressed video into uncompressed RGB (or YUV) video buffers, and finally a video renderer, which actually draws the video on the screen.
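The filter chain described above can be illustrated with a toy model. This is only a sketch of the idea of a directed chain of filters whose outputs must match the next filter's input; the names `Filter`, `build_graph`, and `AVI_PLAYBACK` are hypothetical and are not part of the DirectShow® API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for a media filter; not a DirectShow type."""
    name: str
    accepts: Optional[str]   # kind of data consumed (None for a source)
    produces: Optional[str]  # kind of data emitted (None for a renderer)


def build_graph(filters: List[Filter]) -> List[str]:
    """Connect filters in order, checking each output feeds the next input."""
    for upstream, downstream in zip(filters, filters[1:]):
        if upstream.produces != downstream.accepts:
            raise ValueError(
                f"cannot connect {upstream.name} to {downstream.name}: "
                f"{upstream.produces!r} is not {downstream.accepts!r}"
            )
    return [f.name for f in filters]


# The AVI playback chain from the text: reader, splitter, decoder, renderer.
AVI_PLAYBACK = [
    Filter("file reader", None, "AVI byte stream"),
    Filter("AVI splitter", "AVI byte stream", "compressed video"),
    Filter("video decoder", "compressed video", "uncompressed RGB"),
    Filter("video renderer", "uncompressed RGB", None),
]

if __name__ == "__main__":
    print(" -> ".join(build_graph(AVI_PLAYBACK)))
```

In this sketch, skipping the splitter and connecting the file reader directly to the video decoder raises a `ValueError`, which loosely mirrors the requirement that each filter receive data in a form it understands.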
The Microsoft VMR9 is a built-in proprietary video renderer that draws video frames to Direct3D® hardware surfaces. A “surface” is an image that is (typically) stored entirely in ultra-high-performance graphics controller memory, and can be drawn onto one or more triangles as part of a fully hardware-accelerated rendering pipeline. The primary goal of the VMR9 is to allow video to be rendered into these surfaces, then delivered to the application hosting the VMR9's filter graph for inclusion in a Direct3D® rendered scene. The advantage of this approach is that many highly CPU-intensive operations, such as de-interlacing the output video, resizing it (using bilinear or bicubic resampling), and color-correcting it, are all performed virtually for free by modern consumer graphics hardware, and most of these operations are complete before the video surface even becomes available to the application programmer.
The VMR9 has a mode of operation called “mixing mode,” in which a small number of video streams can be “mixed,” or composited, together at rendering time. The streams can vary in frame size, frame rate, and other media-type parameters. When frames are issued to the renderer by upstream filters (such as the compressed video decoder), the renderer composites the frames together and generates a single Direct3D® surface containing the composite. The user can control the alpha channel value and the source and destination rectangles for each input video stream.
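The per-stream controls described above (an alpha value plus source and destination rectangles) can be sketched on the CPU as follows. This is a simplified illustration, not the VMR9's actual implementation: frames are assumed to be small grayscale grids, scaling is nearest-neighbor, the source rectangle is assumed to lie within its frame, and the names `StreamLayer` and `composite` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[List[float]]         # rows of grayscale values in [0, 1]
Rect = Tuple[int, int, int, int]  # (left, top, width, height)


@dataclass
class StreamLayer:
    """One input stream's current frame plus its mixing parameters."""
    frame: Frame
    alpha: float  # 0.0 = fully transparent, 1.0 = fully opaque
    src: Rect     # region of the input frame to use
    dst: Rect     # region of the output surface to cover


def composite(width: int, height: int, layers: List[StreamLayer]) -> Frame:
    """Blend layers in order onto a black surface of the given size."""
    out = [[0.0] * width for _ in range(height)]
    for layer in layers:
        sl, st, sw, sh = layer.src
        dl, dt, dw, dh = layer.dst
        for y in range(dh):
            for x in range(dw):
                oy, ox = dt + y, dl + x
                if not (0 <= oy < height and 0 <= ox < width):
                    continue  # clip to the output surface
                # Nearest-neighbor mapping from destination to source pixel.
                sy = st + (y * sh) // dh
                sx = sl + (x * sw) // dw
                value = layer.frame[sy][sx]
                out[oy][ox] = (layer.alpha * value
                               + (1.0 - layer.alpha) * out[oy][ox])
    return out
```

For example, compositing an opaque 2×2 white background and then a 1×1 black overlay at 50% alpha in the top-left corner leaves that corner at mid-gray and the remaining pixels white. Because each layer carries its own source and destination rectangles, inputs of different frame sizes can be placed on the same output surface.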
There is a significant deficiency to this approach, beyond the simple issue that the performance of the compositing operation tends to be poor: DirectShow® requires that all input streams to the VMR9 be members of the same filter graph, and thus that they all share the same stream clock. This sharing of the stream clock means that if several different video clips are all rendered to inputs on a single VMR9, and the filter graph is told to seek to 1:30 on its media timeline, each video clip will seek to 1:30. The same holds for playback rate; it is not possible to change the playback rate (for example, to 70% of real time) for one stream without changing it for all streams. Finally, one stream cannot be paused, stopped, or rewound independently of the others.
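The consequence of a shared stream clock can be shown with a small model. This is a sketch under stated assumptions: `StreamClock` and `SharedClockGraph` are hypothetical names, and the model captures only the one-clock-per-graph constraint described above, not any real DirectShow® behavior.

```python
from typing import Dict, List


class StreamClock:
    """A media timeline position (in seconds) that advances at a given rate."""

    def __init__(self) -> None:
        self.position = 0.0
        self.rate = 1.0  # 1.0 = real time

    def seek(self, seconds: float) -> None:
        self.position = seconds

    def advance(self, wall_seconds: float) -> None:
        self.position += wall_seconds * self.rate


class SharedClockGraph:
    """All clips in the graph read their position from a single clock."""

    def __init__(self, clip_names: List[str]) -> None:
        self.clock = StreamClock()
        self.clips = list(clip_names)

    def positions(self) -> Dict[str, float]:
        # Every clip reports the same position; there is no per-clip state.
        return {name: self.clock.position for name in self.clips}


if __name__ == "__main__":
    graph = SharedClockGraph(["clip A", "clip B"])
    graph.clock.seek(90.0)     # seek the graph to 1:30
    graph.clock.rate = 0.7     # slow playback to 70% of real time
    graph.clock.advance(10.0)  # ten seconds of wall-clock time pass
    print(graph.positions())
```

Because the model holds no per-clip position or rate, there is nothing to set when only one clip should seek, slow down, or pause; any change to the clock applies to every clip in the graph.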
Suppose that a user wants to create an edited video that consists entirely of streaming video currently available on the Internet (or a private sub-network or local disk), while adding his own effects, transitions, and titles, and determining exactly which subsections of the original files he would like to include in the output. Such an operation is essentially impossible today: as described above, the user would need to obtain editable, local copies of each input video, then render the output frame-by-frame using a nonlinear video editor, and finally, compress it and re-stream it for delivery to his audience. Even if the compositing features of the existing VMR9 were leveraged to provide simple alpha blending, movement effects, and primitive transitions, the input videos would all still play on the same stream clock and thus the user would not have control over the timelines of the input videos with respect to the output video.