Historically, video editing systems were clumsy and complicated, requiring many steps to derive a composite video from a set of source media. Film editing systems, such as Moviolas, physically spliced film together in a destructive manner. Tape-based video editing systems required at least two sources and a destination video player, and complex hardware to keep the systems synchronized. In both these systems, once a section of video was edited, it was very difficult to go back and make further changes.
Modern, computer-based video editing systems, such as Avid®, iMovie®, and Premiere®, allow users to easily undo and modify editing decisions, but still require human intervention for each decision. A user must review the source material (e.g., any media content that is available to create a composite video) and select start and end times in each source clip (often called “in” and “out” points) to create a trimmed clip (e.g., a portion of a media item). Then, the user must select the time in the final composite video at which the trimmed clip should be played. Optionally, the user may select effects, such as cross-dissolves, color changes, titles and so on, to apply to either the source media or the final composite video. Once these decisions are made, the video editing system can produce an Edit Decision List, or EDL, which specifies these decisions. “Offline” systems, such as Avid®, can produce a preview of the video from the EDL and low-quality versions of the source material. The EDL can then be used to “render” a full-quality version of the video. “Online” systems can produce a full-quality version of the video immediately. This final, full-quality version, or composite video, is the video created by combining the trimmed clips.
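The editing decisions described above can be pictured as a simple list of records. The following is a minimal sketch for illustration only, not the format used by any particular editing system; the class name, field names, and file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditDecision:
    """One entry of a hypothetical, simplified Edit Decision List."""
    source: str         # identifier of the source media item (hypothetical)
    in_point: float     # "in" point within the source clip, in seconds
    out_point: float    # "out" point within the source clip, in seconds
    record_time: float  # time in the composite at which the trimmed clip plays

    @property
    def duration(self) -> float:
        # Length of the trimmed clip in seconds.
        return self.out_point - self.in_point


def composite_duration(edl: List[EditDecision]) -> float:
    """Return the time at which the last trimmed clip ends in the composite."""
    return max((d.record_time + d.duration for d in edl), default=0.0)


# Two trimmed clips placed back to back in the composite video.
edl = [
    EditDecision("clip_a.mov", in_point=2.0, out_point=5.0, record_time=0.0),
    EditDecision("clip_b.mov", in_point=10.0, out_point=12.5, record_time=3.0),
]
total = composite_duration(edl)
```

Here the first trimmed clip lasts 3.0 seconds and the second starts where the first ends, so `total` is 5.5 seconds. Effects such as cross-dissolves are omitted from this sketch.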
Music Videos
Music videos are videos designed to accompany a song or other musical content. They are generally designed to enhance the experience of listening to a song. The process of making music videos is an art with different requirements from those of a standard video. Specific planning must be performed while creating the source media to ensure that sufficient timing information exists to allow the editor to create a composite video that contains elements that properly correspond to the musical content. While a great many techniques exist to accomplish this task, most music videos are created using the following process:
1. A song is selected.
2. The song is played back while video is recorded, and performers in the video time their actions to the music (e.g., by dancing to the beat or lip-syncing). In some cases, the video recording system is carefully synchronized to the audio using a variety of complex technologies, including time-code (e.g., SMPTE), blackburst, and so on, so that differences in playback and recording speeds do not cause the audio and video to drift during playback later. In practice, the speed of audio and video may be close enough that they do not need to be synchronized.
3. Timing information is recorded so that the video and song can be synchronized later. This is often done with a standard film slate.
4. The editor synchronizes all the video and audio in editing software using the recorded timing information. The video and audio are considered synchronized when they are played back substantially simultaneously, or are scheduled to be played back simultaneously.
5. The editor proceeds as with a standard video, taking care to maintain the synchronization between the audio and video.
Steps 2 and 3 may be repeated to create additional source video.
In addition to ensuring that the video and audio remain synchronized, the editor must take care to ensure that the edit points, such as transitions from one video to another, and visual effects, occur at musically relevant times. This takes skill, experience, and trial and error. In practice, it is not as simple as creating edits at the downbeats: edits in modern music videos occur at times that are musically relevant, but usually not at the main downbeats.
Methods have been proposed for analyzing music to determine musically relevant times automatically. These methods include “onset” detection, such as the onset detection used as a first step in beat detection and audio fingerprinting algorithms. However, existing techniques do not find times that correspond to edit locations in music videos. This may be for several reasons:
1. By design, many onset detection algorithms are optimized to find downbeats, which generally do not correspond to the times at which a professional video editor would make an edit.
2. Many onset detection algorithms do not perform well unless a parameter, such as a threshold, is set manually.
3. Many onset detection algorithms do not perform well with modern commercial recordings, which typically have very low dynamic range, or with a wide variety of source material.
For example, U.S. Pat. No. 6,704,671 to Umminger, III teaches determining sonic events within an audio signal. In order to determine these events, the volume of the audio signal is tracked and a determination is made based on a rate of change of the volume. However, the method utilized by Umminger is ill-suited for editing videos because changes in volume alone will not pick up all relevant onsets.
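For illustration only, the general idea of detecting events from the rate of change of volume can be sketched as follows. This is a generic sketch and not the method of the cited patent; the function names, the non-overlapping RMS frames, and the fixed, manually chosen threshold are all assumptions.

```python
import math
from typing import List, Sequence


def rms_envelope(samples: Sequence[float], frame_size: int) -> List[float]:
    """RMS volume of consecutive, non-overlapping frames of the signal."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        envelope.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return envelope


def volume_onsets(samples: Sequence[float], frame_size: int,
                  threshold: float) -> List[int]:
    """Indices of frames whose volume rises by more than `threshold`
    relative to the previous frame (assumed, simplified criterion)."""
    envelope = rms_envelope(samples, frame_size)
    return [i for i in range(1, len(envelope))
            if envelope[i] - envelope[i - 1] > threshold]


# Toy signal: silence, a loud burst, silence, then a quieter burst.
signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8 + [0.5] * 8
loud_only = volume_onsets(signal, frame_size=4, threshold=0.6)
both = volume_onsets(signal, frame_size=4, threshold=0.3)
```

With a threshold of 0.6 only the loud burst (frame 2) is reported, while a threshold of 0.3 also reports the quieter burst (frame 6). This toy case shows how a purely volume-based criterion depends on a manually chosen parameter and can miss quieter onsets.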
Additionally, U.S. Pat. No. 8,586,847 to Ellis et al. teaches a method for fingerprinting a music sample using a single low-pass filtering technique. This technique only detects onsets in a specific frequency range. Moreover, it requires additional logic to continuously adapt to changing dynamics in the music. The technique is silent as to determining onsets in a given frame of an audio signal. Thus, the method utilized by Ellis is not applicable to video editing because frame onsets cannot be determined.
It is possible to create special effects, such as slow motion and fast motion, while still maintaining synchronization. For example, if a slow motion effect is desired, say playback at half speed, the process must be changed as follows: in step 2, the song is played back at twice the original speed, and the video is recorded at twice the required frame rate. During step 4, the editor must slow down the video so that it is synchronized with normal audio playback.
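The arithmetic behind this adjustment can be written out as a small sketch. The function name and the returned keys are hypothetical, and the generalization from the half-speed example to arbitrary speeds is an assumption made for illustration.

```python
from typing import Dict


def speed_effect_plan(target_speed: float, delivery_fps: float) -> Dict[str, float]:
    """Shooting and editing settings for a speed effect that stays in sync.

    target_speed is the apparent speed of the action in the final video
    (0.5 means half-speed slow motion); delivery_fps is the frame rate of
    the final video. Both parameter names are hypothetical.
    """
    if target_speed <= 0 or delivery_fps <= 0:
        raise ValueError("target_speed and delivery_fps must be positive")
    factor = 1.0 / target_speed
    return {
        # Step 2: play the song this many times faster during the shoot.
        "song_playback_rate": factor,
        # Step 2: record at this frame rate.
        "capture_fps": delivery_fps * factor,
        # Step 4: play the video back at this speed against normal audio.
        "edit_video_speed": target_speed,
    }


# Half-speed slow motion delivered at 24 frames per second.
plan = speed_effect_plan(target_speed=0.5, delivery_fps=24.0)
```

For the half-speed case the song is played at 2.0 times its normal speed during recording, the camera records at 48 frames per second, and the editor plays the video at 0.5 times speed, which matches the example in the text.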
In some cases, video may be used that was not created in the above manner, and is therefore not synchronized. This video is called “wild” video, or video that has no timing reference and thus no existing metadata which can be used to synchronize it with other media. When using wild video, the editor may choose to create the illusion of synchronization by changing the playback speed and/or adjusting the start time. This is a time-consuming and tedious process which requires a great deal of trial and error. Moreover, such a process only gives the illusion of synchrony, and does not actually synchronize the video.
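As a rough illustration of what the editor is adjusting by hand, suppose two visual events in a wild clip should coincide with two chosen times in the song. A single playback speed and start time that achieve this can be computed directly. This two-point fit is an assumption made for illustration; it is not a method described in the text, and all names are hypothetical.

```python
from typing import Tuple


def fit_wild_clip(video_events: Tuple[float, float],
                  audio_times: Tuple[float, float]) -> Tuple[float, float]:
    """Return (speed, start_time) so that two video events land on two
    audio times, assuming audio_time = start_time + video_time / speed."""
    v0, v1 = video_events
    a0, a1 = audio_times
    if v1 <= v0 or a1 <= a0:
        raise ValueError("events and times must be strictly increasing")
    speed = (v1 - v0) / (a1 - a0)   # playback speed applied to the clip
    start_time = a0 - v0 / speed    # where the clip starts in the song
    return speed, start_time


# Events at 1.0 s and 3.0 s in the clip should hit 10.0 s and 11.0 s in the song.
speed, start_time = fit_wild_clip((1.0, 3.0), (10.0, 11.0))
```

In this example the clip is played at 2.0 times speed starting at 9.5 seconds into the song. Only the two chosen events are aligned; everything else in the clip merely appears synchronized, which is the illusion described above.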
For example, U.S. Pat. No. 8,347,210 to Ubillos et al. teaches a method of synchronizing video with beats of an audio track. However, Ubillos requires a user to manually synchronize the video and audio and is silent as to any means for automatic detection.
In practice, the entire editing process is time consuming and often requires much trial and error, as well as manual inspection of source media to determine what is most appropriate. Selecting source media, start, end, and insertion times to produce a compelling video is both an art and a complex discipline requiring much attention to detail. Even with an experienced operator and the most sophisticated equipment it can be time consuming.
As an example, U.S. Patent Pub. No. 2015/0149906 to Toff et al. teaches creating a collaborative video from video clips derived from different users. However, Toff does not contemplate any type of automatic editing or synchronization. Moreover, the editing is accomplished manually, in that the users select the start and end times, order, and other properties of the video clips.
Additionally, it is desirable to reduce the amount of bookkeeping required to maintain synchronization between the audio and video and make it easier to synchronize wild video. The present invention addresses this issue as well.
Moreover, it is desirable to create a method and system for detecting onsets that works with a variety of music styles and recording techniques without requiring human intervention and produces results that are consistent with editing times in music videos currently being produced by humans.