When editing audio and video captured either by multiple cameras or in multiple takes of the same scene (e.g., with a single audio-video capture device), traditional media editing applications typically operate on the premise that audio portions captured at different camera angles are coextensive with the captured video and thus align at a common point in time. But this is often not the case. In practice, audio in multiple takes varies due to slight variances in delivery, volume, word usage, utterances, etc. For example, the actors can ostensibly deliver the same lines in each take, but their timing will inevitably differ somewhat. Sometimes they will actually say slightly different things as well, which varies the audio from take to take. In multiple-camera applications, the spatial arrangement of the cameras, as well as the environment, can also contribute to deviations in audio relative to some point in time. These deviations, which can be as small as a fraction of a second, can lead to two or more captured audio portions being out of synchronization as perceived, for example, by a human listener. Further, the effort to edit audio and video captured in digitized form is usually exacerbated by the amount of raw audio and video requiring editing. Specifically, editors typically expend much effort, usually manually, to search through significant amounts of content to find audio that can be synchronized for use in a final product.
FIG. 1 illustrates a multi-camera arrangement 100 for capturing video and audio of a subject 108 at different angles and positions. As shown, capture devices 102a, 102b, and 102c, which are typically cameras, are arranged at different angles A1, A2, and A3 relative to reference 110. Further, these capture devices are located at different positions P1, P2, and P3 in space relative to subject 108. In this typical multi-camera arrangement 100, these angles and positions, as well as various other factors, such as the occurrence of ambient noise 104 near capture device 102a, affect the synchronization (and/or the quality) of the audio portions as they are captured. In addition, multiple takes of the same scene, whether captured with multiple cameras or a single camera, can have inherent deviations (e.g., different rates of delivery of speech, or differing utterances that can include different spoken words) in addition to the deviations stated above.
One common technique for identifying similar video captured at capture devices 102a, 102b, and 102c is to implement time codes associated with each video (or otherwise use some sort of global synchronization signal) to synchronize both the video and audio portions. In particular, a user is usually required to manually adjust the different videos to bring their time codes into agreement. A time code normally describes the relative progression of video images in terms of hours, minutes, seconds, and frames (e.g., HH:MM:SS:FR). But a drawback to using time codes to identify similar audio (e.g., to synchronize audio) is that the user must identify different video portions to a particular frame before synchronizing the audio portions. The effort to identify similar audio portions is further hindered by the number of audio samples that are captured relative to the number of video frames. Typically, for each frame of video (e.g., at 30 frames per second), there are 1,600 samples of audio (e.g., at 48,000 samples per second). As such, audio portions for capture devices 102a, 102b, and 102c are typically synchronized based on the video portions and their time codes, which can contribute to undesired sound delays and echoing effects. Another common technique for synchronizing the audio (and the video) captured at capture devices 102a, 102b, and 102c is to use a clapper to generate a distinctive sound during the capture of the audio and video. A clapper creates an audible sound, serving as a reference sound, to synchronize audio during the capture of the audio. The clapper sound is used for editing purposes and would otherwise be discarded during editing. The time codes and clapper sounds thus require effort to ensure their removal, as they are intended for editing purposes and are distracting to an audience if time codes remain visible or clapper sounds remain audible in the final product.
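The time code arithmetic described above can be illustrated with a minimal sketch. It assumes a non-drop-frame HH:MM:SS:FR time code and integer frame and sample rates (30 frames per second and 48,000 samples per second, as in the example figures above). The function names are hypothetical and are not part of any particular editing application.

```python
def timecode_to_frames(timecode: str, fps: int = 30) -> int:
    """Convert a non-drop-frame HH:MM:SS:FR time code to an absolute frame count."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    if not 0 <= frames < fps:
        raise ValueError(f"frame field {frames} is out of range for {fps} fps")
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds * fps + frames


def samples_per_frame(sample_rate: int = 48_000, fps: int = 30) -> int:
    """Number of audio samples spanned by one video frame (assumes an exact ratio)."""
    return sample_rate // fps


def frame_to_sample_range(frame_index: int, sample_rate: int = 48_000, fps: int = 30):
    """Return the half-open [start, end) range of audio samples covered by one frame."""
    spf = samples_per_frame(sample_rate, fps)
    start = frame_index * spf
    return start, start + spf


if __name__ == "__main__":
    frame = timecode_to_frames("00:00:01:00")
    print(frame, samples_per_frame(), frame_to_sample_range(frame))
```

Because a single frame spans 1,600 samples at these rates, aligning two clips only to the nearest frame can leave the audio offset by anywhere within that range of samples, which is why frame-level time codes give only a coarse basis for audio synchronization.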
A drawback to using a clapper as noise 104 to synchronize audio is that the distance between the noise and each of capture devices 102a, 102b, and 102c can cause delays that hinder synchronization of the audio relating to subject 108.
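The scale of these distance-related delays can be estimated with a short sketch. It assumes sound travels at roughly 343 meters per second (dry air at about 20 degrees Celsius) and uses hypothetical distances between the clapper and each capture device; neither the distances nor the function names come from the arrangement of FIG. 1.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: dry air at about 20 degrees Celsius


def arrival_delay_seconds(distance_m: float) -> float:
    """Time for a sound to travel distance_m meters from its source."""
    return distance_m / SPEED_OF_SOUND_M_PER_S


def delay_in_samples(distance_m: float, sample_rate: int = 48_000) -> int:
    """Arrival delay expressed as a whole number of audio samples."""
    return round(arrival_delay_seconds(distance_m) * sample_rate)


def relative_offsets(distances_m: dict, sample_rate: int = 48_000) -> dict:
    """Per-device delay in samples, relative to the device nearest the sound source."""
    delays = {name: delay_in_samples(d, sample_rate) for name, d in distances_m.items()}
    earliest = min(delays.values())
    return {name: delay - earliest for name, delay in delays.items()}


if __name__ == "__main__":
    # Hypothetical distances (in meters) from the clapper to each capture device.
    distances = {"102a": 3.43, "102b": 6.86, "102c": 34.3}
    print(relative_offsets(distances))
```

Under these assumptions, a few meters of difference in distance shifts the clapper sound by hundreds of samples from one device to another, so the same reference sound does not mark the same instant in each recording.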
It would be desirable to provide improved computing devices and systems, software, computer programs, applications, and user interfaces that minimize one or more of the drawbacks associated with conventional techniques for identifying acoustic patterns to, for example, synchronize either audio or video, or both.
Like reference numerals refer to corresponding parts throughout the several views of the drawings. Note that most of the reference numerals include one or two left-most digits that generally identify the figure that first introduces that reference numeral.