Video recording and surveillance are becoming increasingly ubiquitous as the cost of video recording equipment and infrastructure decreases and as the size and weight of the equipment shrink to easily wearable dimensions. Recording cameras may be economically located at fixed locations (e.g., walls, ceilings, street corners, etc.), on mobile objects (e.g., vehicles, bikes, drones), or on wearable objects (e.g., helmets, glasses, augmented reality headsets, or body-worn cameras). Multiple cameras in an environment may be available as stationary installations, as well as transient portable units. Furthermore, videos may be recorded by bystanders with portable video recorders, such as smartphones. Such videos may be posted publicly on a video storage or video sharing website and may beneficially provide additional points of view.
In some applications, it is desirable to synchronize multiple video recordings and play back the multiple videos simultaneously on a single timeline, for example, to examine an event of interest such as a law enforcement action. The synchronization task, however, is difficult and is typically performed manually. To illustrate some difficulties, in an example scenario involving multiple police officers wearing body cameras having video and audio recording capabilities, it may be that not all officers are in the same location at the same time, so that their recordings lack common audio. It may also be that the cameras have such different perspectives that the recordings contain no common elements for a human operator to visually recognize as a cue for performing the synchronization. Furthermore, audio tracks from different perspectives may be so different that a human operator cannot recognize common elements, for example, if one officer is standing some distance away from another and the background noise around each officer differs due to crowd or traffic noise. There is also typically far more background noise than in a studio environment, which further complicates the alignment. In such cases, manual alignment may be practically impossible. Nevertheless, evidence videos are currently synchronized manually, with each hour of video requiring about three hours of operator time to align. Unfortunately, manual synchronization is not always accurate, and the results of such alignment have been disregarded or found unusable in some circumstances.
Some existing ways to align videos include human operators using time codes embedded in the videos. A problem with this approach is that the time codes are not necessarily accurate because the cameras are not synchronized to a common time source. Independent cameras are not tied to the same time source, in contrast to studio cameras, which are each connected to the same time source and receive a clock signal to maintain synchronization.
Another problem is that different cameras may have slightly different speed ratios (i.e., speeds or frame rates at which video is captured) depending on the quality of the camera. This is due to independently generated time bases that deviate from a nominal clock rate by a small amount, which may be on the order of 1% or less. Even this small amount presents a synchronization challenge: if the videos are played from a single point of synchronization, a deviation near 1% could cause the videos to diverge by more than 1 second after 2 minutes (1% of 120 seconds is 1.2 seconds). Human auditory perception can detect differences of less than 1 millisecond for binaural direction discernment. Thus, over an example video length of 10 minutes, or 600,000 milliseconds, synchronization divergence may be detectable with a speed ratio error of as little as 1.7 ppm (parts per million), since 1 millisecond is about 1.7 millionths of 600,000 milliseconds. In practice, differences of over 100 ppm are common, and thus speed ratio differences among different cameras present a large challenge to synchronization of videos.
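The drift figures above follow from simple proportional arithmetic, which may be sketched as follows. This is a minimal illustration only; the function names `drift_seconds` and `detectable_ppm` are hypothetical and are not part of any described system, and the sketch assumes a constant speed ratio error over the whole recording.

```python
def drift_seconds(duration_s: float, ppm_error: float) -> float:
    """Divergence (seconds) accumulated over duration_s when two recordings
    differ in speed by ppm_error parts per million (assumed constant)."""
    return duration_s * ppm_error * 1e-6


def detectable_ppm(duration_s: float, threshold_s: float) -> float:
    """Smallest speed ratio error (ppm) whose accumulated drift over
    duration_s reaches the perceptual threshold threshold_s."""
    return threshold_s / duration_s * 1e6


if __name__ == "__main__":
    # 1% (10,000 ppm) error over 2 minutes
    print(f"{drift_seconds(120, 10_000):.3f} s")
    # 1 ms threshold over a 10 minute video
    print(f"{detectable_ppm(600, 0.001):.3f} ppm")
    # 100 ppm error over a 10 minute video
    print(f"{drift_seconds(600, 100):.3f} s")
```

Under these assumptions, a 100 ppm error over 10 minutes accumulates about 60 milliseconds of drift, which is well above the 1 millisecond threshold mentioned above.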