1. Field of the Invention
This invention relates to aligning data streams (e.g., sets of visual and/or audio data). The invention particularly relates to quantized alignment (i.e., alignment at a lower temporal resolution than that of the data that is used to do the alignment) of wide-bandwidth data streams. The invention also particularly relates to selecting distinctive audio segments for cross-correlation to enable alignment of data streams including audio data.
2. Related Art
There are various situations in which it is desirable to use high resolution information to provide quantized estimates of the optimal alignment between two data streams. An example of this is using audio samples from two sets of audiovisual data to estimate how many video frames (which are obtained at a relatively low rate compared to that at which audio samples are obtained) to offset one video stream relative to another for optimal alignment of the audio and, by association, the video frames and (if applicable) the associated metadata of the two sets of audiovisual data. In such a case, since it does not make sense, from the video point of view, to talk about offsets other than in whole-frame increments, a situation exists in which the data that it is desired to use for alignment (the audio data) is of much higher resolution than the alignment information that it is desired to estimate.
An approach could be taken of cross-correlating the two data streams at the high resolution of the data (e.g., at the resolution of audio samples) and then, after finding the highest normalized correlation location, quantizing that location to a lower resolution of the data (e.g., to a multiple of the video frame period). This approach has the disadvantage of requiring more computation than would nominally be expected for the number of distinct alignment possibilities that are ultimately being considered.
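The approach just described can be sketched as follows. This is only an illustrative sketch under stated assumptions, not an implementation taken from any particular system: the function names, the 16 samples per frame, the 37-sample true offset and the white-noise test signal are all hypothetical. The correlation is evaluated directly at every sample lag, and only the winning lag is rounded to a whole number of frames.

```python
import math
import random

def normalized_xcorr(a, b, lag):
    """Normalized correlation between a[n] and b[n + lag] over their overlap."""
    start = max(0, -lag)
    stop = min(len(a), len(b) - lag)
    if stop <= start:
        return 0.0
    num = sum(a[n] * b[n + lag] for n in range(start, stop))
    ea = sum(a[n] ** 2 for n in range(start, stop))
    eb = sum(b[n + lag] ** 2 for n in range(start, stop))
    denom = math.sqrt(ea * eb)
    return num / denom if denom > 0 else 0.0

def quantized_offset(a, b, max_lag, samples_per_frame):
    """Search every sample lag, then quantize the best lag to whole frames."""
    lags = range(-max_lag, max_lag + 1)
    best_lag = max(lags, key=lambda k: normalized_xcorr(a, b, k))
    return best_lag, round(best_lag / samples_per_frame)

# Hypothetical test data: b is a copy of a delayed by 37 samples.
random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(400)]
b = [0.0] * 37 + a[:-37]

best_lag, frames = quantized_offset(a, b, max_lag=60, samples_per_frame=16)
print(best_lag, frames)  # prints: 37 2
```

In this sketch 121 lags are evaluated even though only about seven distinct frame offsets fall within the search range, which illustrates the excess computation noted above.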
Another approach would be to use the high resolution data (e.g., audio samples), but only sample the cross-correlation at a lower resolution (e.g., once every video frame period). This has the distinct disadvantage of undersampling the cross-correlation function relative to its Nyquist rate: since the cross-correlation function is not being sampled often enough, it is very likely that the optimal alignment will be missed and, instead, some other alignment selected that is far from the best choice. See A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing (Prentice Hall, 1989), for more detailed discussion of undersampled signals and aliasing.
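The undersampling problem can be illustrated with a small sketch. The signal, the 16 samples per frame and the 37-sample true offset are hypothetical values chosen for illustration, and white noise is used because its correlation peak is only about one sample wide, which is the extreme case. The correlation is evaluated only at lags that are whole multiples of the frame period, so the true peak is never examined.

```python
import random

def xcorr_at(a, b, lag):
    """Raw correlation sum of a[n] * b[n + lag] over the valid overlap."""
    start = max(0, -lag)
    stop = min(len(a), len(b) - lag)
    return sum(a[n] * b[n + lag] for n in range(start, stop))

random.seed(0)
SAMPLES_PER_FRAME = 16   # hypothetical audio samples per video frame
TRUE_OFFSET = 37         # deliberately not a multiple of the frame period

a = [random.gauss(0.0, 1.0) for _ in range(400)]
b = [0.0] * TRUE_OFFSET + a[:-TRUE_OFFSET]

# Value of the correlation at the true (sample-accurate) offset.
peak = xcorr_at(a, b, TRUE_OFFSET)

# Correlation sampled only once per frame period: -48, -32, ..., 32, 48.
coarse = {lag: xcorr_at(a, b, lag) for lag in range(-48, 49, SAMPLES_PER_FRAME)}
best_coarse_lag = max(coarse, key=coarse.get)

print("peak at true offset:", round(peak, 1))
print("best coarse lag:", best_coarse_lag,
      "value:", round(coarse[best_coarse_lag], 1))
```

Every coarse sample lands on what is effectively the noise floor of the correlation function, so the lag that happens to win is determined by chance rather than by the true alignment.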
Still another approach would be to low pass (or band pass) filter the high resolution data streams before attempting the cross-correlation. In this case, the cross-correlation function can be sampled at the lower resolution without worrying about the Nyquist rate: the low pass (or band pass) filter of the inputs into the cross-correlation function ensures that the Nyquist requirements are met. However, low pass (or band pass) filtering the input data so severely is likely to remove many of the distinctive identifying characteristics of the high resolution data streams, thus degrading the ability of the cross-correlation to produce accurate alignment. For example, if this approach is used with two audiovisual data streams, even if a “good” band is selected to pass, there are not many distinguishing features left in an audio signal that has been filtered down to a 15 Hz bandwidth (15 Hz=30 Hz/2, since sampling occurs at 30 Hz and Nyquist requires 2 samples/cycle).
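To give a sense of how severe this filtering is, the following worked arithmetic compares the retained bandwidth with the full audio bandwidth. The 30 Hz video rate is the figure used above; the 44.1 kHz audio sampling rate is an assumption introduced here purely for illustration.

```latex
\begin{align*}
B_{\text{kept}}  &= \frac{f_{\text{video}}}{2} = \frac{30\ \text{Hz}}{2} = 15\ \text{Hz},\\
B_{\text{audio}} &= \frac{f_{\text{audio}}}{2} = \frac{44\,100\ \text{Hz}}{2} = 22\,050\ \text{Hz},\\
\frac{B_{\text{kept}}}{B_{\text{audio}}} &= \frac{15}{22\,050} \approx 6.8 \times 10^{-4}.
\end{align*}
```

Under that assumption, less than one tenth of one percent of the original audio bandwidth survives the filter.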
Additionally, there are various situations in which it is desired to use a short segment from each of two long audio data streams to estimate an alignment between the two audio data streams and any associated data (e.g., video data, metadata). An example of this is using audio samples from two sets of audiovisual data to estimate how many frames to offset one video stream relative to another for optimal alignment of the audio and, by association, the video frames and (if applicable) the associated metadata of the two sets of audiovisual data. Since the amount of computation that is required for the cross-correlation varies as N log N, where N is the segment length that is being used in the cross-correlation, it is typically not desirable to use the full audio streams. Instead, it is desirable to select a short segment from one of the audio streams that is both stable (i.e., unlikely to “look different” after repeated digitization) and distinctive. (Stability can be an issue, for example, in applications in which a first digitization uses automatic gain control and a second digitization does not, so that it is necessary to be careful about picking segments with low power in the frequency bands at which the automatic gain control responds.) If these two criteria are met, a single, clear-cut correlation peak that is well localized and is well above the noise floor can be obtained.
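The N log N cost mentioned above is characteristic of a cross-correlation evaluated with a fast Fourier transform (FFT). The following is a minimal sketch of that technique, assuming a textbook recursive radix-2 FFT; the function names and the short test sequence are hypothetical, and a practical system would more likely rely on an optimized FFT library.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft_xcorr(a, b):
    """Return c with c[k] = sum_n a[n] * b[n + k]; negative lags wrap to the end."""
    size = 1
    while size < len(a) + len(b):   # zero-pad so lags do not wrap into each other
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [x.conjugate() * y for x, y in zip(fa, fb)]
    return [v.real / size for v in fft(prod, inverse=True)]

# Hypothetical short segment and a copy of it delayed by 5 samples.
a = [1.0, -2.0, 3.0, -1.0, 2.0, 0.0, 1.0, -3.0]
b = [0.0] * 5 + a

c = fft_xcorr(a, b)
best_lag = max(range(len(c)), key=lambda k: c[k])
print(best_lag)  # prints: 5
```

The three transforms dominate the cost, and each is of order N log N in the padded length, which is why a short segment is far cheaper to correlate than the full streams.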
One way to select such a short segment would be to examine the autocorrelation function over local windows. This approach has the disadvantage of being computationally expensive: it requires on the order of N log N computations for each N-length local window that is considered.