1. Field of the Invention
This invention relates to the display of audio and video data and, in particular, to variation of the apparent display rate at which the audio and video data is displayed.
2. Related Art
It is desirable to be able to vary the apparent display rate (i.e., the rate of change of the display as perceived by an observer, as opposed to the rate at which data is processed to generate the display) of a display generated from audio, video, or related audio and video data. For example, it may be desirable to increase the apparent display rate so that a quick overview of the content of the data can be obtained, or because it is desired to listen to or view the display at a faster than normal rate at which the content of the display can still be adequately digested. Alternatively, it may be desirable to slow the apparent display rate so that the display can be more carefully scrutinized, or because the content of the display can be better digested at a slower rate.
Both audio and video data can be represented in either analog or digital form. The method used to manipulate audio and/or video data to accomplish variation in the apparent display rate of a display generated from that data depends upon the form in which the data is represented. However, conventional devices enable data in one form to be easily converted to the other form (i.e., analog to digital or digital to analog), thus affording wide latitude in the use of methods to accomplish display rate variation, regardless of the form in which the data originally exists.
The apparent display rate of an audio display or a video display can be increased or decreased by deleting specified data from, or adding specified data to (e.g., repeating certain data), respectively, a corresponding set of digital audio data or digital video data that represents the content of the display. Previously, such variation of the apparent display rate of either an audio display or a video display has been accomplished using one of a variety of techniques. For example, the apparent display rate of an audio display represented by a set of digital audio data has been varied by using the synchronized overlap add (SOLA) method (discussed in more detail below) to appropriately modify an original set of digital audio data to produce a modified set of digital audio data from which the audio display is generated.
Often, a set of audio data is related to a particular set of video data and the two are used together to generate an audiovisual display, such as occurs, for example, in television broadcasts, motion pictures or computer multimedia displays. When the apparent display rate of an audiovisual display is varied, the audio display and video display must be synchronized to maintain temporal correspondence between the content of the audio and video displays. (Alternatively, the audio display can be eliminated altogether, thus obviating the need to maintain synchronization; however, the content of the audio display is lost.)
Previously, the apparent display rate of an audiovisual display has been varied by deleting or repeating video data (e.g., video frames) in a uniform manner, as appropriate, and deleting or repeating audio data in a uniform manner that corresponds to the treatment of the video data (e.g., if the apparent display rate of the video display is speeded up to 2 times the original display rate by, for example, eliminating every other video frame, then the audio display is likewise speeded up by eliminating every other audio sample or every other set of a predetermined number of audio samples). While this approach is effective in maintaining synchronization, it can cause distortion in the audio and video displays, particularly at relatively high or low apparent display rates. In particular, the audio display can be distorted so that, as the apparent display rate increases, human voices increasingly begin to manifest a "chipmunk effect," and, as the apparent display rate decreases, human voices begin to sound as though the speaker is in a stupor. Such distortion of the display is a consequence of the fact that the elimination of audio data from the original set of audio data is done mechanically, without consideration of the content of the audio data being eliminated or retained.
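By way of illustration only, the uniform approach described above can be sketched as follows. The function name `uniform_speedup` and the parameter `samples_per_frame` are hypothetical and are introduced solely for this example; the sketch assumes a fixed, whole number of audio samples per video frame. Every `factor`-th video frame is retained together with the block of audio samples that accompanies it, and nothing about the content of the data influences which data is discarded.

```python
def uniform_speedup(video_frames, audio_samples, factor, samples_per_frame):
    """Keep every `factor`-th video frame and the audio block that plays with it.

    Hypothetical helper: the selection is purely mechanical and ignores
    the content of the audio and video data.
    """
    kept_frames = video_frames[::factor]
    kept_audio = []
    for frame_index in range(0, len(video_frames), factor):
        start = frame_index * samples_per_frame
        kept_audio.extend(audio_samples[start:start + samples_per_frame])
    return kept_frames, kept_audio


frames = ["f0", "f1", "f2", "f3"]
audio = list(range(8))  # two audio samples per video frame
fast_frames, fast_audio = uniform_speedup(frames, audio, factor=2, samples_per_frame=2)
```

Because each retained frame keeps its own audio block, synchronization is preserved, but the audio blocks are joined without regard to what they contain.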
A better way of varying the apparent display rate of an audiovisual display is desirable. In particular, an approach that "intelligently" modifies the audio and/or video data used to generate the display based upon an evaluation of the content of the audio data and/or video data is desirable, since such an approach can reduce or eliminate distortion of the display, and, in particular, the audio display. Good synchronization between the audio and video displays should also be maintained. Additionally, the capability of varying the apparent display rate over a wide range of magnitudes is desirable. Further, preferably the variation of the apparent display rate can be accomplished automatically in a manner that produces an apparent display rate that closely tracks a specified target display rate or rates.
The invention enables the apparent display rate of an audiovisual display to be varied. The invention can cause an original set of audio data to be modified in accordance with a target display rate (which can be a single target display rate or a sequence of target display rates, as discussed further below) based upon an evaluation of the content of the audio data set, then cause a related original set of video data to be modified to conform to the modifications made to the original audio data set such that the modified audio and video data sets (and, thus, the displays produced therefrom) are synchronized. When the modified audio and video data sets so produced are used to generate an audiovisual display, the audiovisual display has an apparent display rate (or rates) that approximates the target display rate (or rates). Ensuring that the modified audio and video data sets are synchronized minimizes or eliminates the dissonance (e.g., a temporal mismatch between spoken words in the audio display and the corresponding movement of the speaker's lips in the video display) that would otherwise be experienced if the audio and video displays were not synchronized. Further, modifying the original audio data set directly, based upon an evaluation of the content of the audio data, to produce variation in the apparent display rate of the audiovisual display is advantageous in that it can enable minimization or elimination of artifacts (e.g., pitch doubling, pops and clicks) in the audio display. Preferably, the original audio data set is modified in a manner that produces a modified audio data set that can be used to generate an audio display having little or no distortion (e.g., there is a reduction or elimination of the tendency of human voices to sound like chipmunks when the apparent display rate is increased above a normal display rate or sound stupefied when the apparent display rate is decreased below a normal display rate).
Generally, in accordance with the invention, a target display rate (and, thus, typically, the apparent display rate) can be faster or slower than a normal display rate at which an audiovisual display system generates an audiovisual display from the original sets of audio and video data. In particular, as will be better appreciated from the description below, the methods used to produce the modified audio data set enable a wide range of apparent display rates to be produced without introducing an unacceptable amount of distortion into the audiovisual display (in particular, the audio display).
In one embodiment of the invention, the apparent display rate of an audiovisual display can be varied from a normal display rate at which an audiovisual display system generates the audiovisual display from an original set of audio data and a related original set of video data by: i) defining a correspondence between the original set of audio data and the original set of video data; ii) determining a target display rate (which can, in fact, be a sequence of target display rates) for the audiovisual display; iii) creating a modified set of audio data, based upon the target display rate and an evaluation of the content of the original set of audio data, that corresponds to the original set of audio data; and iv) creating a modified set of video data, based upon the modified set of audio data, the correspondence between the modified set of audio data and the original set of audio data, and the correspondence between the original set of audio data and the original set of video data.
A target display rate can be established "manually" by a user instruction (i.e., by specification of a nominal target display rate by the user). Alternatively, a target display rate can be established automatically, without user input, based upon analysis of the audiovisual data. Or, a target display rate can be established by automatically modifying a user-specified nominal target display rate based upon analysis of the audiovisual data. As indicated above, when a nominal target display rate is specified by a user, a single target display rate can be specified for the entire audiovisual display, or a series of target display rates, each corresponding to a portion of the audiovisual display, can be specified. Likewise, a single target display rate or a series of target display rates can be automatically established (either "from scratch" or based upon an initially specified nominal target display rate or rates) in accordance with the invention. Moreover, as will be better appreciated from the description below, the invention enables a user to vary a nominal target display rate in real time as the audiovisual display is being generated.
Any appropriate method of automatically determining a target display rate, or automatically modifying a nominal target display rate, can be used. Such automatic determination or modification of the target display rate can be accomplished by evaluating the original set of audio data, the original set of video data, or both. Moreover, the target display rate can be established automatically by multiple evaluations of the audio and/or video data sets. The audio data set can be evaluated, for example, to determine the stress with which spoken portions of the audio data are uttered (by, for example, computing an energy term for the spoken portions), with the target display rate being based upon the relative stresses of the spoken portions of the audio data. Or, the audio data set can be evaluated to determine the speed with which spoken portions of the audio data are uttered (by, for example, ascertaining spectral changes in the spoken portions), with the target display rate being based upon the relative speeds of the spoken portions of the audio data. Or, both the stress and speed with which spoken portions of the audio data set are uttered can be determined and combined to produce audio tension values for the spoken portions, the target display rate being based upon the audio tension values of the spoken portions. The video data set can be evaluated, for example, to determine the relative rate of change of the video data along various population-based dimensions (described in more detail below), with the target display rate being based upon that evaluation. Or, the video data set can be evaluated by ascertaining portions of the corresponding video image that change quickly, as well as the frequency with which such quick changes occur, and basing the target display rate on the occurrence and frequency of such quick changes. Or, the video data set can be evaluated by tracking the motion of objects within the corresponding video image, and basing the target display rate on the appearance of new objects in the video image.
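As a minimal sketch of the stress-based evaluation mentioned above, an energy term can be computed for each portion of the audio data and used to modify a nominal target display rate. The function names `segment_energies` and `target_rates`, the parameter `min_rate`, and the linear mapping are hypothetical. In particular, the direction of the mapping (more heavily stressed portions are displayed closer to `min_rate`, less stressed portions closer to the nominal rate) is an assumption made for this example and is not stated in the description above.

```python
def segment_energies(samples, segment_len):
    """Mean-square energy of each consecutive block of `segment_len` samples."""
    energies = []
    for start in range(0, len(samples), segment_len):
        block = samples[start:start + segment_len]
        energies.append(sum(s * s for s in block) / len(block))
    return energies


def target_rates(energies, nominal_rate, min_rate=1.0):
    """Map relative energy (stress) to a per-segment target display rate.

    Assumption for illustration: the most stressed segment is displayed at
    `min_rate`, and a silent segment is displayed at `nominal_rate`.
    """
    peak = max(energies)
    if peak == 0:
        return [nominal_rate] * len(energies)
    rates = []
    for energy in energies:
        stress = energy / peak  # relative stress in the range [0, 1]
        rates.append(nominal_rate - stress * (nominal_rate - min_rate))
    return rates


quiet_then_loud = [0.1] * 4 + [1.0] * 4
energies = segment_energies(quiet_then_loud, segment_len=4)
rates = target_rates(energies, nominal_rate=2.0)
```

Here the quiet portion receives a target display rate close to the nominal rate of 2.0, while the loud portion is held at the normal rate of 1.0.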
The modified set of audio data can be created based upon the magnitude of the target display rate and an analysis of the content of the audio data. For example, the modified set of audio data can be created by: i) dividing the original set of audio data into a plurality of segments, each segment representing a contiguous portion of the set of audio data that occurs during a specified duration of time, each segment being adjacent to one or two other segments such that there are no gaps between segments and adjacent segments do not overlap; ii) overlapping an end portion of a first segment with an adjacent end portion of a second segment that is adjacent to the first segment (the overlap can be negative, as described in more detail below); iii) identifying as part of the modified set of audio data the audio data from the first segment that is not part of the overlapped end portion of the first segment; iv) blending the data of the corresponding overlapped end portions; and v) determining whether there are additional segments in the original set of audio data that have not been overlapped with an adjacent segment, wherein if there are additional segments, the additional segments are processed in accordance with the description above (a new first segment being created from the blended data and the non-overlapped data from the previous second segment), and if there are not additional segments, the blended data and the non-overlapped data from the second segment are included as part of the modified audio data set.
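Steps i) through v) above can be sketched as follows for the case of a target display rate greater than the normal display rate (i.e., a positive overlap). This is an illustrative sketch only: the function names, the linear cross-fade used for blending, the normalized correlation used to choose the overlap, and the formula relating the nominal overlap to the target display rate are all assumptions introduced for this example, and negative overlaps are not handled.

```python
def best_overlap(first, second, nominal, search):
    """Choose the overlap near `nominal` whose overlapped end portions match best."""
    limit = min(len(first), len(second))
    hi = min(limit, nominal + search)
    lo = min(max(1, nominal - search), hi)
    best, best_score = lo, float("-inf")
    for n in range(lo, hi + 1):
        tail, head = first[-n:], second[:n]
        dot = sum(a * b for a, b in zip(tail, head))
        norm = (sum(a * a for a in tail) * sum(b * b for b in head)) ** 0.5
        score = dot / norm if norm else 0.0
        if score > best_score:
            best, best_score = n, score
    return best


def crossfade(tail, head):
    """Blend two equal-length end portions with a linear fade from tail to head."""
    n = len(tail)
    weights = [(i + 1) / (n + 1) for i in range(n)]
    return [t * (1 - w) + h * w for t, h, w in zip(tail, head, weights)]


def speed_up_audio(samples, segment_len, rate, search=0):
    # i) divide the audio data into contiguous, non-overlapping segments
    segments = [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]
    if not segments:
        return []
    # Assumed relation: each segment should contribute about segment_len / rate samples.
    nominal = round(segment_len * (1 - 1 / rate))
    output = []
    first = segments[0]
    for second in segments[1:]:
        # ii) overlap the end of the first segment with the start of the second
        n = best_overlap(first, second, nominal, search)
        # iii) the non-overlapped part of the first segment joins the output
        output.extend(first[:-n])
        # iv) blend the overlapped end portions
        blended = crossfade(first[-n:], second[:n])
        # v) the blended data plus the rest of the second segment is the new first segment
        first = blended + second[n:]
    # No segments remain: the final blended and non-overlapped data joins the output.
    output.extend(first)
    return output


shortened = speed_up_audio([1.0] * 16, segment_len=4, rate=2.0)
```

Note that the loop needs only the next segment of the original audio data at each step, so the same structure can operate on data as it arrives.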
The modified set of video data can be created by: i) establishing a correspondence between the modified audio data set and the original video data set, based upon a correspondence between the modified audio data set and the original audio data set and a correspondence between the original audio data set and the original video data set; ii) grouping the audio data of the modified audio data set into audio segments having the same amount of audio data as found in audio segments of the original audio data set; iii) for each of the audio segments of the modified audio data set, identifying one or more partial or complete subunits of video data from the original video data set that correspond to audio data in the audio segment of the modified audio data set, based upon the correspondence between the modified audio data set and the original video data set; and iv) modifying the video frames in the original video data set as necessary to produce the modified video data set so that there is a one-to-one correspondence between audio segments of the modified audio data set and video frames of the modified video data set. The modified set of video data can be created by eliminating data from the original video data set, adding data to the original video data set, blending data from the original video data set, and/or synthesizing data based on the data in the original video data set.
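One simple way to carry out steps i) through iv) above, eliminating or repeating video frames without blending or synthesis, is sketched below. The sketch assumes that a list `source_index` is available giving, for each sample of the modified audio data set, the index of the original audio sample to which it corresponds, and that each original video frame accompanies a fixed number (`samples_per_frame`) of original audio samples. These names, and the rule of choosing the original frame that covers most of each audio segment, are hypothetical choices made for this example.

```python
def select_video_frames(source_index, samples_per_frame):
    """Return one original video frame index per audio segment of the modified set."""
    chosen = []
    for start in range(0, len(source_index), samples_per_frame):
        # ii) group the modified audio data into frame-sized audio segments
        segment = source_index[start:start + samples_per_frame]
        # iii) count how many samples of the segment map to each original frame
        votes = {}
        for original_sample in segment:
            frame = original_sample // samples_per_frame
            votes[frame] = votes.get(frame, 0) + 1
        # iv) keep the frame covering most of the segment (earliest frame on a tie)
        chosen.append(max(votes, key=lambda f: (votes[f], -f)))
    return chosen


# Modified audio whose middle segment draws mostly on original frame 2.
mapping = [0, 1, 2, 3, 7, 8, 9, 10, 12, 13, 14, 15]
frame_indices = select_video_frames(mapping, samples_per_frame=4)
```

In this example original frame 1 is eliminated, so the modified video data set contains exactly one video frame for each audio segment of the modified audio data set.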
The modified sets of audio and video data can be stored for later use in generating an audiovisual display, or they can be used immediately to generate an audiovisual display. In particular, in the latter case, the invention can be used to generate an audiovisual display in which the apparent display rate of the display can be varied in real-time. Such real-time variation of the apparent display rate is possible since the method of modifying the audio data set described above does not require knowledge of the audio data of the original audio data set far into the future to enable production of a modified audio data set, but, rather, only the audio data comprising a next segment of the original audio data set. Further, since the calculations for determining modified audio and video data can be done just prior to generating a display from that data, the calculations can be done based on a very recently determined (e.g., specified in real time by a user) target display rate. Moreover, the quantity of calculations required by a method of the invention can be performed by current processing devices sufficiently quickly to enable generation of a real-time display from the modified audio and video data.