1. Field of the Invention
The present invention relates generally to editing audiovisual files. More particularly, the invention relates to methods and apparatuses for keeping the audio component of a bit stream substantially synchronized with the video component after editing operations are performed.
2. Description of the Related Art
MPEG (Moving Picture Experts Group) is a standard promulgated by the International Organization for Standardization (ISO) to provide a syntax for compactly representing digital video and audio signals. The syntax generally requires that a minimum number of rules be followed when bit streams are encoded so that a receiver of the encoded bit stream may unambiguously decode the received bit stream. As is well known to those skilled in the art, a bit stream will also include a "system" component in addition to the video and audio components. Generally speaking, the system component contains information required for combining and synchronizing each of the video and audio components into a single bit stream.
Since the initial unveiling of the first MPEG standard, entitled MPEG-1, a second MPEG standard known as MPEG-2 has been introduced. In general, MPEG-2 provides an improved syntax to enable a more efficient representation of broadcast video. By way of background, MPEG-1 was optimized to handle data at a rate of 1.5 Mbits/second and reconstruct about 30 video frames per second, with each frame having a resolution of 352 pixels by 240 lines (NTSC), or about 25 video frames per second, with each frame having a resolution of 352 pixels by 288 lines (PAL). Therefore, decoded MPEG-1 video generally approximates the perceptual quality of consumer video tapes (VHS). In comparison, MPEG-2 is designed to represent CCIR 601-resolution video at data rates of 4.0 to 8.0 Mbits/second and provide a frame resolution of 720 pixels by 480 lines (NTSC), or 720 pixels by 576 lines (PAL). For simplicity, except where distinctions between the two versions of the MPEG standard exist, the term "MPEG" will be used to reference video and audio encoding and decoding algorithms promulgated in current as well as future versions.
Typically, a decoding process begins when an MPEG bit stream containing video, audio and system information is demultiplexed by a system decoder that is responsible for producing separate encoded video and audio bit streams that may subsequently be decoded by a video decoder and an audio decoder. Attention is now directed at the structure of an encoded video bit stream. Generally, an encoded MPEG video bit stream is organized in a distinguishable data structure hierarchy. At the highest level in the hierarchy is a "video sequence," which may include a sequence header, one or more groups of pictures (GOPs) and an end-of-sequence code. GOPs are subsets of video sequences, and each GOP may include one or more pictures. As will be described below, GOPs are of particular importance because they allow access to a defined segment of a video sequence, although in certain cases, a GOP may be quite large.
Each picture within a GOP is then partitioned into several horizontal "slices" defined from left to right and top to bottom. The individual slices are in turn composed of one or more macroblocks which identify a square area of 16-by-16 pixels. As described in the MPEG standard, a macroblock includes four 8-by-8 pixel "luminance" components, and two 8-by-8 "chrominance" components (i.e., chroma red and chroma blue).
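By way of illustration only, the macroblock arithmetic described above may be sketched as follows. The function names are hypothetical, and rounding partial macroblocks up to a whole macroblock is an assumption of this sketch; the frame sizes are those cited earlier in this background.

```python
MACROBLOCK_SIDE = 16  # a macroblock covers a 16-by-16 pixel area


def macroblocks_per_picture(width, height):
    """Count the macroblocks needed to tile a picture.

    Partial macroblocks at the right and bottom edges are rounded up
    (an assumption of this sketch).
    """
    columns = -(-width // MACROBLOCK_SIDE)   # ceiling division
    rows = -(-height // MACROBLOCK_SIDE)
    return columns * rows


def samples_per_macroblock():
    """Four 8-by-8 luminance blocks plus two 8-by-8 chrominance blocks."""
    return 4 * 8 * 8 + 2 * 8 * 8


print(macroblocks_per_picture(352, 240))  # 22 columns by 15 rows
print(macroblocks_per_picture(720, 480))  # 45 columns by 30 rows
print(samples_per_macroblock())
```

Under these assumptions, a 352-by-240 picture is tiled by 330 macroblocks, and each macroblock carries 384 samples.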
Because a large degree of pixel information is similar or identical between pictures within a GOP, the MPEG standard takes particular advantage of this temporal redundancy and represents selected pictures in terms of their differences from a particular reference picture. The MPEG standard defines three general types of encoded picture frames. The first type of frame is an intra-frame (I-frame). An I-frame is encoded using information contained in the frame itself and is not dependent on information contained in previous or future frames. As a result, an I-frame generally defines the starting point of a particular GOP in a sequence of frames.
A second type of frame is a predicted-frame (P-frame). P-frames are generally encoded using information contained in a previous I or P frame. As is well known in the art, P frames are known as forward predicted frames. The third type of frame is a bi-directional-frame (B-frame). B-frames are encoded based on information contained in both past and future frames, and are therefore known as bi-directionally predicted frames. Therefore, B-frames provide more compression than both I-frames and P-frames, and P-frames provide more compression than I-frames. Although the MPEG standard does not require that a particular number of B-frames be arranged between any I or P frames, most encoders select two B-frames between I and P frames. This design choice is based on factors such as the amount of memory in the encoder and the characteristics and definition needed for the material being coded.
Although the MPEG standard defines a convenient syntax for compactly encoding video and audio bit streams, audio synchronization difficulties arise when a copied audiovisual bit stream segment is joined with another copied audiovisual bit stream segment. The synchronization problem is partially due to the fact that audio frames and video frames rarely have a one-to-one correlation. Therefore, when a segment of video frames is identified for copying from a file, the identified video frames will not have a pre-determined number of audio frames that correspond to the identified video frames.
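The lack of a one-to-one correlation may be seen from the frame durations alone. The following is a minimal sketch under assumed parameters that do not come from this document: an audio frame of 1152 samples at a 44.1 kHz sampling rate, and NTSC video at 30000/1001 frames per second. The function name is hypothetical.

```python
from fractions import Fraction

# Assumed frame durations in seconds (illustrative values only).
AUDIO_FRAME = Fraction(1152, 44100)   # 1152 samples at 44.1 kHz
VIDEO_FRAME = Fraction(1001, 30000)   # NTSC, about 29.97 frames per second


def audio_frames_spanned(n_video_frames):
    """Exact number of audio frames covering n video frames."""
    return (n_video_frames * VIDEO_FRAME) / AUDIO_FRAME


ratio = audio_frames_spanned(1)
print(float(ratio))            # roughly 1.277 audio frames per video frame
print(ratio.denominator != 1)  # True: not a whole number of audio frames
```

Under these assumptions, a run of video frames almost never ends on an audio frame boundary, so any cut made on a video frame boundary generally falls partway through an audio frame.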
Consequently, when a segment of video is copied from a file and then subsequently joined to another copied segment, the audio component of the copied segment may not be synchronized with the proper video frame. Once the video and audio frames are no longer synchronized, an "error," expressed as the number or percentage of an audio frame by which the video and audio frames fail to be synchronized, is introduced into the resulting bit stream. By way of example, the synchronization error introduced when two bit stream segments are joined may range from as little as a fraction of an audio frame to as much as a few audio frames.
Although the error associated with joining only two bit stream segments may in certain cases only be a few audio frames, when a multiplicity of bit stream segments are joined in a more sophisticated editing task, the errors for each joined segment are summed. Therefore, the resulting error may be quite large, and the resulting audio frames may be severely un-synchronized and fail to make sense upon playback. Further, un-synchronized audio and video bit streams typically produce audio discontinuities at the bit stream locations where segments are joined. This problem is commonly described as a "popping" sound. Thus, as discontinuities are introduced to joined bit stream segments, discomforting popping sounds are introduced causing the resulting audio stream to not only be un-synchronized, but also intolerable.
In view of the foregoing, what is needed are methods and apparatuses for editing audio and video bit streams while ensuring that the audio component remains substantially synchronized with the video component.
To achieve the foregoing in accordance with the purpose of the present invention, methods and apparatuses for maintaining edited audiovisual files substantially synchronized during editing operations performed through the use of an editing engine are disclosed. Preferably, the editing engine performs editing operations in two passes through an edit list. In one embodiment, the edit list may contain a number of copying requests instructing the editing engine to create a copy operator for copying segments of audio and video from certain files. To initiate copy operations, the editing engine preferably performs a first pass in which each copied audiovisual segment has an audio component that is preferably longer in time than its video component.
In another embodiment, a predetermined number of audio frames at each end of the copied audio segment may be decoded and re-encoded to generate glue frames which may provide, e.g., sound fading and blending effects. Once the copied segments of audio are processed in the first pass, the editing engine will initiate a second pass through the edit list to stitch together (i.e., join) the processed audio and video segments into a single file. Advantageously, during the stitching operation, frames at the ends of each copied audio segment (i.e., tab-in and tab-out audio frames) may be dropped or retained in order to maintain the audio component in the newly created audiovisual file substantially synchronized with the video component. Therefore, the newly created file is advantageously made up of one or more audiovisual segments, each of which preferably has an audio component that is no more than about half an audio frame in error.
In yet another embodiment, a method for copying a segment from an audiovisual file having a multiplicity of audio frames and a multiplicity of video frames is disclosed. In a first step, a mark-in location in a video file is selected to correspond to a first video frame in the segment such that the first video frame has an associated start time. Next, a mark-out location in the video file is selected to correspond to a last video frame in the segment such that the last video frame has an associated end time. Once the mark-in video frame is selected, a first audio frame having a first audio frame start time that is at least as early as the first video frame start time is designated as an initial audio frame. A second audio frame having a second audio frame start time that is at least as late as the last video frame end time is designated as the last audio frame. The audiovisual file is copied to include a video portion extending from the first video frame to the last video frame and an audio portion extending from the initial audio frame to the last audio frame. In this manner, the audio portion of the copied segment may preferably be longer than the video portion of the copied segment.
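The selection of the initial and last audio frames described above may be sketched as follows. This is an illustration only, not the disclosed implementation. It assumes that audio frame i starts at time i multiplied by the audio frame duration, and that the qualifying audio frame closest to each video boundary is chosen. The function name and the frame durations (1152 samples at 44.1 kHz, NTSC video at 30000/1001 frames per second) are hypothetical.

```python
import math
from fractions import Fraction


def select_audio_range(video_start, video_end, audio_frame_dur):
    """Return (initial, last) audio frame indices for a copied segment.

    Assumes audio frame i starts at i * audio_frame_dur (sketch assumption).
    """
    # Closest audio frame whose start time is at least as early as the
    # start time of the first video frame.
    initial = math.floor(video_start / audio_frame_dur)
    # Closest audio frame whose start time is at least as late as the
    # end time of the last video frame, following the wording above.
    last = math.ceil(video_end / audio_frame_dur)
    return initial, last


AUDIO = Fraction(1152, 44100)  # assumed audio frame duration, seconds
VIDEO = Fraction(1001, 30000)  # assumed video frame duration, seconds

# Copy video frames 10 through 39 (mark-in at frame 10, mark-out at frame 39).
initial, last = select_audio_range(10 * VIDEO, 40 * VIDEO, AUDIO)
audio_len = (last - initial + 1) * AUDIO
video_len = 30 * VIDEO
print(initial, last, audio_len > video_len)  # prints: 12 52 True
```

In this example the copied audio portion spans 41 audio frames and is longer than the 30 copied video frames, which leaves tab-in and tab-out audio frames available at each end of the segment.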
In still another embodiment, a method of stitching a first and second audiovisual segment together is disclosed. In this embodiment, each audiovisual segment has a multiplicity of audio frames including a first audio frame, a second audio frame that sequentially follows the first audio frame and a last audio frame. The audiovisual segment further includes a multiplicity of video frames having a first video frame and a last video frame. The method includes the step of aligning an initial audio frame in the first audiovisual segment with the first video frame in the first audiovisual segment. The first audio frame from the first audiovisual segment is designated as the initial audio frame when a tab error associated with the first audio frame from the first audiovisual segment is less than about half a frame. On the other hand, the second audio frame from the first audiovisual segment is designated as the initial audio frame when a tab error associated with the first audio frame from the first audiovisual segment is greater than half a frame. The method further includes stitching the first and second audiovisual segments together.
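The designation of the initial audio frame may be sketched as follows. The text above does not define how the tab error is computed; this sketch assumes it is the amount by which the first audio frame starts ahead of the first video frame, expressed as a fraction of one audio frame. Treating an error of exactly half a frame as "greater than" is likewise an arbitrary choice of this sketch, and both function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def tab_in_error(video_start, first_audio_start, audio_frame_dur):
    """Assumed tab error: lead of the first audio frame, in audio frames."""
    return (video_start - first_audio_start) / audio_frame_dur


def choose_initial_audio_frame(tab_error):
    """Return 0 to keep the first audio frame, 1 to begin at the second."""
    return 0 if tab_error < HALF else 1


AUDIO = Fraction(1152, 44100)  # assumed audio frame duration, seconds
VIDEO = Fraction(1001, 30000)  # assumed video frame duration, seconds

# First video frame starts at 10 video frames; first audio frame is frame 12.
error = tab_in_error(10 * VIDEO, 12 * AUDIO, AUDIO)
print(float(error))                       # about 0.77 of an audio frame
print(choose_initial_audio_frame(error))  # 1: begin at the second audio frame
```

With a tab error of about 0.77 of an audio frame, beginning at the second audio frame leaves a residual misalignment of about 0.23 of a frame, which is smaller than the error that would result from keeping the first audio frame.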
In another embodiment, a method of joining a first and a second audiovisual segment together while maintaining substantial audio to video synchronization is disclosed. Each audiovisual segment has a multiplicity of audio frames including a first audio frame, a second audio frame that sequentially follows the first audio frame and a last audio frame, and a multiplicity of video frames including a first video frame and a last video frame. In this embodiment, the method includes a step of aligning an initial audio frame in the first audiovisual segment with the first video frame in the first audiovisual segment. Preferably, the first audio frame from the first audiovisual segment is designated as the initial audio frame when a tab error associated with the first audio frame from the first audiovisual segment is less than about half an audio frame. Further, the second audio frame from the first audiovisual segment is designated as the initial audio frame when a tab error associated with the first audio frame from the first audiovisual segment is greater than about half an audio frame. The first audio frame from the first audiovisual segment is dropped when the second audio frame from the first audiovisual segment is designated as the initial audio frame. The method further includes determining whether a cumulative error associated with the last audio frame in the first segment exceeds about half a frame, and dropping the last audio frame in the first segment when it is determined that the cumulative error associated with the last audio frame exceeds about half a frame. The method then includes determining whether a cumulative error associated with the first audio frame in the second segment exceeds about half a frame, and dropping the first audio frame in the second segment when it is determined that the cumulative error associated with the first audio frame exceeds about half a frame.
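The drop-or-retain decisions described above may be sketched as a running tally. This is a simplified model and not the disclosed implementation. It assumes the cumulative error is the number of audio frames emitted so far minus the video duration emitted so far, measured in audio frames, and that each copied segment overhangs its video by less than one audio frame at each end. The segment tuple layout and the function name are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def stitch(segments, correct=True):
    """Return the cumulative error, in audio frames, after each segment.

    Each segment is a hypothetical tuple (n_audio, video_len, lead):
      n_audio   - number of copied audio frames
      video_len - video duration, in audio-frame units
      lead      - how far the first audio frame starts ahead of the
                  first video frame, in audio-frame units
    """
    error = Fraction(0)  # audio emitted minus video emitted so far
    history = []
    for n_audio, video_len, lead in segments:
        # Tab-in: drop the first audio frame if keeping it would leave
        # the segment start more than half a frame out of alignment.
        if correct and error + lead > HALF:
            n_audio -= 1
        # Tab-out: drop the last audio frame if the cumulative error at
        # the end of the segment would exceed half a frame.
        end_error = error + n_audio - video_len
        if correct and end_error > HALF:
            end_error -= 1
        error = end_error
        history.append(error)
    return history


# Four identical segments: 32 audio frames covering 30.65 frames of video.
segments = [(32, Fraction(613, 20), Fraction(3, 4))] * 4
print([float(e) for e in stitch(segments)])
print([float(e) for e in stitch(segments, correct=False)])
```

In this example the corrected error stays within half an audio frame after every join, whereas without correction it grows by 1.35 audio frames per segment and reaches 5.4 audio frames after four segments.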
Although the advantages are numerous, a particular advantage of this invention is that the stream error is prevented from exceeding about half an audio frame, and the video frames are substantially synchronized with the audio frames without regard to the number of segments being stitched together after successive copy operations. It should also be appreciated that if corrections were not made by dropping or retaining audio frames in the second pass as described above, the cumulative stream error would grow and propagate as additional audiovisual segments are stitched together.