1. Field of the Invention
The present invention relates generally to editing audiovisual files. More particularly, the invention relates to various methods and apparatuses for maintaining the audio component of a bit stream substantially synchronized with the video component after editing operations are performed.
2. Description of the Related Art
MPEG (Moving Picture Experts Group) is a standard promulgated by the International Organization for Standardization (ISO) to provide a syntax for compactly representing digital video and audio signals. The syntax generally requires that a minimum number of rules be followed when bit streams are encoded so that a receiver of the encoded bit stream may unambiguously decode the received bit stream. As is well known to those skilled in the art, a bit stream will also include a "system" component in addition to the video and audio components. Generally speaking, the system component contains information required for combining and synchronizing each of the video and audio components into a single bit stream.
Since the initial unveiling of the first MPEG standard, entitled MPEG-1, a second MPEG standard known as MPEG-2 was introduced. In general, MPEG-2 provided an improved syntax to enable a more efficient representation of broadcast video. By way of background, MPEG-1 was optimized to handle data at a rate of 1.5 Mbits/second and reconstruct about 30 video frames per second, with each frame having a resolution of 352 pixels by 240 lines (NTSC), or about 25 video frames per second, with each frame having a resolution of 352 pixels by 288 lines (PAL). Therefore, decoded MPEG-1 video generally approximates the perceptual quality of consumer video tapes (VHS). In comparison, MPEG-2 is designed to represent CCIR 601-resolution video at data rates of 4.0 to 8.0 Mbits/second and provide a frame resolution of 720 pixels by 480 lines (NTSC), or 720 pixels by 576 lines (PAL). For simplicity, except where distinctions between the two versions of the MPEG standard exist, the term "MPEG" will be used to reference video and audio encoding and decoding algorithms promulgated in current as well as future versions.
Typically, a decoding process begins when an MPEG bit stream containing video, audio and system information is demultiplexed by a system decoder that is responsible for producing separate encoded video and audio bit streams that may subsequently be decoded by a video decoder and an audio decoder. Attention is now directed at the structure of an encoded video bit stream. Generally, an encoded MPEG video bit stream is organized in a distinguishable data structure hierarchy. At the highest level in the hierarchy is a "video sequence," which may include a sequence header, one or more groups of pictures (GOPs) and an end-of-sequence code. GOPs are subsets of video sequences, and each GOP may include one or more pictures. As will be described below, GOPs are of particular importance because they allow access to a defined segment of a video sequence, although in certain cases, a GOP may be quite large.
Each picture within a GOP is then partitioned into several horizontal "slices" defined from left to right and top to bottom. The individual slices are in turn composed of one or more macroblocks which identify a square area of 16-by-16 pixels. As described in the MPEG standard, a macroblock includes four 8-by-8 pixel "luminance" components, and two 8-by-8 "chrominance" components (i.e., chroma red and chroma blue).
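The partitioning described above can be illustrated with a minimal sketch (not part of any MPEG reference implementation): since each macroblock covers a 16-by-16 pixel area, the macroblock count for a picture follows directly from its resolution, with dimensions padded up to a multiple of 16 as an encoder would do before partitioning.

```python
def macroblocks_per_picture(width: int, height: int) -> int:
    """Number of 16x16 macroblocks needed to cover a picture.

    Dimensions that are not multiples of 16 are rounded up, reflecting
    the padding an encoder applies before macroblock partitioning.
    """
    mb_cols = (width + 15) // 16
    mb_rows = (height + 15) // 16
    return mb_cols * mb_rows

# CCIR 601 NTSC resolution (MPEG-2): 720 x 480 -> 45 x 30 = 1350 macroblocks
print(macroblocks_per_picture(720, 480))   # 1350
# MPEG-1 NTSC resolution: 352 x 240 -> 22 x 15 = 330 macroblocks
print(macroblocks_per_picture(352, 240))   # 330
```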
Because a large degree of pixel information is similar or identical between pictures within a GOP, the MPEG standard takes particular advantage of this temporal redundancy and represents selected pictures in terms of their differences from a particular reference picture. The MPEG standard defines three general types of encoded picture frames. The first type of frame is an intra-frame (I-frame). An I-frame is encoded using information contained in the frame itself and is not dependent on information contained in previous or future frames. As a result, an I-frame generally defines the starting point of a particular GOP in a sequence of frames.
A second type of frame is a predicted-frame (P-frame). P-frames are generally encoded using information contained in a previous I or P frame. As is well known in the art, P-frames are known as forward predicted frames. The third type of frame is a bi-directional-frame (B-frame). B-frames are encoded based on information contained in both past and future frames, and are therefore known as bi-directionally predicted frames. Therefore, B-frames provide more compression than both I-frames and P-frames, and P-frames provide more compression than I-frames. Although the MPEG standard does not require that a particular number of B-frames be arranged between any I or P frames, most encoders select two B-frames between I and P frames. This design choice is based on factors such as the amount of memory in the encoder and the characteristics and definition needed for the material being coded.
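The common encoder choice of two B-frames between references yields the familiar repeating display-order pattern. A brief sketch (hypothetical helper, not drawn from the standard) generating that pattern for a given GOP size and reference spacing:

```python
def gop_pattern(gop_size: int, spacing: int) -> str:
    """Display-order frame types for one GOP.

    gop_size: total frames in the GOP (which begins with the I-frame).
    spacing:  distance between successive reference (I or P) frames;
              spacing - 1 B-frames sit between each pair of references.
    """
    types = []
    for pos in range(gop_size):
        if pos == 0:
            types.append("I")            # GOP starting point
        elif pos % spacing == 0:
            types.append("P")            # forward predicted frame
        else:
            types.append("B")            # bi-directionally predicted frame
    return "".join(types)

# Two B-frames between references (spacing of 3), 15-frame GOP:
print(gop_pattern(15, 3))   # IBBPBBPBBPBBPBB
```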
Although the MPEG standard defines a convenient syntax for compactly encoding video and audio bit streams, audio synchronization difficulties arise when a copied audiovisual bit stream segment is joined with another copied audiovisual bit stream segment. The synchronization problem is partially due to the fact that audio frames and video frames rarely have a one-to-one correlation. Therefore, when a segment of video frames is identified for copying from a file, the identified video frames will not have a pre-determined number of audio frames that correspond to the identified video frames.
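The lack of one-to-one correlation can be illustrated numerically. In the following sketch, the assumed parameters are NTSC video at 30000/1001 frames per second and MPEG audio frames of 1152 samples at a 44.1 kHz sample rate; the actual frame durations depend on the particular stream.

```python
# Assumed parameters (illustrative, not taken from any particular stream):
VIDEO_FRAME_S = 1001 / 30000   # NTSC video frame, ~33.37 ms
AUDIO_FRAME_S = 1152 / 44100   # MPEG audio frame, ~26.12 ms

def audio_frames_for(video_frames: int) -> float:
    """Audio frames spanned by a run of video frames.

    The result is almost never an integer, so a segment of video frames
    has no pre-determined whole number of corresponding audio frames.
    """
    return video_frames * VIDEO_FRAME_S / AUDIO_FRAME_S

for n in (1, 10, 100):
    print(n, round(audio_frames_for(n), 3))
```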
Consequently, when a segment of video is copied from a file and then subsequently joined to another copied segment, the audio component of the copied segment may not be synchronized with the proper video frame. Once the video and audio frames are no longer synchronized, an "error," representing the number of audio frames or the fraction of an audio frame by which the video and audio components fail to be synchronized, is introduced into the resulting bit stream. By way of example, the synchronization error introduced when two bit stream segments are joined may be as little as a fraction of an audio frame or as large as a few audio frames.
Although the error associated with joining only two bit stream segments may in certain cases be only a few audio frames, when a multiplicity of bit stream segments are joined in a more sophisticated editing task, the errors introduced at each joined segment are summed. Therefore, the resulting error may be quite large, and the resulting audio frames may be severely un-synchronized and fail to make sense upon playback. Further, un-synchronized audio and video bit streams typically produce audio discontinuities at the bit stream locations where segments are joined. This problem is commonly described as a "popping" sound. Thus, as discontinuities are introduced at joined bit stream segments, discomforting popping sounds render the resulting audio stream not only un-synchronized, but also intolerable.
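The accumulation of error across many joins can be sketched with a toy model (the residual values below are illustrative assumptions, not drawn from any standard): each cut point leaves some fraction of an audio frame unaccounted for, and joining many segments sums those fractions.

```python
def accumulated_error(residuals):
    """Running audio/video offset, in audio frames, after each join.

    residuals: per-splice synchronization error, each typically a
    fraction of one audio frame (illustrative values below).
    """
    total = 0.0
    history = []
    for r in residuals:
        total += r
        history.append(total)
    return history

# Five splices, each off by less than half an audio frame:
drift = accumulated_error([0.4, 0.3, 0.45, 0.2, 0.35])
print(round(drift[-1], 2))   # 1.7 -- already more than one full audio frame
```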
In view of the foregoing, what is needed are methods and apparatuses for editing audio and video bit streams while ensuring that the audio component remains substantially synchronized with the video component.