1. Field of the Invention
The present invention relates generally to methods for editing video files. More particularly, the invention relates to various methods and apparatuses for rapidly seeking to predetermined video frames within a multiplexed audiovisual file. In one aspect, methods and apparatuses for seeking to and accessing the data of a target frame by referencing a file byte offset are disclosed.
2. Description of the Related Art
MPEG (Moving Picture Experts Group) is a standard promulgated by the International Organization for Standardization (ISO) to provide a syntax for compactly representing digital video and audio signals. The syntax generally requires that a minimum number of rules be followed when bit streams are encoded so that a receiver of the encoded bit stream may unambiguously decode the received bit stream. As is well known to those skilled in the art, a bit stream will also include a "system" component in addition to the video and audio components. Generally speaking, the system component contains information required for combining and synchronizing each of the video and audio components into a single bit stream. Specifically, the system component allows audio/video synchronization to be realized at the decoder.
Since the initial unveiling of the first MPEG standard, entitled MPEG-1, a second MPEG standard known as MPEG-2 has been introduced. In general, MPEG-2 provides an improved syntax to enable a more efficient representation of broadcast video. By way of background, MPEG-1 was optimized to handle data at a rate of 1.5 Mbits/second and reconstruct about 30 video frames per second, with each frame having a resolution of 352 pixels by 240 lines (NTSC), or about 25 video frames per second, each frame having a resolution of 352 pixels by 288 lines (PAL). Decoded MPEG-1 video therefore generally approximates the perceptual quality of consumer video tapes (VHS). In comparison, MPEG-2 is designed to represent CCIR 601-resolution video at data rates of 4.0 to 8.0 Mbits/second and provide a frame resolution of 720 pixels by 480 lines (NTSC), or 720 pixels by 576 lines (PAL). For simplicity, except where distinctions between the two versions of the MPEG standard exist, the term "MPEG" will be used to reference video and audio encoding and decoding algorithms promulgated in current as well as future versions.
Typically, a decoding process begins when an MPEG bit stream containing video, audio and system information is demultiplexed by a system decoder that is responsible for producing separate encoded video and audio bit streams that may subsequently be decoded by a video decoder and an audio decoder. Attention is now directed to the structure of an encoded video bit stream. Generally, an encoded MPEG video bit stream is organized in a distinguishable data structure hierarchy. At the highest level in the hierarchy is a "video sequence" which may include a sequence header, one or more groups of pictures (GOPs) and an end-of-sequence code. GOPs are subsets of video sequences, and each GOP may include one or more pictures. As will be described below, GOPs are of particular importance because they allow access to a defined segment of a video sequence, although in certain cases, a GOP may be quite large.
Each picture within a GOP is then partitioned into several horizontal "slices" defined from left to right and top to bottom. The individual slices are in turn composed of one or more macroblocks which identify a square area of 16-by-16 pixels. As described in the MPEG standard, a macroblock includes four 8-by-8 pixel "luminance" components, and two 8-by-8 "chrominance" components (i.e., chroma red and chroma blue).
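By way of illustration only, the following Python sketch counts the macroblocks needed to cover a picture at the frame resolutions mentioned above. The function name macroblock_grid and the rounding-up of partial macroblocks are assumptions made for this example and are not taken from the text of the MPEG standard.

```python
import math

MACROBLOCK_SIZE = 16          # a macroblock covers 16-by-16 pixels
LUMA_BLOCKS = 4               # four 8-by-8 luminance components
CHROMA_BLOCKS = 2             # two 8-by-8 chrominance components


def macroblock_grid(width, height):
    """Return (columns, rows, total) of macroblocks covering a picture.

    Assumption: a picture whose size is not a multiple of 16 is padded
    up to the next whole macroblock.
    """
    cols = math.ceil(width / MACROBLOCK_SIZE)
    rows = math.ceil(height / MACROBLOCK_SIZE)
    return cols, rows, cols * rows


if __name__ == "__main__":
    for label, (w, h) in {
        "MPEG-1 NTSC": (352, 240),
        "MPEG-1 PAL": (352, 288),
        "MPEG-2 NTSC": (720, 480),
        "MPEG-2 PAL": (720, 576),
    }.items():
        cols, rows, total = macroblock_grid(w, h)
        blocks = total * (LUMA_BLOCKS + CHROMA_BLOCKS)
        print(f"{label}: {cols} x {rows} = {total} macroblocks, {blocks} 8-by-8 blocks")
```

Under these assumptions, a 352-by-240 picture is covered by a grid of 22 by 15 macroblocks, or 330 macroblocks in total.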
Because a large degree of pixel information is similar or identical between pictures within a GOP, the MPEG standard takes particular advantage of this temporal redundancy and represents selected pictures in terms of their differences from a particular reference picture. The MPEG standard defines three general types of encoded picture frames. The first type of frame is an intra-frame (I-frame). An I-frame is encoded using information contained in the frame itself and is not dependent on information contained in previous or future frames. As a result, an I-frame generally defines the starting point of a particular GOP in a sequence of frames.
A second type of frame is a predicted-frame (P-frame). P-frames are generally encoded using information contained in a previous I or P frame. As is well known in the art, P-frames are known as forward predicted frames. The third type of frame is a bi-directional-frame (B-frame). B-frames are encoded based on information contained in both past and future frames, and are therefore known as bi-directionally predicted frames. As a result, B-frames provide more compression than both I-frames and P-frames, and P-frames provide more compression than I-frames. Although the MPEG standard does not require that a particular number of B-frames be arranged between any I or P frames, most encoders select two B-frames between I and P frames. This design choice is based on factors such as the amount of memory in the encoder and the characteristics and definition needed for the material being coded.
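The frame-type arrangement described above may be sketched as follows in Python. This is a simplified illustration in display order only. The function names gop_pattern and references, the GOP length of 12 and the restriction of references to frames within a single GOP are all assumptions made for this example.

```python
def gop_pattern(length=12, b_between=2):
    """Build a display-order frame-type string such as 'IBBPBBPBBPBB'.

    Assumption: the GOP starts with an I-frame and places `b_between`
    B-frames between successive I or P frames.
    """
    step = b_between + 1
    types = []
    for i in range(length):
        if i == 0:
            types.append("I")
        elif i % step == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)


def references(pattern, index):
    """Return the display-order indices a frame is predicted from.

    I-frames reference nothing, P-frames reference the previous I or P
    frame, and B-frames reference the nearest I or P frame on each side.
    Only references inside `pattern` are reported; in an actual stream
    the trailing B-frames may also depend on a frame in the next GOP.
    """
    kind = pattern[index]
    if kind == "I":
        return []
    past = [i for i in range(index) if pattern[i] in "IP"]
    future = [i for i in range(index + 1, len(pattern)) if pattern[i] in "IP"]
    refs = past[-1:]
    if kind == "B":
        refs += future[:1]
    return refs


if __name__ == "__main__":
    pattern = gop_pattern()
    print(pattern)
    for i, kind in enumerate(pattern):
        print(i, kind, references(pattern, i))
```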
Although the MPEG standard defines a convenient syntax for compactly encoding video and audio bit streams, significant difficulties arise when a segment of an encoded bit stream is clipped out for use in a new bit stream. In particular, because P-frames use information from previous frames in the bit stream, and B-frames use information from both previous and future frames, clips must be performed at I-frames. That is, the clipped segment must have an I-frame as a starting frame and a P or an I frame as the final frame in the clipped segment. Performing clips at I-frames therefore avoids producing video clips having beginning and ending frames which cannot be decoded without the reference frames contained in the original bit stream.
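The constraint just described may be illustrated with the following Python sketch, which moves a requested clip to the nearest boundaries that satisfy it. The function names is_decodable_clip and snap_clip, the display-order frame-type string and the choice to move both boundaries backward are assumptions made for this example and do not describe any particular editing engine.

```python
def is_decodable_clip(types, start, end):
    """True if the clip [start, end] begins on an I-frame and ends on an
    I or P frame, as the constraint described above requires."""
    return types[start] == "I" and types[end] in "IP"


def snap_clip(types, start, end):
    """Move a requested clip to boundaries that satisfy the constraint.

    Assumption: the start moves back to the nearest I-frame at or before
    it, and the end moves back to the nearest I or P frame at or before
    it. Returns None when no such boundaries exist.
    """
    new_start = start
    while new_start >= 0 and types[new_start] != "I":
        new_start -= 1
    if new_start < 0:
        return None
    new_end = end
    while new_end >= new_start and types[new_end] not in "IP":
        new_end -= 1
    if new_end < new_start:
        return None
    return new_start, new_end


if __name__ == "__main__":
    types = "IBBPBBPBBPBBIBBPBB"
    print(snap_clip(types, 4, 10))
```

In this example a clip requested for frames 4 through 10 is moved to frames 0 through 9, which shows how far a clip boundary can drift from the frame an editor actually requested.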
Unfortunately, typical encoded video bit streams have a large number of P and B frames in between I-frames. This disadvantageously limits the locations at which a clip may be performed, and therefore renders encoded MPEG bit streams unsuitable for the video editing industry, which demands frame-accurate precision.
A further disadvantage associated with conventional editing engines is an inability to seek to a target video frame without the time-consuming step of reading and decoding each and every frame in a file. That is, before a seek to a particular video frame is performed, an editor must read and decode each video frame in the file to determine the temporal reference of each frame. Once each frame is read and decoded, a seek to the target frame may be performed. Unfortunately, a large majority of video files are very large. For example, a three hour video file can have up to about 324,000 video frames when the frame rate is 30 frames per second. As can be appreciated, reading and decoding each of the 324,000 video frames before a seek to a target frame is performed is extremely laborious and unsuitable for performing today's video editing tasks. In addition, conventional seeking algorithms must also read and decode a video file before the exact number of frames in the video file is ascertained.
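The frame count in the example above follows from simple arithmetic, sketched below in Python. The function name total_frames is hypothetical, and a constant integer frame rate is assumed.

```python
SECONDS_PER_HOUR = 3600


def total_frames(hours, frames_per_second):
    """Number of frames in a file of the given duration, assuming a
    constant frame rate throughout."""
    return hours * SECONDS_PER_HOUR * frames_per_second


if __name__ == "__main__":
    # 3 hours x 3600 seconds/hour x 30 frames/second
    print(total_frames(3, 30))
```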
In view of the foregoing, what is needed are methods and apparatuses for efficiently seeking to a target video frame within a video file without having to first laboriously read and decode each and every frame in the video file.