The present invention relates to image sequence compression. More particularly, this disclosure provides a compression system that utilizes independently coded regions to permit select extraction of image objects, or editing of select areas of an image frame, without necessarily decompressing all image data in each frame. This disclosure also provides a mechanism of tracking the objects and regions across multiple frames such that, if desired, they may be independently coded and extracted from a video sequence.
Conventional editing or other processing of film or video images is performed in the “spatial” domain, that is, upon actual images rather than upon a compressed representation of those images. Since the final product of such editing or processing is frequently an uncompressed signal (such as a typical “NTSC” television signal), such editing or processing can sometimes be accomplished in real time with today's digital editors and computers. With the increasing tendency toward high resolution pictures such as high definition television (“HDTV”), however, Internet, cable, television network and other service providers will likely all have to begin directly providing compressed signals as the final product of editing. As used herein, the term “video” will refer to any electronic signal that represents a moving picture sequence, whether digital, NTSC, or another format.
One problem with the new digital standards concerns efficient and quick processing of video; with video stored or transmitted in compressed format under the new standards, it is computationally difficult to decompress video, process that video in the spatial domain, and then recompress output video. Examples of processing compressed video prior to display include providing fast forward, reverse and other effects typically associated with VCRs. Other processing examples associated with the production or broadcast of video include color correction, logo insertion, blue matting, and other conventional processes.
To take one example of this computational difficulty, in logo insertion, a local television station might receive a compressed satellite feed, insert its own TV station logo in a corner of the image that will be seen on viewers' TV sets, and then broadcast a TV signal over cable, back over satellite or through the airwaves. Conventionally, the processing could be performed in real time or with a short delay, because it is relatively easy to decompress an image, modify that image in the spatial domain and transmit a spatial domain signal (e.g., an uncompressed NTSC signal). With HDTV and other new digital standards, which call for all transmissions in a compressed format, this quick processing becomes much more difficult, since it is very computationally expensive to compress a video signal.
All of the video examples given above, e.g., logo insertion, color correction, fast forward, reverse, blue matting, and similar types of editing and processing procedures, will collectively be referred to interchangeably as “editing” or “processing” in this disclosure. “Fast forward” and similar features commonly associated with a video cassette recorder (“VCR”) are referred to in this manner, because it may be desired to change the sequence or display rate of frames (thereby modifying an original video signal) and output a new, compressed output signal that includes these changes. The compressed output signal will often require that frames be re-ordered and re-encoded in a different format (e.g., to depend upon different frames), and therefore is regarded as one type of “editing.”
In most of the examples given, since editing or processing is typically done entirely in the spatial domain, a video signal must typically be entirely decompressed to the spatial domain, and then recompressed. These operations are typically required even if only a small part of an image frame (or group of frames) is being edited. For example, taking the case of logo insertion in the bottom right corner of an image frame, it is extremely difficult to determine which part of a compressed bit stream represents a frame's bottom right corner and, consequently, each frame of the video sequence is typically entirely decompressed and edited. If it is desired to form a compressed output signal, frames of the edited signal must then typically be compressed anew.
In this regard, many compression formats are based upon “motion estimation” and “motion compensation.” In these compression formats, blocks or objects in a “current” frame are recreated from similar blocks or objects in one or two “anchor” frames. “Motion estimation” refers to a part of the encoding process where a computer searches, for each block or object of a current frame, for a similar image pattern within a fairly large area of each anchor frame, and determines a closest match within this area. The result of this process is a motion vector which usually describes the relative position of the closest match in an anchor frame. “Motion compensation” refers to another part of the encoding process, where differences between each block or object and its closest match are taken, and these differences (which are ideally all zeros if the match is “good”) are then encoded in some compact fashion, often using a discrete cosine transform (“DCT”). These processes simply imply that each portion of the current frame can be almost exactly reconstructed using the location of a similar looking portion of the anchor frame as well as difference values. Not every frame in a sequence is compressed in this manner.
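The two steps just described can be illustrated with a minimal sketch. It assumes frames held as plain lists of integer intensity rows, an exhaustive search, and a sum-of-absolute-differences match criterion; the function names (`sad`, `estimate_motion`, `residual`) and the test frames are hypothetical and are not taken from any standard or from this disclosure. The DCT coding of the differences is omitted.

```python
def sad(cur, anchor, bx, by, ax, ay, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the n x n block of `anchor` at (ax, ay)."""
    return sum(
        abs(cur[by + r][bx + c] - anchor[ay + r][ax + c])
        for r in range(n)
        for c in range(n)
    )


def estimate_motion(cur, anchor, bx, by, n, radius):
    """Motion estimation: try every candidate offset within `radius` and
    return the motion vector (dx, dy) of the closest match."""
    height, width = len(anchor), len(anchor[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ax, ay = bx + dx, by + dy
            # Skip candidates that fall partly outside the anchor frame.
            if not (0 <= ax <= width - n and 0 <= ay <= height - n):
                continue
            cost = sad(cur, anchor, bx, by, ax, ay, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def residual(cur, anchor, bx, by, mv, n):
    """Motion compensation: differences between the current block and its
    closest match in the anchor frame (all zeros for a perfect match)."""
    dx, dy = mv
    return [
        [cur[by + r][bx + c] - anchor[by + dy + r][bx + dx + c] for c in range(n)]
        for r in range(n)
    ]


# Hypothetical test data: the current frame is the anchor shifted right by one pixel.
anchor = [[7 * x + 13 * y for x in range(8)] for y in range(8)]
current = [[anchor[y][x - 1] if x >= 1 else 0 for x in range(8)] for y in range(8)]

mv = estimate_motion(current, anchor, bx=2, by=2, n=4, radius=2)
diff = residual(current, anchor, 2, 2, mv, 4)
print(mv)    # (-1, 0): the match lies one pixel to the left in the anchor
print(diff)  # a 4 x 4 block of zeros
```

Because the shifted block has an exact counterpart in the anchor frame, the differences are all zero, which is the "good" match case mentioned above.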
Motion estimation is very computationally expensive. For example, in applying the MPEG-2 standard, a system typically takes each block of 8×8 pixels and searches for a closest match within a 15×15 pixel search window, centered about the expected location for the closest match; such a search involves 64 comparisons to find the closest match, and each comparison in turn requires 64 separate subtractions of multi-bit intensity values. When it is considered that a typical image frame can have thousands of 8×8 pixel blocks, and that this searching is typically performed for the majority of frames in a video sequence, it becomes quite apparent that motion estimation is a computationally expensive task.
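The arithmetic behind these figures can be checked with a short calculation. An 8×8 block fits at (15 − 8 + 1)² = 64 positions inside a 15×15 window, and each position requires one subtraction per pixel. The 720×480 frame size used below is an assumption chosen only for illustration, and the function names are hypothetical.

```python
def search_cost(block=8, window=15):
    """Return (candidate positions, subtractions per comparison,
    subtractions per block) for an exhaustive block search."""
    positions = (window - block + 1) ** 2
    subtractions_per_comparison = block * block
    return positions, subtractions_per_comparison, positions * subtractions_per_comparison


def frame_cost(width, height, block=8, window=15):
    """Subtractions needed to search every block of one frame against one anchor."""
    blocks = (width // block) * (height // block)
    return blocks * search_cost(block, window)[2]


print(search_cost())         # (64, 64, 4096)
print(frame_cost(720, 480))  # 22118400, assuming a 720x480 frame (5400 blocks)
```

Under these assumptions a single frame needs roughly 22 million subtractions for one anchor frame, before any comparison or bookkeeping overhead is counted.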
With the expected migration to digital video and more compact compressed transmission formats, it is apparent that a definite need exists for quick compression systems and for systems which provide quick editing ability. Ideally, such a system should permit decoding and editing of a compressed signal (e.g., VCR functions, logo insertion, etc.) yet permit real-time construction and output of a compressed, edited video signal that can be accepted by HDTV and other new digital systems. Ideally, such a system would operate in a manner compatible with existing object-based and block-based standards and desired editing procedures, e.g., such that it can specially handle a logo to be inserted into a compressed signal, as well as other forms of editing and processing. Further still, such a system ideally should be implemented as much as possible in software, so as to be compatible with existing computers and other machines which process video. The present invention satisfies these needs and provides further, related advantages.
The present invention solves the aforementioned needs by providing a system having independently coded regions. Using these regions, one may specially compress and encode a data sequence in a manner that permits extraction or editing of select objects in the spatial domain, without need to decode and decompress entire sequences. If it is desired to modify a compressed output signal to include modified data for an object (e.g., for an edited object), new data can be inserted as appropriate in the place of the extracted object; with the object being independently coded, all other compressed data for the sequence (e.g., background or other specific objects) may be exactly re-used. In real time applications, this ability facilitates editing and production of a compressed output signal using standard computer and editing equipment. As can be seen therefore, the present invention should have ready application to production, post production, network syndication, Internet, and other applications which call for the production of compressed video, audio and other signals.
The invention provides an apparatus that produces a signal representing multiple compressed data frames. The apparatus may be applied to audio or video data, or any other type of data that is suitable for storage or transmission as a sequence of related data frames. In the preferred embodiment, this form of the invention is applied to compressed video frames to generate independently coded regions as part of an output video sequence. The preferred embodiment may be applied by a network or video production house to generate an image sequence in compressed format (e.g., satellite transmission, DVD program, video tape or other program) in a manner optimized for quick or real-time editing. To take a few examples, with a compressed image sequence already processed to have independently coded regions, a local television station may insert logos and a post production house may provide color correction without completely decompressing and processing the entire image sequence, i.e., by processing only one or a small number of independently coded regions. Alternatively, the preferred embodiment may also be implemented in a digital VCR or by a local television station; by performing minor editing or processing (e.g., signal mixing, frame re-ordering for fast forward, logo insertion, etc.) without having to completely re-encode an entire video sequence, these entities may more easily generate a digital (HDTV) output signal in real-time or close to real-time.
According to a first form of the invention, a compression system encodes at least one data frame as an anchor frame and at least one other data frame in dependent format, such that each dependent frame may be recreated from one or two anchor frames. This form of the invention calls for identifying at least two data sets (appearing across multiple image frames) that are to be compressed independently of one another, and also for constraining motion search and compensation such that motion vectors for each data set in a dependent frame may only point to the same data set in one or two anchor frames. “Data sets” can refer to an object that appears in multiple frames (the object can vary in shape, size, color, intensity, etc.), as well as a static shape and position (e.g., each screen's lower right-hand corner, irrespective of image content).
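The constraint on motion search can be sketched as follows. This is an illustrative assumption about how such a constraint might be enforced, not the implementation of the preferred embodiment: each anchor frame is paired with a hypothetical per-pixel map of region labels, and a candidate block is accepted only if every one of its pixels carries the label of the data set being coded. All names and test data are hypothetical.

```python
def block_in_region(regions, ax, ay, n, region_id):
    """True if every pixel of the n x n anchor block at (ax, ay) belongs to region_id."""
    return all(
        regions[ay + r][ax + c] == region_id for r in range(n) for c in range(n)
    )


def motion_search(cur, anchor, bx, by, n, radius, regions=None, region_id=None):
    """Exhaustive block search returning (dx, dy, cost), or None if no
    candidate is allowed. When `regions` is given, candidates that touch
    any other region of the anchor frame are rejected."""
    height, width = len(anchor), len(anchor[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ax, ay = bx + dx, by + dy
            if not (0 <= ax <= width - n and 0 <= ay <= height - n):
                continue
            if regions is not None and not block_in_region(regions, ax, ay, n, region_id):
                continue
            cost = sum(
                abs(cur[by + r][bx + c] - anchor[ay + r][ax + c])
                for r in range(n)
                for c in range(n)
            )
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Hypothetical 8 x 4 frame split into two regions: columns 0-3 and columns 4-7.
anchor = [[10 * x + y for x in range(8)] for y in range(4)]
regions = [[0 if x < 4 else 1 for x in range(8)] for y in range(4)]

# The current block at (4, 0) copies anchor pixels that straddle the boundary.
current = [row[:] for row in anchor]
for r in range(2):
    for c in range(2):
        current[r][4 + c] = anchor[r][3 + c]

free = motion_search(current, anchor, 4, 0, 2, 2)
bound = motion_search(current, anchor, 4, 0, 2, 2, regions, region_id=1)
print(free)   # (-1, 0, 0): perfect match, but it reaches into region 0
print(bound)  # (0, 0, 40): best match that stays inside region 1
```

The constrained search accepts a worse match (a larger residual) in exchange for a motion vector that never refers to data outside its own region, which is what allows the region to be decoded or replaced without touching the rest of the frame.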
In a second form of the invention, there will be at least two frames, one of which is to be compressed as a dependent frame, and another of which is to be compressed as an anchor frame. Typically, the dependent frame is recreated by first decompressing the anchor frame to generate spatial domain data and, second, taking motion vectors and residuals associated with the dependent frame and “building” the dependent frame's content using “pieces” of the already-decompressed anchor frame. This form of the invention calls for generating a compressed output signal by providing a user with ability to designate spatial domain data in a dependent frame, by automatically associating data from another, anchor frame with that data, and by compressing an output sequence in a manner such that the dependent frame is compressed into motion vector-plus-residual format, with all motion vector dependency of the dependent frame constrained to only point to associated data of an anchor frame.
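The "building" of a dependent frame from pieces of a decoded anchor frame can be sketched as a round trip. The sketch assumes frames as lists of integer rows and lossless residuals (no DCT or quantization); the motion vectors are chosen arbitrarily rather than searched for, and all names and data are hypothetical.

```python
def predict(anchor, bx, by, mv, n):
    """Copy the n x n piece of the anchor frame that the motion vector points to."""
    dx, dy = mv
    return [[anchor[by + dy + r][bx + dx + c] for c in range(n)] for r in range(n)]


def encode_block(cur, anchor, bx, by, mv, n):
    """Residual of one block: current pixels minus the predicted piece."""
    pred = predict(anchor, bx, by, mv, n)
    return [[cur[by + r][bx + c] - pred[r][c] for c in range(n)] for r in range(n)]


def rebuild_frame(anchor, coded_blocks, width, height, n):
    """Recreate a dependent frame from the decoded anchor frame plus, for
    each block position, a (motion vector, residual) pair."""
    frame = [[0] * width for _ in range(height)]
    for (bx, by), (mv, res) in coded_blocks.items():
        pred = predict(anchor, bx, by, mv, n)
        for r in range(n):
            for c in range(n):
                frame[by + r][bx + c] = pred[r][c] + res[r][c]
    return frame


# Hypothetical 4 x 4 frames divided into four 2 x 2 blocks.
anchor = [[16 * y + x for x in range(4)] for y in range(4)]
dependent = [[anchor[y][x] + (x + y) % 3 for x in range(4)] for y in range(4)]

# Arbitrary in-bounds motion vectors, one per block position.
vectors = {(0, 0): (1, 1), (2, 0): (-1, 0), (0, 2): (0, -1), (2, 2): (-2, -2)}
coded = {
    pos: (mv, encode_block(dependent, anchor, pos[0], pos[1], mv, 2))
    for pos, mv in vectors.items()
}

rebuilt = rebuild_frame(anchor, coded, 4, 4, 2)
print(rebuilt == dependent)  # True
```

The round trip is exact here only because the residuals are stored losslessly; in a real codec the residuals are transformed and quantized, so reconstruction is approximate.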
Other forms of the invention are set forth by the claims below, including various methods, apparatuses and improvements. In more particular aspects, these forms of the invention may be implemented as video or audio encoders, transcoders and editing devices.
The invention may be better understood by referring to the following detailed description, which should be read in conjunction with the accompanying drawings. The detailed description of a particular preferred embodiment, set out below to enable one to build and use one particular implementation of the invention, is not intended to limit the enumerated claims, but to serve as a particular example thereof.