1. Field of the Invention
The present invention relates to a method and system of compressing or compacting digital data. More specifically, the invention pertains to an apparatus and method of maximizing the quality and efficiency of transferring compressed digital information.
The present invention improves various aspects of video data compression by dynamically (i.e., real time) applying selected filtering and encoding parameters during a second, delayed encoding of a stored representation of an input digital data structure, e.g., a bit-stream, based on the results of a first encoding of the input data structure (bit-stream).
2. Background Art
Transmission and/or storage of digital data are employed in nearly all data processing and transmission applications. Large scale archiving and retrieval of documents may require very high-capacity storage devices and media. Also, transfer of massive quantities of data over long distance communications facilities is rapidly increasing. Digital storage or communication systems are designed with a capacity limit for total bits or maximum bit rate, typically termed the "bit budget". In order to reduce the costs associated with the transfer and/or storage of massive amounts of data, the bit budget, i.e. the bit rate transmitted through a transmission channel of fixed capacity and/or the total number of bits to be stored in a storage media, is minimized as much as possible. In the field of digital video transmission or digital video storage and retrieval, great attention has been directed to data compression and compaction to reduce costs and improve performance under bit budget constraints.
Data compression may be classified as reversible or irreversible, where reversible means there is no loss of information, as opposed to irreversible, where there is some loss of non-relevant information (where the non-relevance depends on the given context). Irreversible compression is generally called compaction although there are no standard definitions. In the context of video transmission and storage, the term compression is often used loosely with either sense. For the purposes of this application, compression means either reversible or irreversible, unless otherwise specifically stated.
A basic reference on data compression is found in Encyclopedia of Computer Science, Third Edition, Anthony Ralston, et al., Van Nostrand Reinhold, New York, 1993, page 396. The issues of redundancy, security, portability and various compression techniques are briefly described therein.
Digital data compression plays an important role in video storage and transmission. The same techniques may also be used in other fields, e.g., audio signal transmission and storage, sonar signal transmission and processing, geophysical and weather data recording and transmission, and the like. Since the bandwidth requirements of video are generally many times greater than those of audio and the others, more attention has been given to compressing digital video data. However, the same techniques may be applied to compression of audio data as a subset of a program containing both audio and video content.
A discussion of video signal compression encoding, as established by the Moving Picture Experts Group (MPEG) is shown in the PCT patent WO 96/36182 by Maturi et al, (Maturi) incorporated herein by reference. Like the JPEG still image compression standard, MPEG is a multistage algorithm built around the discrete cosine transform (DCT). The two algorithms also share similar approaches to color space conversion, blocking, quantization, zigzag ordering, run length tokenizing, and Huffman encoding. MPEG goes further, adding interframe compression and interleaved audio. As with most interframe compression schemes, MPEG is asymmetric--compression requires more effort than decompression.
With reference to FIG. 1, there is shown a simplified block diagram of a portion of a typical MPEG-2 encoding process 10. An incoming uncompressed digital data stream, e.g. video stream 12 is input to Discrete Cosine Transform (DCT) encoder 14, to produce DCT coefficients which are quantized by quantizing means 16. The DCT coefficients are then arranged in zigzag order by zigzag scanning means 18. The ordered DCT coefficients are further encoded (variable length (VLC) and token or Huffman encoding), e.g. encoded by VLC 20. Simultaneously, the uncompressed video stream 12 is being processed in a motion detection and motion vector generation (MVG) means 22 by comparison to previously stored (not shown) digitized video data.
Generally the MVG 22 performs motion estimation (ME) and exchanges ME data 24 with the VLC encoding process 20 and a picture type determining means 26. A final MPEG-2 coded video stream 30 is output for recording or transmission.
Maturi discusses MPEG encoding in general, and describes one example of an MPEG encoder using an encoder decision block (EDB) for compressing/decompressing in the MPEG-2 protocol. See in particular FIG. 2 of Maturi reproduced here as FIG. 2, and the description from page 1, line 30 to page 5, line 21 of Maturi.
An exemplary MPEG-2 encoder 200 after Maturi is shown generally in FIG. 2. A Video Interface (VI) 202 receives an incoming digitized video signal (DV) 204. The VI 202 electrically communicates directly with a Motion Estimator (ME) 206, an encoder memory controller (MDC) 207, and is electrically connected to a bus 208. The ME 206 is also connected in direct electrical communication with the MDC 207, and bus 208.
The MDC 207 is also in direct communication with an Encoder Memory (ENC/MEM) that may be, for example, part of a larger Main DRAM (MDRAM) memory 226 (not part of the MPEG-2 encoder 200), the bus 208 and an encode pipe (ENC) 212. An Encoding Decision Block (EDB) 210 is in direct electrical communication with the bus 208, the ENC 212 and a Video System Stream Multiplexor (VSM) 214. The ENC 212 also communicates directly with the bus 208 as does the VSM 214. An Output Stream Interface (OSI) 216 is electrically connected directly to a Host Interface (HI) 218, the bus 208 and outputs an MPEG-2 encoded Output Video Stream (OV) 230.
An Audio Interface (AUD) 220 accepts a Compressed Audio digital stream (CA) 228 and is directly connected to the HI 218 and bus 208. A Host Computer (HC) 222 (not part of the MPEG-2 encoder) is directly connected to the HI 218 which provides communication to a Host DRAM (HD) 224 not part of the MPEG-2 encoder, i.e. a companion memory for storing data and programs used in conjunction with the encoder 200. The HI 218 also connects directly to the bus 208 for communication with all the elements similarly connected to the bus 208.
The EDB 210 retrieves digital video data for a macro block (explained further below) from the encoder memory 226 together with a corresponding motion vector (not shown) from the host DRAM 224, and processes the macro block video data to determine various encoding conditions the encoder 200 will apply in encoding the particular macro block. For example, the EDB 210 decides whether a macro block is to be intra-frame (I) or inter-frame (P) encoded, as well as determining the quantization for a particular macro block. The EDB 210 also causes the motion vector to specify a DCT type of translation field for a macro block, as well as selecting an encoding for the macro block. The encoding selection should produce a pre-specified bit rate at which data will be fed through the remaining components of the encoder chip 200.
The Encode Pipe (ENC) 212 is connected to receive the output from the EDB 210, and actually encodes the macro blocks. The encoding may be performed in accordance with encoding conditions determined by the EDB 210, or alternatively, may be performed in accordance with encoding conditions that are supplied to the encoder chip through the HI 218. The ENC 212 computes the DCT of a block, quantizes the coefficients and performs run length (RLL) and variable length coding (VLC and/or Huffman).
The EDB 210 compresses the digitized video by subdividing successive macro blocks of digital video data into 8×8 blocks and processing the blocks.
The AUD 220 is provided to compress digital audio data in accordance with the MPEG-2 protocol, and transmit compressed audio data to the host DRAM 224. The other system components, e.g., the VSM 214 and the OSI 216, provide for converting from a parallel to a serial bit stream and the like. A source of compressed audio data 228 is connected to the audio interface 220 for inclusion in a system output data stream 230 from the output stream interface 216.
Maturi also discusses a motion estimator used for determining the direction and amount of motion between associated macro blocks belonging to related P or B frames. Once a motion vector is determined it is stored in the host DRAM 224 via a host interface. Motion vectors may be used in later calculations, thereby saving microprocessor execution cycles.
FIG. 3 depicts an encoded video image, for example, in the MPEG-2 format. A succession of video frames (302 . . . 306) makes up a group of pictures, GOP(J) 300, organized by a coding algorithm into a digitized video pixel serial bit stream, pJ[i]. A typical coding protocol, e.g. the (4:2:0) coding protocol, is shown in FIG. 3. The GOP(J) 300 is composed of a first frame 302 to a last frame 306, separated by a plurality of frames 304. There are generally three different encoding formats, which may be applied to video data. The frames 304 are generally composed of Intra (I), Predicted (P) and Bi-directional (B) frames.
The format of the GOP structure may vary. The ISO/IEC 13818-2 video specification indicates GOPs can contain one or more pictures or frames. Two typical formats are 12 frames/GOP, derived from film, which is 24 frames per second (fps), and 15 frames/GOP, which is more suitable for video at 30 fps. Other GOP/frame formats may be encompassed by video compression systems, e.g., PAL at 25 fps. Essentially, the choice of the number of frames (Nf) or pictures to be included as the GOP, is independent of the frame rate (fps) of the media from which the video data is derived.
Frames 304-306 are divided into slices, e.g., slice 307. Slices 307 are divided into macro blocks 308 that are further divided into luminance blocks 310 (Y1-Y4) and chrominance blocks 312 (Cb, Cr). Each of the blocks 310, 312 is composed of blocks 314 organized from pixels (pJ[i]) 320.
The Intra or I frame coding produces an I block, designating a block of data where the encoding relies solely on information within the video frame where the macro block 308 of data is located. Inter-frame coding may produce either a P block or a B block. A P block designates a block of data where the encoding relies on a prediction based upon blocks of information found in a prior video frame. A B block is a block of data where the encoding relies on a prediction based upon blocks of data from surrounding video frames, i.e., a prior I or P frame and/or a subsequent P frame of video data. One means used to eliminate frame-to-frame redundancy is to estimate the displacement of moving objects in the video images, and encode motion vectors representing such motion from frame to frame.
A GOP will necessarily start with an I frame and continue through a succession of multiple B and P frames 304 until the last frame 306, also generally an I frame. The frames are further subdivided into slices 307 representing, for example, a plurality of image lines. Each slice 307 is composed of macro blocks 308. Each macro block 308 is a block array comprising four blocks 310 of pixel luminance (intensity) (Y1-Y4) and two blocks 312 representing interpolated pixel chrominance.
The 4:2:0 macro blocks 308 are further divided into 4 luminance blocks 310 (Y1, Y2, Y3, and Y4) and 2 chrominance blocks 312 (Cr, Cb). Each block 310, 312 consists of an 8×8 array of bytes, each luminance byte being the intensity of the corresponding pixel, and each chrominance byte being the chrominance intensity interpolated from the chrominance intensity of four adjacent pixels. The MPEG-2 protocol encodes luminance and chrominance data and then combines the encoded video data into an MPEG-2 compressed video bit stream.
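By way of illustration only, the 4:2:0 macro block layout described above may be sketched as follows. The function names, the Y1-Y4 ordering (top-left, top-right, bottom-left, bottom-right) and the use of a rounded 2×2 average as the chrominance interpolation are assumptions made for this sketch; they are not taken from Maturi or from the MPEG-2 specification.

```python
def split_luma(luma16):
    """Split a 16x16 luminance array into four 8x8 blocks.

    Assumed order: Y1 top-left, Y2 top-right, Y3 bottom-left, Y4 bottom-right.
    """
    blocks = []
    for row0 in (0, 8):
        for col0 in (0, 8):
            blocks.append([row[col0:col0 + 8] for row in luma16[row0:row0 + 8]])
    return blocks


def subsample_chroma(chroma16):
    """Reduce a 16x16 chrominance array to 8x8.

    Each output byte is the rounded average of four adjacent input pixels
    (a 2x2 group); this averaging rule is an assumption of the sketch.
    """
    out = []
    for r in range(0, 16, 2):
        out_row = []
        for c in range(0, 16, 2):
            total = (chroma16[r][c] + chroma16[r][c + 1]
                     + chroma16[r + 1][c] + chroma16[r + 1][c + 1])
            out_row.append((total + 2) // 4)
        out.append(out_row)
    return out


def make_macroblock(y16, cb16, cr16):
    """Assemble a hypothetical 4:2:0 macro block: 4 luma blocks + Cb + Cr."""
    return {"Y": split_luma(y16),
            "Cb": subsample_chroma(cb16),
            "Cr": subsample_chroma(cr16)}
```

Under these assumptions a macro block carries 4 × 64 luminance bytes and 2 × 64 chrominance bytes, i.e. 384 bytes for a 16 by 16 pixel area.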
The meaning of the video bit stream nomenclature has some ambiguity in the art. For the purposes of this discussion, video bit stream means both "raw" bit streams, i.e. uncompressed (may be encoded or un-encoded) video bit streams, and compressed (may be encoded prior to or after compression) video bit streams (compressed by the methods and apparatus of this invention). In this discussion the terms input and output will indicate which of the two bit streams are concerned. "Input" means uncompressed video bit stream; "output" means a compressed video bit stream.
In addition, an input video bit stream may have previously been compressed by some other means than that of the current discussion. In that case, the term, input, means the video bit stream provided to the particular device or method being considered, prior to the compression of the device or method.
After the video data is encoded it is then compressed, buffered, modulated and finally transmitted to a decoder (not shown) in accordance with the MPEG-2 protocol. The MPEG-2 protocol typically includes a plurality of layers each with respective header information. Nominally each header includes a start code, data related to the respective layer and provisions for adding header information, e.g., see FIG. 5.
With reference to FIG. 3, there is shown a typical arrangement of video data, that comprises a portion, GOP(J), of digital data compressed in the example above. A first portion and second portion of video data stream 102 will typically correspond to at least a first group of pictures (GOP(J)) and second group of pictures (GOP(J+1) not shown) one of which is shown in FIG. 3. The GOPs 300 are composed of a succession of digitized video frames 302-306. The frames 302-306 are of three types: an I frame, a P frame and a B frame.
An MPEG-2 I frame contains all the digital data needed to decode and reconstruct the uncompressed picture for that frame within its own data set without reference to any other data. The MPEG-2 P frames are digital data sets which can be decoded to recreate an entire uncompressed frame of video data by reference only to a prior decoded I frame or by reference to a prior decoded P frame.
The MPEG-2 B frames are digital data sets which may be decoded to recreate an entire uncompressed frame of video data in three ways. B-frames are recreated by reference to only a prior reference frame, to only a subsequent reference frame, or to both a prior and to a subsequent reference frame, (by reference to a future or previous decoded I, P frame or frames).
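One consequence of the dependencies described above is that a B frame cannot be decoded until the subsequent reference frame it refers to has been decoded. The following is a minimal sketch, assuming a simple rule in which every run of B frames is moved behind the next I or P frame; the function name and the handling of trailing B frames are hypothetical and serve only to illustrate the idea.

```python
def coding_order(display_types):
    """Return display indices rearranged so references precede B frames.

    display_types is a sequence of 'I', 'P' or 'B' in display order.
    Assumption: B frames left at the end of the list (which would depend on
    the next GOP) are simply appended.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            # Hold the B frame back until its later reference is emitted.
            pending_b.append(index)
        else:
            order.append(index)
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)
    return order
```

For a display sequence I B B P B B P, this sketch yields the index order 0, 3, 1, 2, 6, 4, 5.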
Each frame 304 is divided into a sequence of slices 307. The slices 307 are further divided into a succession of macro blocks 308. The macro blocks 308 typically separate the luminance and chrominance data into a 4:2:0 encoding scheme. This contains the pixel luminance data (Y data) in four 8×8 blocks 310 Y1, Y2, Y3, Y4, and the pixel chrominance data (Cr, Cb) in two 8×8 blocks 312 which cover the same area, but which are sub-sampled at half the spatial frequency.
In the typical 4:2:0 sampling scheme for digitized video encoding, each macro block 308 thus consists of a 16 by 16 pixel (or pel) array of digitized luminance video data and two 8×8 pixel arrays 312 of digitized chrominance data. If the current frame is to be an I frame, which does not depend on any other frame, the encoding process proceeds to the next step. If the current frame is to be a P frame, however, interframe correlation must be performed first.
For each macro block, a reference frame is searched for the best match. Upon finding an exact or near-exact match, only a pointer to the matching pixels must be encoded. This pointer is called a motion vector and typically only requires a few bits of storage space. The search is performed by the motion estimator 206. Often a perfect match is unavailable because objects don't simply float across the screen, but may also rotate, fade, change shape and move toward or away from the viewer. In this case, the search may still provide a partial match from which there may be computed the difference between current and reference pixels. The resulting difference pixels are often highly correlated, and therefore, are amenable to compression.
Blocks coded differentially require more storage than perfectly matched blocks, but still save coded bytes over intra coding. In the event of search results so poor that even differential coding is not practical, the data for that macro block is simply coded as intra-type.
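The search and differencing described above may be illustrated by the following sketch. It assumes an exhaustive search over a small window and a sum-of-absolute-differences (SAD) matching cost; neither choice is stated in the text, and practical motion estimators such as the ME 206 may use quite different strategies. All names are hypothetical.

```python
def sad(cur_block, ref_frame, ref_row, ref_col):
    """Sum of absolute differences between a block and a reference area."""
    size = len(cur_block)
    return sum(abs(cur_block[r][c] - ref_frame[ref_row + r][ref_col + c])
               for r in range(size) for c in range(size))


def find_motion_vector(cur_block, ref_frame, top, left, search=4):
    """Exhaustively search +/- `search` pixels around (top, left).

    Returns (dy, dx, cost) for the best match. Assumes (top, left) itself
    lies inside the reference frame, so at least one candidate is valid.
    """
    size = len(cur_block)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue  # candidate falls outside the reference frame
            cost = sad(cur_block, ref_frame, r, c)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


def residual(cur_block, ref_frame, top, left, dy, dx):
    """Difference pixels between the block and its matched reference area."""
    size = len(cur_block)
    return [[cur_block[r][c] - ref_frame[top + dy + r][left + dx + c]
             for c in range(size)] for r in range(size)]
```

A cost of zero corresponds to the perfect match case, where only the motion vector need be coded; a small nonzero cost corresponds to the differential case, where the residual is passed on for transform coding.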
For B frames, a similar process is used, but both previous and future frames are searched for the reference pixels. Having two separate reference frames available yields higher correlation and thereby higher compression.
The motion estimator 206 determines the direction and amount of motion between macro blocks 310 belonging to different frames of the video data being encoded.
After interframe correlation, differentially encoded P and B blocks and those coded as intra-type are fed into a DCT which maps each 8×8 block of pixels into 64 frequency coefficients. Each coefficient represents a weighting factor for a corresponding cosine curve.
The 64 basis cosine curves vary in frequency: low frequencies describe the block's coarse structure, while high frequencies fill in the detail. Adding the 64 weighted basis curves together will reproduce the original 64 pixels. By itself, the encoding of the DCT provides no compression. But the lack of extreme detail in most image blocks means high-frequency coefficients are typically zero or near zero.
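The transform described above may be sketched as follows. This is a direct, unoptimized two-dimensional DCT with an orthonormal scaling, chosen here so that the inverse reproduces the input; the scaling convention and all names are assumptions of the sketch, and hardware encoders use fast factorizations instead.

```python
import math

N = 8  # block size used throughout the text

# COS[x][u] = cos((2x + 1) * u * pi / (2N)): the sampled basis cosine curves.
COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for u in range(N)]
       for x in range(N)]
# Orthonormal scale factors (an assumed convention) so idct_2d inverts dct_2d.
ALPHA = [math.sqrt(1.0 / N)] + [math.sqrt(2.0 / N)] * (N - 1)


def dct_2d(block):
    """Map an 8x8 block of pixels to 64 frequency coefficients."""
    return [[ALPHA[u] * ALPHA[v] * sum(block[x][y] * COS[x][u] * COS[y][v]
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Add the 64 weighted basis curves together to rebuild the pixels."""
    return [[sum(ALPHA[u] * ALPHA[v] * coeffs[u][v] * COS[x][u] * COS[y][v]
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]
```

For a completely flat block, only the coefficient at position [0][0] (the DC term) is nonzero, which illustrates why blocks lacking detail compress well once the near-zero coefficients are discarded by quantization.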
To increase the number of zero frequency coefficients and reduce the number of bits needed for nonzero frequencies, each coefficient is divided by a quantizer value. Because the eye is less sensitive to errors in high-frequency coefficients, the quantizer values tend to increase with frequency. MPEG uses one quantizer table for intra-macro blocks and another for non-intra macro blocks. The encoder chip can use the default quantizer tables or use customized quantizer tables.
Quantizing causes a loss of image content. Large quantizer values cause more loss (i.e. greater image degradation, lower quality) but also deliver higher compression. This effect can be used to hold the output stream 230 data bit rate to a desired constant value, e.g. to not exceed a channel capacity constraint. On the other hand, if a frame uses more bits than allocated, the quantizer value is adjusted until the bit count falls below a preestablished maximum. MPEG normally provides a quantizer scale parameter, which can be adjusted once per macro block, expressly for this purpose.
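The trade-off between quantizer value and bit count may be illustrated by the following sketch. The quantization rule (division by table entry times scale, then rounding), the bit-count estimate and the linear search over scale values 1 to 31 are simplifying assumptions made for illustration; they do not reproduce the exact MPEG-2 quantization arithmetic or any particular rate control algorithm.

```python
def quantize(coeffs, table, scale):
    """Divide each coefficient by (table entry * scale) and round."""
    return [[int(round(c / (q * scale))) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]


def estimate_bits(levels):
    """Hypothetical cost model: magnitude bits plus a fixed 4-bit overhead
    for every nonzero level. Real encoders count actual VLC bits."""
    return sum(abs(v).bit_length() + 4
               for row in levels for v in row if v != 0)


def fit_to_budget(coeffs, table, max_bits, max_scale=31):
    """Raise the quantizer scale until the estimate fits within max_bits.

    Returns (scale, levels). If even max_scale does not fit, the result
    for max_scale is returned.
    """
    for scale in range(1, max_scale + 1):
        levels = quantize(coeffs, table, scale)
        if estimate_bits(levels) <= max_bits:
            break
    return scale, levels
```

A larger scale drives more coefficients to zero and shrinks the rest, so the estimated bit count falls as image degradation rises, which is the mechanism the text describes for holding the output bit rate within a limit.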
The quantized frequency coefficients are zigzag ordered. Zigzag ordering produces long zero runs suited to run-length encoding. In run-length encoding, each run of zeros is expressed as a data token describing the number of zero value frequency coefficients in the run and the value of the nonzero frequency coefficient that ends it. These tokens are further compressed through Huffman coding, which converts each token into a variable length code (VLC). VLCs for more common tokens are 2-3 bits long, whereas VLCs for rare tokens are up to 28 bits long. The final bit stream, consisting mainly of very short codes, is roughly one-third the size of the run token stream.
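The zigzag ordering and run-length tokenizing described above may be sketched as follows. The scan shown is the conventional diagonal zigzag; the token format, a (run, value) pair followed by an "EOB" end-of-block marker for the trailing zeros, is a simplification assumed for this sketch, and the subsequent Huffman stage is omitted.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag scan order."""
    def key(pos):
        row, col = pos
        diagonal = row + col
        # Odd diagonals run downward (row increasing), even ones upward.
        return (diagonal, row if diagonal % 2 else -row)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block of quantized coefficients in zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_length_tokens(sequence):
    """Express each zero run and the nonzero value ending it as one token."""
    tokens = []
    run = 0
    for value in sequence:
        if value == 0:
            run += 1
        else:
            tokens.append((run, value))
            run = 0
    tokens.append("EOB")  # assumed marker standing in for trailing zeros
    return tokens
```

Because quantization leaves most nonzero values near the low-frequency corner, the scan groups the zeros into a few long runs, and a block with only a handful of nonzero coefficients reduces to a handful of tokens.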
As the data stream is compressed, MPEG-defined data header packets may be inserted to assist the downstream decoder. Each header begins with a unique 32-bit start code. A sequence header may define frame rate, and other characteristics. Group of pictures (GOP) headers may be inserted before I frames to indicate random access entry points. Picture headers may be inserted to identify each frame and communicate frame-specific information, such as I, P, or B frame type. Slice codes may be inserted at various locations in the frame to give downstream decoders a chance to be re-synchronized if an error condition, such as a corrupted bit stream, is detected. The compressed data combined with the headers make up a fully defined MPEG-2 output video data stream 230.
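The way a decoder may locate these headers can be illustrated by the following sketch, which scans a byte buffer for 32-bit start codes. The three-byte prefix 0x000001 and the code byte values shown are those generally given for ISO/IEC 13818-2 video streams and should be verified against that specification; the function names and the returned labels are hypothetical.

```python
START_CODE_PREFIX = b"\x00\x00\x01"  # first 24 bits of every start code


def classify(code_byte):
    """Label the final byte of a start code (values assumed per 13818-2)."""
    if code_byte == 0x00:
        return "picture"
    if 0x01 <= code_byte <= 0xAF:
        return "slice"
    if code_byte == 0xB3:
        return "sequence_header"
    if code_byte == 0xB8:
        return "group_of_pictures"
    return "other"


def find_start_codes(data):
    """Return (offset, label) for every complete start code in `data`."""
    found = []
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        found.append((pos, classify(data[pos + 3])))
        pos = data.find(START_CODE_PREFIX, pos + 3)
    return found
```

A decoder that loses synchronization after a corrupted segment can resume at the next offset reported by such a scan, which is the re-synchronization role the text attributes to slice codes.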
The audio interface 220 uses the MPEG audio compression algorithm with sub-band filters to divide the audio signal into frequency bins. Fewer bits may be allocated to the less-audible bins, to achieve compression of the audio data. MPEG can typically store CD-quality audio at compression ratios of up to 8:1 or more. Audio compression as high as 20:1 can be achieved with additional audio degradation if desired.
As a final step, the compressed audio bit streams from the audio interface 220 and the compressed video bit stream from the video system stream multiplexor 214 are packetized and combined in the output stream interface 216 to output the final compressed MPEG system stream 230. Time stamps may be inserted to help the downstream decoder separate and synchronize the audio and video bit streams.
Decompression essentially reverses the process described for compression, but requires fewer computations, as there is no need for the pixel searches (i.e., motion estimation).
The encoding decision block 210 performs macro block 310 intra/inter/Quantizing decisions, makes field and frame decisions, and performs rate control, half-pel motion estimation and video buffer verifier (VBV) calculations. The encode pipe 212 encodes macro blocks 310 comprising the digitized video data 204. Based either on results from the encoding decision block 210, or based upon parameters specified via the host interface 218, the encode pipe 212 accepts and processes the digitized video data 204 together with data produced by the encoding decision block 210.
While the motion estimator 206 determines motion vectors and the encoding decision block 210 selects an encoding mode by processing 16×16 macro blocks 310 (the macro blocks 310 being composed of four luminance blocks 311 and 2 chrominance blocks 312), the encode pipe 212 transforms the digitized video data and quantizes the transformed video data by processing 8×8 blocks 314. Each 8×8 block 314 is transformed by computing DCT coefficients, then the DCT coefficients thus computed are quantized to further reduce the amount of data.
If an interframe macro block cannot be adequately encoded simply by motion vectors alone, a DCT is computed of the difference between its reference frame(s) or field(s) and the macro block's video data, and the DCT coefficients obtained for this difference macro block are quantized. If P frame(s) provide the difference for an inter macro block, then the DCT is computed of the difference between the macro block's video data and the decoded P frame data to obtain the maximum accuracy in the macro block's encoding. The transformed, quantized data thus obtained are then coded into an MPEG-2 video stream using variable length coding (VLC or Huffman). The encode pipe 212 also performs the inverse of these operations simultaneously on P frame encoded data.
The final encoded output video bit stream 230 is output to a transmission or recording media, e.g. satellite, cable, DVD ROM, and the like.
With previous digital data compression systems for real-time compression of video data, digital data is compressed using a variety of mathematical algorithms. The algorithms are processed with hardwired circuits and/or programmable micro- or multi-processing components comprising the encoder 200 connected to memory circuits, typically RAM, ROM and disk. In the design and use of such systems, choices must be made regarding the quality level of the encoded video output 230, derived from the compressed data, which will be accepted and the size of the communication channels and/or media on which and through which the compressed digital data 230 will be stored or sent.
For previous systems, this generally demands choosing the maximum available quality level or minimum acceptable quality level of the decoded representation of the compressed data as a constant, limited by the algorithms employed and the fixed compression parameters selected for the particular hardware involved. Alternatively, the maximum encoded bit rate allowable can be chosen, and compression parameter selection strategy fixed so that the quality level will usually be acceptable. This can often lead to the unfortunate consequence that portions of fast moving video scenes are grossly distorted, e.g. gross quantization causing objects to be displayed as unrecognizable blocks (the infamous cubical football). This can occur when the fixed parameters chosen for the selected filter and compression strategy can not meet the hard system limits, maximum bit rate or maximum bit capacity.
A typical limiting case strategy is to program the EDB to do gross quantization of problematic macro blocks, i.e. transmit only DC DCT coefficients for a block or blocks, to stay within the hard limits. Another strategy in non-real time applications, is to do manual post processing, i.e. human intervention by selecting and hand editing scenes.
Neither of these strategies is desirable, in terms of quality or cost. First, compression systems that are inherently unable to provide controllable quality levels throughout a video program having widely varying activity levels will not be able to guarantee acceptably uniform quality in real time. Second, video program material having widely varying video activity generally leads to increased cost (where post processing is necessary, or in number of DVD layers required for long program material). Third, distribution of the compressed video signal through rigidly specified outlets, i.e. through well known and completely controlled receiving hardware/sets may be limited by the bit rate variability of the compression system. Consumers, and consequently the advertisers who must pay for the received signal, may be less than sanguine about receiving programs whose quality level may fluctuate significantly, depending on program content.
In real time systems, i.e. cable and satellite TV, the instantaneous output encoded bit rate can exceed channel capacity, unless a) the overall quality of the decoded video bit stream is lowered by choosing a compression parameter strategy to avoid the possibility of gross quantization during particularly problematic scenes, or b) blocky artifacts are allowed to be created during fast motion, scene changes or the like; in any event sound or picture quality must be degraded to a lower level than desired over portions or all of an entire program.
Typical prior art compression schemes use what is known as panic mode quantization, i.e. transmitting only DC coefficients when an unexpected scene change, scene fade or picture components with very fast motion would otherwise create unacceptable encoded bit rates.
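The effect of such panic mode quantization on one block may be sketched as follows. The trigger condition, a simple comparison of a projected bit count against a ceiling, is an assumption made for illustration, as are the function names; the text does not specify how prior art encoders decide to enter panic mode.

```python
def panic_quantize(levels):
    """Keep only the DC term at [0][0]; zero every other coefficient."""
    return [[value if (r == 0 and c == 0) else 0
             for c, value in enumerate(row)]
            for r, row in enumerate(levels)]


def apply_bit_ceiling(levels, projected_bits, max_bits):
    """Fall back to DC-only coding when the projection exceeds the ceiling."""
    if projected_bits > max_bits:
        return panic_quantize(levels)
    return levels
```

Since the DC term carries only the average value of a block, each affected block decodes to a single flat tone, which accounts for the blocky appearance described in the text.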
In some prior art MPEG-2 encoder implementations, the encoder is designed to attempt to address these problems by performing measures of encoded quality and bit rate during the encoding of the frames of a GOP.
With reference to FIGS. 2 through 5 one typical prior art MPEG-2 encoding scheme will now be described. Referring to FIG. 4 and FIG. 2, a video buffer 400 may be incorporated in the encode pipe 212. One implementation of a video buffer is composed of two FIFOs, FIFO A 402, and FIFO B 404. An intermediate portion of the encoded video bit stream 420 (processed internally by ENC 212) is switched between an input 412 of FIFO A 402 and an input 414 of FIFO B 404 by input switch S1.
While the input switch S1 is connected to FIFO input 412, an output switch S2 connects an output 418 so that a segment (not shown) of encoded video bit stream data previously stored in FIFO B 404 is output as a corresponding segment of output video bit stream 422. The encoded input stream 420 remains connected to input 412 until a FIFO A ALMOST FULL signal 408 causes the switch S1 to switch the video bit stream input 420 to the FIFO B input 414. Signal 408 also causes the switch S2 to switch the video bit stream 422 to the FIFO A output 416.
The segment of video bit stream data stored in FIFO A 402 while S1 was connected to input 412 is connected by switch S2 through output 416 to the output encoded bit stream 422. Conversely, the switch S1 remains connected to FIFO B 404 until a FIFO B ALMOST FULL signal 410 causes switch S1 and S2 to toggle again. The segment of video bit stream data that has been stored in FIFO B 404 is then transferred out as the next segment of encoded output video bit stream data 422.
This type of video buffer is termed a "see-saw" buffer since FIFO A and FIFO B alternately store incoming video bit stream data and transmit outgoing video bit stream data while keeping the encoded output bit stream 422 continuous and independent of the bit rate of encoded input bit stream 420. The bit rate of the input encoded video data 420 will in general be highly dependent on the content and complexity of the video input.
The sawtooth graph 430 of FIG. 4 represents the alternate filling and emptying of the two FIFOs 402, 404 from minimum (empty) level 434 to almost full level 436. The encoded bit stream 422 will have an instantaneous bit rate 440 represented by the slope of the saw tooth 430. For highly complex video scenes, e.g. fast motion, fades and chaotic fire and water pictures, the encoded bit rate will be high, i.e. steep slope 440. Conversely, static or scenes with a deficit of detail will have low bit rate, i.e. approaching zero slope.
The changing instantaneous bit rate 440 combined with the size of the FIFOs 402, 404 will result in a variable period 438 between the toggling of the switches S1 and S2. This variable period 438 is typically used by the EDB 210 as a statistical measure, viz. Vbv, in an algorithm (not shown) to change encoding conditions by the ENC 212 or to change filtering conditions (not shown) by the VI 202 as is well known in the art.
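The relationship between input bit rate and toggle period described above may be illustrated with a small simulation. It assumes discrete time ticks, a single almost-full threshold shared by both FIFOs, and that the FIFO being read is always emptied before the next toggle; none of these details is given in the text, and all names are hypothetical.

```python
def simulate_see_saw(bits_per_tick, almost_full):
    """Simulate two alternating FIFOs fed by a variable-rate bit stream.

    bits_per_tick: encoded bits produced in each time tick.
    almost_full:   fill level at which the ALMOST FULL signal fires.
    Returns (periods, segments): the ticks between successive toggles, and
    the (fifo_name, bits) segments handed over to the output in order.
    """
    fill = 0
    ticks_since_toggle = 0
    writing_to = "A"
    periods = []
    segments = []
    for bits in bits_per_tick:
        fill += bits
        ticks_since_toggle += 1
        if fill >= almost_full:
            # ALMOST FULL: switches S1 and S2 toggle together.
            periods.append(ticks_since_toggle)
            segments.append((writing_to, fill))
            writing_to = "B" if writing_to == "A" else "A"
            fill = 0
            ticks_since_toggle = 0
    return periods, segments
```

With a threshold of 100 bits, ten ticks at 10 bits per tick followed by ten ticks at 50 bits per tick give periods of 10, 2, 2, 2, 2, 2 ticks: the period shortens as scene complexity raises the bit rate, which is why it can serve as a statistical measure for the EDB 210.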
The digitized video bit stream 204 of a typical prior art MPEG-2 encoder is processed in real time by the encoder 200. Decisions made by the prior art EDB 210 are based on a portion of the video data which has already been sent through the ENC 212 to the VSM 214 and subsequently to and through the OSI 216. Consequently, a prior art EDB decision takes effect on input video data which may have markedly different characteristics from the data that preceded it and that generated the statistical measure used for the EDB decision.
This has been particularly true for scene changes and fades that typically may cause drastic quantization, known as "blocky" artifacts. Blocky artifacts are typical in previous satellite systems in prior art encoded video since a "panic mode" strategy must generally be resorted to in order to keep the bit rate of the output encoded video 230 below a maximum channel capacity limit.
In prior art off-line compression systems, an operator has the opportunity to view problematic scenes and by iterative procedures choose a compromise between number of bits and quality level of individual scenes to stay within the constraint of fixed media size, i.e. there are only so many bits into which a movie may be encoded. A lengthy video feature with a high degree of action may require many scenes to have lower video quality than is otherwise desired in order to keep the movie on one reel of tape or one layer of one CD-ROM/DVD or the like.
In either previous art real-time or off-line digital video data compression systems, there is a high labor cost for trading off picture quality and digital data bit rate and/or required bit capacity.
It would be an advantage to selectively and dynamically modify compression of digital video data controllably and continuously within given bit rate and quality constraints. The quality level and/or instantaneous bit rate during short periods of time can then be managed within pre-set limits while maintaining higher quality over the majority of the program. Alternatively, the instantaneous transmitted bit rate can be limited to stay within channel capacity limits while the quality level of the subsequently decoded compressed video is simultaneously managed to a predetermined level.
A consumer and the advertiser paying for the program could be assured that the video quality of the compressed program material would have a consistent look throughout the entire program, irrespective of the type of program material. The vendor providing the compression could confidently guarantee the quality level would never degrade below a minimum and would have a high (and guaranteed) quality level for a maximum (and large) percentage of the program, and at the same time, never exceed a channel bit rate limit or a media bit capacity limit.
In previous digital video compression systems, difficulties can also arise when program content changes from film input (frame mode) to video (field mode). This commonly occurs when watching a film on TV, which then breaks to a commercial. The film was digitized at 24 frames per second (fps) and the commercial at 30 fps. This can instantly create a huge error budget and cause severe picture quality degradation. The change cannot be predicted, since the programming station may make the switch at any time. If this break occurs within a GOP, which it typically does, at least one frame may have an unacceptably large bit rate or bit size.
Another area of concern arises during scene changes, especially fades, where all the pixels of each frame may change intensity and/or color over an extended number of GOPs.
It would be an advantage to have a compression method and apparatus that could smoothly respond to field/frame changes and fast motion while still maintaining a high degree of picture quality.
Another item of concern in digital compression of video images is the unsteadiness of the source material. During the conversion of any one frame of the source material image (which is represented by a 2-dimensional analog of color and intensity; i.e. film, or a video frame) to another representation (in this case a digital signal) the position of the image of any one frame or picture relative to a device which is converting the image from analog to digital, may change. Not only may the image as a whole shift, but the position or shape of edges of the image may also vary relative to the image as a whole.
Film that is fed to a telecine during digital video capture may wobble back and forth in a kind of sinusoidal pattern as it passes through the telecine film gate. It typically "walks" back and forth, at some cadence. The rate of the cadence is dependent on the tension of the film, and many other factors. This is sometimes seen at a movie theater, where the image drifts laterally and vertically across the screen.
Due to wear in the mechanical system of a projector or telecine, the film can move up and down or back and forth. Vertically and horizontally the frame can move +/-4 pixels or more from a reference center. This wobble or walking of the frame causes unnecessary usage of encoding bits for an encoding algorithm, viz., MPEG, because the changes from frame to frame are interpreted as though the entire scene has changed, whereas the scene has just moved side to side or up and down. If the entire image can be moved back to where it should be, then the MPEG encoding does not produce unnecessary encoding bits. Global motion from other sources in video camera scenes, for example, unsteady hand-held video cameras, may also be compensated by the MSF.
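By way of illustration only, the idea of moving a walking frame back to its reference position can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any system described herein: frames are assumed to be small grayscale arrays held as lists of rows, the displacement is assumed to be a whole number of pixels within +/-4, and the function names (sad_at_offset, estimate_global_shift, shift_back) are hypothetical. The search is a plain exhaustive comparison of mean absolute differences.

```python
def sad_at_offset(ref, cur, dx, dy, margin):
    """Mean absolute difference between ref and cur displaced by (dx, dy).

    Only the interior window (margin pixels in from every edge) is compared,
    so that cur[y + dy][x + dx] is always inside the frame.
    """
    h, w = len(ref), len(ref[0])
    total = 0
    count = 0
    for y in range(margin, h - margin):
        for x in range(margin, w - margin):
            total += abs(ref[y][x] - cur[y + dy][x + dx])
            count += 1
    return total / count


def estimate_global_shift(ref, cur, max_shift=4):
    """Exhaustively search +/-max_shift pixels for the whole-frame offset.

    Returns (dx, dy) such that cur[y + dy][x + dx] best matches ref[y][x].
    """
    best = (0, 0)
    best_cost = sad_at_offset(ref, cur, 0, 0, max_shift)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = sad_at_offset(ref, cur, dx, dy, max_shift)
            if cost < best_cost:
                best_cost = cost
                best = (dx, dy)
    return best


def shift_back(cur, dx, dy, fill=0):
    """Undo a global displacement so that cur lines up with the reference.

    Pixels that would come from outside the frame are set to fill
    (an assumption; a real system would need an edge policy).
    """
    h, w = len(cur), len(cur[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = cur[sy][sx]
    return out
```

Once the frame has been shifted back, the interior of the corrected frame matches the reference, so a difference-based encoder would see little or no change between the two frames. Sub-pixel wobble, rotation, and edge distortion are outside the scope of this sketch.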
Even though the film image for each frame is stationary with respect to its components, and changes otherwise from frame to frame may be minor or non-existent, prior art digitizing schemes can cause digital compression encoders to generate unnecessary bits in encoding such "walking" scenes because they are unable to distinguish the frame to frame wandering or walking of the whole image from complete scene changes. Digital conversion of such unsteady or wandering, scanned images can grossly increase the number of bits required to encode multiple frames of sequential images, even though the image has not changed significantly from frame to frame. It would be an advantage to provide a method and apparatus to reduce the creation of unnecessary encoded bits caused by walking film during digitizing.
Attempts have been made to account for image unsteadiness, or walking. Lingemann, in U.S. Pat. No. 4,994,918 describes a method and circuit for detection and correction of horizontal and vertical errors in image unsteadiness during television scanning of moving pictures. Lingemann discloses a tachometer roll engaging the perforations of the film. Electronic circuitry provides pulse generation and counting in combination with horizontal and vertical image steadiness signals to control the beginning and ending of the scanning period. This requires a tachometer engaged with the film as part of the equipment. This method and circuitry will not accommodate images from video sources derived from film previously converted to video, nor does it address video sources which themselves have unsteady images, e.g. hand held cameras.
In Weiss et al., U.S. Pat. No. 5,510,834 (Weiss), there is disclosed a method of estimation of global error motion vectors which represent unwanted global picture instabilities in a picture sequence in digital video signals. Weiss describes global motion as a combination of true global motion (such as pan and zoom) and unwanted global picture instabilities from, e.g. worn-out film sprocket holes, poorly performing telecine, film stretch, or unsteady camera shots. Unwanted global motion may be a single global error motion vector for each picture or may be a global error motion vector field for the entire picture.
Known methods for estimating global error motion vectors involve using known motion vector estimation techniques such as, e.g., pixel gradient recursion, phase correlation, and block matching (see, e.g., PCT/SE92/00219), combined with spatial and temporal processing of the estimated motion vectors.
Weiss uses a motion vector estimator with an adaptively variable measuring time distance, as well as spatial processing and temporal processing, in order to estimate a sequence of global motion vectors, from which a sequence of global error motion vectors is separated in order to allow stabilization of a picture sequence. Weiss uses a local motion vector estimator with a measuring time distance adapted to a maximum motion vector length and motion vector frequency, combined with spatial processing, and temporal processing. Weiss makes an estimate of a sequence of global error motion vectors based on an estimated sequence of the global motion vectors associated with pictures in the picture sequence.
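The notion of separating a sequence of global error motion vectors from a sequence of estimated global motion vectors can be illustrated with a simple sketch. The following is not the Weiss method; it assumes, purely for illustration, that wanted global motion (such as a steady pan) varies slowly from picture to picture and can be approximated by a moving average, so that the remainder is treated as unwanted instability. The function name split_global_motion and the window parameter are hypothetical.

```python
def split_global_motion(motion, window=5):
    """Split one component of per-picture global motion into two parts.

    motion: list of picture-to-picture global motion values (pixels),
            for example the horizontal component.
    Returns (smooth, error), where smooth is a centered moving average
    (assumed here to stand in for wanted motion such as a pan) and
    error = motion - smooth (assumed to stand in for unwanted wobble).
    """
    half = window // 2
    n = len(motion)
    smooth = []
    for i in range(n):
        # Shrink the averaging window at the ends of the sequence.
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smooth.append(sum(motion[lo:hi]) / (hi - lo))
    error = [m - s for m, s in zip(motion, smooth)]
    return smooth, error
```

Under this assumption a constant-velocity pan yields an error sequence of zeros, while an isolated jump in one picture appears largely in the error sequence. The choice of window length trades responsiveness to real camera moves against rejection of wobble, and no particular value is suggested by the source.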
The Weiss method addresses the difficulty that known motion vector estimators have in accurately estimating both large and small motion vectors, which arises because known motion vector estimators use a fixed measuring time distance. In the general case, fixing the time distance will be disadvantageous for either large or small motion vectors, depending on which time distance is chosen and on the motion vector frequency.
One disadvantage of the Weiss method is the necessity for an operator to place flags at selected points to establish the image portion that is judged to be stable. That is, active human judgement is required to set a basic image reference point or points from which motion corrections will be made.