The book Digital Video Processing, Prentice-Hall, 1995, by A. Murat Tekalp, along with the committee drafts ISO/IEC CD 11172 and ISO/IEC CD 13818-2, incorporated by reference above, provides a general overview of data-compression techniques which are consistent with MPEG device-independent compression standards. The book JPEG: Still Image Compression Standard, New York, N.Y.: Van Nostrand Reinhold, 1993 by W. B. Pennebaker and J. L. Mitchell, incorporated by reference above, gives a general overview of data-compression techniques which are consistent with JPEG device-independent compression standards. While JPEG is applicable to still images, MPEG is applicable to audio and video image sequence data.
MJPEG is a less formal standard used by several manufacturers of digital video equipment. In MJPEG, the moving picture is digitized into a sequence of still image frames, as in most representations of moving pictures, and each image frame in an image sequence is compressed using the JPEG standard. Therefore, a description of JPEG suffices to describe the operation of MJPEG. The details of the operation of the MJPEG and MPEG standards are included in the above-referenced books and committee drafts, but for present purposes, the essential properties of MJPEG and MPEG compression are as follows. Each image frame of an original image sequence which is to be transmitted from one hardware device to another, or which is to be retained in an electronic memory, is divided into a two-dimensional array of typically square blocks of pixels. The "transmitter" is understood to be apparatus or a computer program which reads the original image sequence and generates compressed data. The "receiver" is apparatus or a computer program which receives and decompresses the compressed data, reconstructing an approximation of the original image sequence therefrom. In one typical embodiment, the original image frame is divided into blocks, each block comprising 8×8=64 pixels from the original image frame. Each individual pixel in the original image frame, in turn, may express a gray scale value, which may be on a scale from, for example, 0 to 255 or 0 to 4095. There is thus derived from each block in an original image frame to be compressed and transmitted a matrix of 64 gray level values, each value relating to one pixel in an 8×8 matrix.
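By way of illustration only, the division of a frame into blocks described above may be sketched as follows. The sketch assumes that the frame is held as a list of rows of gray level values and that its dimensions are exact multiples of the block size; the name split_into_blocks is hypothetical and is not taken from the JPEG or MPEG documents.

```python
def split_into_blocks(frame, block_size=8):
    """Split a 2-D list of gray levels into block_size x block_size tiles.

    Returns a 2-D list (rows of blocks); each block is itself a list of rows.
    Assumption: the frame dimensions are exact multiples of block_size.
    """
    height = len(frame)
    width = len(frame[0])
    if height % block_size or width % block_size:
        raise ValueError("frame dimensions must be multiples of block_size")
    blocks = []
    for top in range(0, height, block_size):
        row_of_blocks = []
        for left in range(0, width, block_size):
            # Take block_size rows, and block_size pixels from each row.
            block = [row[left:left + block_size]
                     for row in frame[top:top + block_size]]
            row_of_blocks.append(block)
        blocks.append(row_of_blocks)
    return blocks


# Usage: a synthetic 16-row by 24-column frame yields a 2 x 3 array of blocks.
frame = [[(x + y) % 256 for x in range(24)] for y in range(16)]
blocks = split_into_blocks(frame)
```

Each entry blocks[r][c] is then the matrix of 64 gray level values for the block in block-row r and block-column c.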
Each matrix is then subjected to certain operations for compression. In MJPEG, each image frame is compressed using the standard JPEG compression technique. In JPEG, the first step is to perform a "discrete cosine transform," or DCT. In effect, the DCT changes the image space for the matrix, so that a vector related to the average luminance of all of the pixels in the block is made into an axis of the space. Following the DCT, the coefficients in the original matrix still completely describe the original image data, but larger value coefficients tend to cluster at the top left corner of the matrix, in a low spatial frequency region. Simultaneously, the coefficient values toward the lower right hand portion of the matrix will tend toward zero for most blocks in an image frame in an image sequence.
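The clustering of energy at the top left corner can be seen in a direct, unoptimized sketch of the two-dimensional DCT. The sketch uses the orthonormal DCT-II and omits the level shift and fast algorithms that practical encoders employ; the name dct_2d is hypothetical.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Normalization that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):          # vertical frequency index
        for v in range(n):      # horizontal frequency index
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# Usage: a perfectly smooth block has all of its energy in the top-left entry.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For the smooth block in the usage lines, the top-left coefficient equals eight times the mean gray level and every other coefficient is zero to within rounding error, which is the limiting case of the behavior described above.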
The top-left entry in each matrix, which represents the average luminance of all pixels in the matrix, is known in JPEG (and therefore in MJPEG) as the "DC coefficient" of the block, with all the other entries in the matrix being known in MJPEG as the "AC coefficients" of the block. In a preferred embodiment of MJPEG, the transmitted DC coefficient of each block is converted to a difference relative to the DC coefficient of the block to the left of the block in the original image frame; this makes the magnitude of each DC coefficient smaller in absolute terms.
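The differential treatment of the DC coefficient may be sketched as follows for one row of blocks. The sketch assumes that the predictor for the first block in the row is zero, which is a simplification, and the function names are hypothetical.

```python
def dc_differences(dc_values):
    """Replace each DC value by its difference from the block to its left.

    dc_values holds the DC coefficients of one row of blocks, left to right.
    Assumption: the predictor for the first block in the row is 0.
    """
    diffs = []
    previous = 0
    for dc in dc_values:
        diffs.append(dc - previous)
        previous = dc
    return diffs


def dc_reconstruct(diffs):
    """Invert dc_differences by accumulating the differences."""
    values = []
    previous = 0
    for d in diffs:
        previous += d
        values.append(previous)
    return values


# Usage: neighboring blocks with similar brightness give small differences.
row_dc = [800, 808, 804, 790]
diffs = dc_differences(row_dc)
restored = dc_reconstruct(diffs)
```

After the first block, the values to be encoded are 8, -4 and -14 rather than values near 800, and the receiver recovers the original DC coefficients by accumulation.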
Following the DCT step, individual coefficients in the matrix are quantized, or in effect made into smaller numbers, and rounded. Then, the quantized coefficients are Huffman-encoded to yield a string of binary digits, or bits. There may be other lossless compression steps to encode the quantized DCT coefficients, but the final product is a string of bits for each block, each block resulting in a string of bits of a different length.
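The quantization step may be sketched as a division of each coefficient by a corresponding step size followed by rounding. The step sizes and the 2×2 matrices below are illustrative values chosen for brevity, not entries from any standard table, and the function names are hypothetical.

```python
def quantize(coeffs, q_table):
    """Divide each coefficient by its step size and round to an integer."""
    return [[int(round(c / q)) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, q_table)]


def dequantize(levels, q_table):
    """Approximate the original coefficients from the quantized levels."""
    return [[level * q for level, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, q_table)]


# Usage with illustrative (non-standard) values.
coeffs = [[800.0, 35.0],
          [-22.0, 3.0]]
q_table = [[16, 11],
           [12, 14]]
levels = quantize(coeffs, q_table)
approx = dequantize(levels, q_table)
```

The quantized levels are smaller numbers than the coefficients, small coefficients become zero, and the reconstruction differs slightly from the original, which is the source of loss in this step.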
Under MJPEG compression, each block of each image frame of the original image sequence will result in a string of bits of unpredictable length. A block including more detail is generally more difficult to compress than a block which is relatively smooth. In this sense, active blocks, with more details or sharp edges, are generally encoded using a larger number of bits. On the other hand, smooth blocks generally demand few bits for their encoding. There is a non-trivial relation between the activity of a block and the number of bits used in the encoding, i.e., the compression achieved.
In MPEG compression, motion information obtained from nearby frames is used as a prediction tool to increase compression of the image frame data. The first step is to choose which "frame type" to use for each image frame. In order to increase compression and quality, three main frame types are defined. One frame type does not use any motion information, while the other two utilize motion compensation derived from neighboring frames to predict the image data blocks pertaining to an image frame. Intra-coded frames (I-frames) are encoded without reference to other frames. Predicted frames (P-frames) are encoded using motion compensated prediction from a past frame. Bi-directionally-predicted frames (B-frames) are encoded using motion compensated prediction from a past and/or a future frame. Having defined the frame types, each image frame is divided into a two-dimensional array of typically square blocks of pixels. In one typical embodiment, the frame is divided into blocks, each block comprising 8×8=64 pixels from the original frame. Each individual pixel in the frame, in turn, may express a gray scale value, which may be on a scale from, for example, 0 to 255 or 0 to 4095. Therefore, for each block in an image frame to be transmitted, a matrix of 64 gray level values is generated, each gray level value relating to one pixel in an 8×8 matrix. This matrix is then subjected to certain operations described next.
It is illustrative to separate the case where frames only contain intraframe encoded blocks from the case of frames which contain interframe encoded blocks. A block is referred to as being intraframe encoded if it is encoded by itself without being predicted by blocks pertaining to preceding or subsequent frames. A block is referred to as being interframe encoded when it does use motion information collected from blocks pertaining to any preceding or subsequent frames. Frames that only contain intraframe encoded blocks are the frames in MJPEG, the I-frames in MPEG and occasionally B- or P-frames that do not contain motion compensation predicted blocks. Frequently, B- and P-frames contain interframe encoded blocks.
If the frame is an I-frame, each original block in the frame is transformed using the DCT without any motion compensation, in a method similar to the method used in MJPEG. Following the DCT, the coefficients in the original matrix still completely describe the original block data, but larger value coefficients tend to cluster at the top left corner of the matrix, in a low spatial frequency region. Simultaneously, the coefficient values toward the lower right hand portion of the matrix will tend toward zero for most original and residual blocks in a frame in an image sequence. Following the DCT step, individual coefficients in the matrix and motion vector are quantized, or in effect made into smaller numbers, and rounded. Then, the quantized coefficients are encoded to yield a string of binary digits, or bits. The encoding method typically comprises a combination of run-length counting and Huffman codes. There may be other compression steps to encode the quantized DCT coefficients, but the final product is a string of bits for each block, each block resulting in a string of bits of a different length.
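The run-length counting mentioned above may be sketched as follows. The sketch scans the quantized matrix in zigzag order, skips the DC coefficient, and emits (run of zeros, non-zero value) pairs; the subsequent Huffman coding and the maximum run length imposed by the standards are omitted. The function names and the use of the pair (0, 0) as an end-of-block marker are assumptions of the sketch.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n matrix in zigzag order."""
    order = []
    for s in range(2 * n - 1):               # s = row + column of a diagonal
        diagonal = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diagonal.reverse()               # even diagonals run upward
        order.extend(diagonal)
    return order


def run_length_pairs(levels):
    """Encode the AC levels of a block as (zero_run, value) pairs."""
    scan = [levels[i][j] for i, j in zigzag_order(len(levels))]
    pairs = []
    run = 0
    for value in scan[1:]:                   # scan[0] is the DC coefficient
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append((0, 0))                     # assumed end-of-block marker
    return pairs


# Usage on an illustrative 4 x 4 matrix of quantized levels.
levels = [[50, 3, 0, 0],
          [-2, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0]]
pairs = run_length_pairs(levels)
```

Because most high-frequency levels are zero after quantization, the fifteen AC entries of the example reduce to three pairs and a marker; blocks with more non-zero levels yield more pairs and therefore longer strings of bits.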
If the frame is either a P- or a B-frame, for each block in the present original frame, a search for a matching block in a frame in the past (forward motion prediction in a P-frame) or a frame in the past or future (bi-directional motion prediction in a B-frame) is performed. Once a matching block is found, a vector indicating the magnitude and direction of the motion is formed as a string of bits. The motion vector indicates how to find the matching block within the neighboring frame. This method of processing is called motion compensation. The matching block is used as a predictor for the actual block. The difference between the original and matching blocks is called a residual. The residual error is encoded using specific coding techniques resulting in a string of bits. If a good match is found, the strings of bits corresponding to the motion vector and residual error are transmitted to the receiver. The residual block is transformed, quantized and encoded into a string of bits using the same method used for blocks in an I-frame. There may be other compression steps to encode the residual quantized DCT coefficients and motion vectors, but the final product is a string of bits for each block, each block resulting in a string of bits of a different length. If no good match is found for the present block among blocks in past and/or future frames, the block is encoded in the same way as blocks in I-frames. In effect, this is equivalent to predicting the block from a block of all-zero pixels, so that the residual is the original block itself.
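The search for a matching block may be sketched as an exhaustive comparison over a small window of a single reference frame. The matching criterion used here, the sum of absolute differences, and the search range of four pixels are assumptions of the sketch; the standards do not prescribe how the encoder conducts the search. All function names are hypothetical.

```python
def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left pixel is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def find_motion_vector(current, reference, top, left, size=8, search=4):
    """Return ((dy, dx), residual) for the best match in the reference frame."""
    height, width = len(reference), len(reference[0])
    target = get_block(current, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue                      # candidate falls outside frame
            candidate = get_block(reference, y, x, size)
            cost = sad(target, candidate)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx), candidate)
    _, vector, match = best
    residual = [[t - m for t, m in zip(t_row, m_row)]
                for t_row, m_row in zip(target, match)]
    return vector, residual


# Usage: the current frame is the reference shifted two pixels to the right.
reference = [[(37 * y + 11 * x * x) % 251 for x in range(24)]
             for y in range(24)]
current = [[reference[y][x - 2] if x >= 2 else 0 for x in range(24)]
           for y in range(24)]
vector, residual = find_motion_vector(current, reference, top=8, left=8)
```

In the usage lines the content has moved two pixels to the right, so the matching block lies two pixels to the left in the reference frame and the residual is entirely zero; a block with no good match would instead leave a large residual and require more bits.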
In MPEG, blocks in P- and B-frames are classified individually. Classifications are encoded into a string of bits and transmitted to the receiver. The block classifications convey information to the receiver about the block matching procedure. Details can be found in the drafts ISO/IEC CD 11172 and ISO/IEC CD 13818-2, incorporated by reference above. For the present, it is important to emphasize that three types of information are sent to the receiver encoded as strings of binary digits, or bits: motion vectors, residual blocks and classifications.
Under MPEG compression, each block of each frame of the original image sequence will result in a string of bits of unpredictable length. A block with more details and with no match in the past or future frames is generally more difficult to compress than a smooth block or a block which has a match in the past or future frames. In this sense, active blocks are generally encoded using a larger number of bits. On the other hand, non-active blocks generally demand few bits for their encoding. There is a non-trivial relation between the activity of a block and the number of bits used in the encoding, i.e., the compression achieved.
It is understood that a "block" may correspond to a single tile of an image frame in an image sequence or to any predefined region of an image frame in an image sequence, encompassing multiple colors, or any predefined regions of multiple frames in an image sequence. In the preferred embodiment of the compression application (MJPEG/MPEG), one or a plurality of blocks of each color separation can be grouped to form larger structures known to those skilled in the art of MJPEG as MCUs (minimum coded units) or, to those skilled in the art of MPEG, as macroblocks. According to the present invention, it is understood that a block may represent one or multiple blocks, or one or multiple MCUs or macroblocks.
As is well known, in a digital printing apparatus, the data associated with black text in an original image will typically require a high contrast printing technique when the image is printed. For example, a halftone or contone image will be optimally printed with slightly different printing techniques, irrespective of the general printing technology (for example, xerographic or ink-jet). It is therefore desirable, when an image is received, to be able to identify specific portions of the image as text, graphics, or pictures. In most cases, the image is available in compressed format. The identification of the said regions may involve decompressing the image and applying a segmentation algorithm. It is desirable to utilize the compressed data directly to ascertain which printing techniques will be applied to each image region.
Similarly, in a digital video processing apparatus, the data associated with smooth motion in an original image sequence will typically require a different processing technique than the data associated with sudden motion. It is therefore desirable, when an image sequence is received, to be able to identify specific portions of the image sequence with particular characteristics. In most cases, the image sequence is available in compressed format. The identifications of the said portions may involve decompressing the image sequence and applying a segmentation algorithm. It is desirable to utilize the compressed data directly to ascertain which processing techniques will be applied to each image sequence portion.
The above-referenced co-pending applications disclose the creation of an "encoding cost map" derived from compression, and the use of the encoding cost map for segmenting the original image. The encoding cost map is assembled from data derived from compression of original image data, where the image data is composed of either a single image or a sequence of images. The encoding cost map can be derived from the compressed data directly, or it can be stored along with the compressed data as shown in the co-pending application. In the encoding cost map, the longer the resulting string of bits from compression of a particular block, the higher the cost function for that block. When assembled in a two-dimensional map, these cost functions form the encoding cost map. Similarly, encoding cost maps can be formed as one-dimensional vectors in which cost functions are associated with temporal instances of data derived from compression of original image sequence data. It is an object of the present invention to use the encoding cost map for segmenting the original image sequence into groups of frames with particular characteristics or to segment individual frames into regions with particular characteristics.
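The assembly of an encoding cost map may be sketched as follows. The sketch assumes that the cost function of a block is simply the number of bits produced for that block, and that the bit counts are available in raster order; the thresholding shown at the end is only a hypothetical illustration of how such a map might separate active from smooth regions and is not the segmentation method of the co-pending applications. All function names are hypothetical.

```python
def encoding_cost_map(block_bit_lengths, blocks_per_row):
    """Arrange per-block bit counts (raster order) into a 2-D cost map."""
    if len(block_bit_lengths) % blocks_per_row:
        raise ValueError("bit counts do not fill whole rows of blocks")
    return [block_bit_lengths[i:i + blocks_per_row]
            for i in range(0, len(block_bit_lengths), blocks_per_row)]


def frame_cost(cost_map):
    """Total cost of a frame: one entry of a 1-D temporal cost vector."""
    return sum(sum(row) for row in cost_map)


def threshold_map(cost_map, threshold):
    """Label each block 1 if its cost exceeds the threshold, else 0."""
    return [[1 if cost > threshold else 0 for cost in row]
            for row in cost_map]


# Usage with illustrative bit counts for a frame of 2 x 4 blocks.
bits = [12, 15, 90, 110,
        14, 11, 95, 120]
cost_map = encoding_cost_map(bits, blocks_per_row=4)
total = frame_cost(cost_map)
labels = threshold_map(cost_map, threshold=50)
```

In the example, the two right-hand columns of blocks carry most of the bits and are labeled as active, while the per-frame total provides one sample of a one-dimensional cost vector along the time axis.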
U.S. Pat. No. 5,635,982 discloses a system for temporal segmentation of image sequences into individual camera shots in the uncompressed domain. The system detects the individual shots by analyzing the temporal variation of video content.
"Digital Video Processing", by Tekalp, describes common techniques for compression of digital image sequences at pages 419 through 499.
ISO/IEC CD 11172 and ISO/IEC CD 13818-2 describe in detail the functionality and normalize the operations of an MPEG encoding apparatus.
"Scene Change Detection in a MPEG Compressed Video Sequence," by J. Meng, Y. Juan and S.-Fu discloses a scene change detection technique for compressed MPEG bitstream with minimal decoding of the bitstream. Minimal decoding refers to decoding of the bit stream just enough to obtain motion vectors and the DCT DCs.
"Video and Image Processing Systems", by Furht, Simoliar and Zhang gives an overview of the techniques used in both compressed and uncompressed domain scene cut detection.
"Rapid Scene Analysis on Compressed Video," by B.-L. Yeo and B. Liu discloses a scene change detection technique that operates on the DC sequence which can be extracted from MJPEG or MPEG compressed video.
A basic text which describes JPEG (from which MJPEG is derived) and associated techniques is W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Compression Standard, New York, N.Y.: Van Nostrand Reinhold, 1993.