The present invention relates generally to digital signal processing and more particularly to processing for the detection of motion in digitized video image data.
Increased computer processing is required to provide for modern digital services. As an example, the Internet has spawned a plethora of multimedia applications for presenting images and playing video and audio content. These applications involve the manipulation of complex data in the form of still graphic images and full motion video. It is commonly accepted that digitized images consume prodigious amounts of storage. For example, a single relatively modest-sized image having 480×640 pixels and a full-color resolution of 24 bits per pixel (three 8-bit bytes per pixel) occupies nearly a megabyte of data. At a resolution of 1024×768 pixels, a 24-bit color image requires 2.3 MB of memory to represent. A 24-bit color picture of an 8.5 inch by 11 inch page, at 300 dots per inch, requires as much as 25 MB of storage. Video images are even more data-intensive, since it is generally accepted that for high-quality consumer applications, images must occur at a rate of at least 30 frames per second. Current proposals for high-definition television (HDTV) call for as many as 1920×1035 or more pixels per frame, which translates to a data transmission rate of about 1.5 billion bits per second. Other advances in digital imaging and multimedia applications such as video teleconferencing and home entertainment systems have created an even greater demand for higher bandwidth and consequently ever greater processing capability.
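The storage and bandwidth figures above follow from simple arithmetic. The sketch below reproduces them, assuming uncompressed storage, 24 bits per pixel, and 30 frames per second for the video case; the function names are illustrative only.

```python
def raw_image_bytes(width, height, bits_per_pixel=24):
    """Uncompressed size, in bytes, of a width x height image."""
    return width * height * bits_per_pixel // 8


def raw_video_bits_per_second(width, height, bits_per_pixel=24,
                              frames_per_second=30):
    """Uncompressed bit rate of a video stream."""
    return width * height * bits_per_pixel * frames_per_second


print(raw_image_bytes(640, 480))              # 921600, just under 1 MB
print(raw_image_bytes(1024, 768))             # 2359296, about 2.3 MB
print(raw_video_bits_per_second(1920, 1035))  # 1430784000, about 1.5 Gbit/s
```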
Traditional lossless techniques for compressing digital image and video information include methods such as Huffman encoding, run length encoding and the Lempel-Ziv-Welch algorithm. These approaches, though advantageous in preserving image quality, are otherwise inadequate to meet the demands of high throughput systems. For this reason, compression techniques which typically involve some loss of information have been devised.
The Karhunen-Loeve (KL) transform is usually identified as the optimal transform for decorrelating the data in the transform domain and packing the maximum energy into a given number of samples. However, there are two generally recognized problems with the KL transform. First, the KL transform is unique for only one class of signals. Second, a fast KL transform algorithm is not known. Accordingly, alternative mathematical transforms have been investigated. The discrete cosine transform (DCT) is generally used for transform domain coding of video images because a fast discrete cosine transform algorithm exists and the cosine transform has been shown to be virtually identical to the KL transform for numerous practical conditions.
In the traditional discrete cosine transform compression methods, the video frame is divided into a series of non-overlapping blocks. Typically, a block is sixteen pixels wide and sixteen pixels high. The discrete cosine transform of a two dimensional block is implemented by transforming the digital data for the pixels in a first direction and then transforming in the second direction. The resulting cosine transform coefficients include a single term which represents the average signal energy in the block, sometimes referred to as the DC term, and a series of terms, sometimes referred to as the AC terms, which represent the variation of the signal energy about the DC component for the block.
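The row-then-column transform described above can be sketched as follows. This is a minimal illustration, assuming the orthonormal DCT-II scaling and a direct (not fast) evaluation of each one-dimensional transform; the function names are hypothetical.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) evaluation)."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """2-D DCT of a square block: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]          # first direction
    cols = [dct_1d(col) for col in zip(*rows)]     # second direction
    return [list(row) for row in zip(*cols)]       # transpose back


# A flat block has no variation, so only the DC term is non-zero.
flat = [[10] * 8 for _ in range(8)]
coefficients = dct_2d(flat)
print(round(coefficients[0][0], 6))  # 80.0, i.e. 8 times the block mean
```

With this scaling the DC term of an N×N block equals N times the mean pixel value, and every AC term of a flat block is zero up to floating-point error.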
A quantizer is used to reduce the range of the cosine transform coefficients. A quantizer is a mapping from the continuous variable domain of transform coefficients into the domain of integers. Commonly used is the uniform quantizer, which may be specified by a single number. Each transform coefficient is divided by this number and the resulting quotient is rounded to the nearest integer. The quantized cosine transform coefficients are then encoded for transmission over a data channel.
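A uniform quantizer of the kind described can be sketched in a few lines. The step size of 8 and the round-half-up handling of ties are assumptions made for illustration; the text specifies only division followed by rounding to the nearest integer.

```python
import math


def quantize(coefficients, step):
    """Divide each coefficient by step and round to the nearest integer.

    Ties are rounded up (floor(x + 0.5)); this tie rule is an assumption.
    """
    return [[math.floor(c / step + 0.5) for c in row] for row in coefficients]


def dequantize(levels, step):
    """Approximate reconstruction: multiply each integer level by step."""
    return [[q * step for q in row] for row in levels]


levels = quantize([[80.0, 13.0, -5.0, 2.0]], 8)
print(levels)                  # [[10, 2, -1, 0]]
print(dequantize(levels, 8))   # [[80, 16, -8, 0]]
```

The difference between the original and reconstructed values (13.0 versus 16, for instance) is the information lost to quantization.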
The Joint Photographic Experts Group (JPEG) has created a standard for still image compression, known as the JPEG standard. This standard defines an algorithm based on the DCT. An encoder using the JPEG algorithm processes an image in four steps: linear transformation, quantization, run-length encoding (RLE), and Huffman coding. The decoder reverses these steps to reconstruct the image. For the linear transformation step, the image is divided up into blocks of 8×8 pixels and a DCT operation is applied in both spatial dimensions for each block. The purpose of dividing the image into blocks is to overcome a deficiency of the DCT algorithm, which is that the DCT is highly non-local. Dividing the image confines this non-locality to small regions, each of which is transformed separately. However, this compromise has the disadvantage of producing a tiled appearance, which manifests itself visually as blockiness.
The quantization step is essential to reduce the amount of information to be transmitted, though it does cause loss of image information. Each transform component is quantized using a value selected from its position in each 8×8 block. This step has the convenient side effect of reducing the abundant small values to zero or other small numbers, which can require much less information to specify.
The run-length encoding step codes runs of same values, such as zeros, to produce codes which identify the number of times to repeat a value and the value to repeat. A single code like “8 zeros” requires less space to represent than a string of eight zeros, for example. This step is justified by the abundance of zeros that usually results from the quantization step.
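The run-length idea can be illustrated with a generic encoder that emits (count, value) pairs. This is only a sketch of the principle described above; the JPEG standard itself uses a more specialized coding of zero runs, which is not reproduced here.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out


quantized = [10, 2, -1, 0, 0, 0, 0, 0, 0, 0, 0]
encoded = run_length_encode(quantized)
print(encoded)  # [(1, 10), (1, 2), (1, -1), (8, 0)]
```

The eight trailing zeros collapse into the single pair (8, 0), which corresponds to the “8 zeros” code mentioned in the text.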
Huffman coding (a popular form of entropy coding) translates each symbol from the run-length encoding step into a variable-length bit string that is chosen depending on how frequently the symbol occurs. That is, frequent symbols are coded with shorter codes than infrequent symbols. The coding can be done either from a preset table or one composed specifically for the image to minimize the total number of bits needed.
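A Huffman code composed specifically for a given symbol sequence can be built with a priority queue, as sketched below. The function name and the tie-breaking order are illustrative assumptions; a preset table, which the text also mentions, would skip this construction entirely.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bit string} built from the symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:                     # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


symbols = ["z"] * 8 + ["a"] * 2 + ["b", "c"]
code = huffman_code(symbols)
total_bits = sum(len(code[s]) for s in symbols)
print(total_bits)  # 18, versus 24 for a fixed 2-bit code
```

The most frequent symbol receives a one-bit code and the two rarest symbols receive three-bit codes, so frequent symbols cost fewer bits, as the paragraph describes.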
Similarly to JPEG, the Moving Picture Experts Group (MPEG) has promulgated two standards for coding image sequences. The standards are known as MPEG I and MPEG II. The MPEG algorithms exploit the common occurrence of relatively small variations from frame to frame. In the MPEG standards, a full image is compressed and transmitted only once for every 12 frames. These “reference” frames (so-called “I-frames” for intra-frames) are typically compressed using JPEG compression. For the intermediate frames, a predicted frame (P-frame) is calculated and only the difference between the actual frame and each predicted frame is compressed and transmitted.
Any of several algorithms can be used to calculate a predicted frame. The algorithm is chosen on a block-by-block basis depending on which predictor algorithm works best for the particular block. One technique called “motion estimation” is used to reduce temporal redundancy. Temporal redundancy is observed in a movie where large portions of an image remain unchanged from frame to adjacent frame. In many situations, such as a camera pan, every pixel in an image will change from frame to frame, but nearly every pixel can be found in a previous image. The process of “finding” copies of pixels in previous (and future) frames is called motion estimation. Video compression standards such as H.261 and MPEG 1 and 2 allow the image encoder (image compression engine) to remove redundancy by specifying the motion of 16×16 pixel blocks within an image. The image being compressed is broken into blocks of 16×16 pixels. For each block in an image, a search is carried out to find matching blocks in other images that are in the sequence being compressed. Two measures are typically used to determine the match. One is the sum of absolute differences (SAD), which is mathematically written as
$$\sum_{i} \sum_{j} \left| a_{i} - b_{j} \right| ,$$
and the other is the sum of differences squared (SDS), which is mathematically written as
$$\sum_{i} \sum_{j} \left( a_{i} - b_{j} \right)^{2} .$$
The SAD measure is easy to implement in hardware. However, though the SDS operation requires greater precision to generate, the result is generally accepted to be of superior quality.
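The two matching measures and the block search can be sketched as follows. The sketch assumes that the sums run over co-located pixel values of the two blocks being compared, and it uses an exhaustive search over a small window; the function names, the search radius, and the tiny test frames are all hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between co-located pixels."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sds(block_a, block_b):
    """Sum of differences squared between co-located pixels."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current, reference, top, left, size, radius, cost=sad):
    """Search the reference frame around (top, left); return (dy, dx, cost)."""
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue                      # candidate falls off the frame
            c = cost(target, get_block(reference, y, x, size))
            if best is None or c < best[2]:
                best = (dy, dx, c)
    return best


# A bright 2x2 patch moves one row down and two columns right.
reference = [[0] * 8 for _ in range(8)]
current = [[0] * 8 for _ in range(8)]
for r in range(2):
    for c in range(2):
        reference[2 + r][3 + c] = 100
        current[3 + r][5 + c] = 100
print(best_match(current, reference, top=3, left=5, size=2, radius=2))
# (-1, -2, 0): the block is found one row up and two columns left
```

Passing `cost=sds` selects the squared measure instead; because the squares grow faster than the absolute values, it needs a wider accumulator, which is consistent with the precision remark above.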
For real time, high-quality video image decompression, the decompression algorithm must be simple enough to be able to produce 30 frames of decompressed images per second. The speed requirement for compression is often not as extreme as for decompression, since in many situations, images are compressed offline. Even then, however, compression time must be reasonable to be commercially viable. In addition, many applications require real time compression as well as decompression, such as real time transmission of live events; e.g., video teleconferencing.
Applications exist in which pictures need to be taken only on occasion. For example, a security system may have various cameras deployed about a site. Receiving a constant stream of images from each camera is not practical, as the capacity of the storage will be limited, whether in the form of analog video tape or digitized images stored on disk drives. Rather, it would be preferable to acquire images only when motion is detected in the image. These situations tend not to be frequent, and the duration of the motion tends to be short. Motion detection could be used to trigger the act of acquiring images.
Another use is on the Internet, where people host web pages which use a camera to provide a view of their living room, for example, for the world to see. Typically, it is desired to update the web site with an image only when the scene has changed. Motion detection capability would be useful here.
The foregoing JPEG and MPEG techniques are ideal for the image acquisition half of these applications. However, they are not well suited for detecting motion elements in an image. Motion detection involves an analysis of the image content. The JPEG definition is concerned only with compression of an image, not the content of an image. Though MPEG processing involves the use of a motion estimator operation to find displaced pixels to calculate P frames, the technique is computationally intensive. More importantly, the technique does not provide a true indication of motion. For example, pixel displacement among successive frames can arise as a result of changing light patterns due to a light source being turned off, or turned on.
What is needed is a scheme for detecting motion in a series of digitized images. It is desirable to provide a scheme which can quickly detect motion in a video scene. There is a need for a system which can provide image acquisition capability where the image acquisition is triggered by the detection of motion.
A method for detecting motion in digitized images according to the invention includes providing first and second arrays of pixels, representing previous and current video images respectively. A set of previous discrete cosine transform (DCT) blocks is produced, based on a portion of the first array of pixels. Likewise, a set of current DCT blocks is produced based on a portion of said second array of pixels. Each of the current DCT blocks is compared to its corresponding previous DCT block to determine whether it should be marked as MODIFIED. The comparison is based on portions of the DCT blocks. Because only portions of the DCT blocks are compared, the algorithm reduces the computational burden of motion detection involving DCT blocks.
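The block-marking step can be sketched as below. The text does not state which portion of each DCT block is compared or what test is applied, so the choice of the low-frequency top-left corner, the corner size, and the tolerance are all assumptions made purely for illustration.

```python
def block_is_modified(previous_block, current_block, corner=2, tolerance=4):
    """Compare only the top-left corner x corner coefficients.

    Returns True (MODIFIED) when any compared coefficient differs by more
    than tolerance. Both corner and tolerance are hypothetical parameters.
    """
    for u in range(corner):
        for v in range(corner):
            if abs(current_block[u][v] - previous_block[u][v]) > tolerance:
                return True
    return False


def mark_modified(previous_blocks, current_blocks, corner=2, tolerance=4):
    """Return one MODIFIED flag per pair of corresponding DCT blocks."""
    return [block_is_modified(p, c, corner, tolerance)
            for p, c in zip(previous_blocks, current_blocks)]


previous = [[80, 1, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
high_freq_change = [row[:] for row in previous]
high_freq_change[3][3] = 50      # outside the compared corner
dc_change = [row[:] for row in previous]
dc_change[0][0] = 120            # inside the compared corner

print(mark_modified([previous, previous], [high_freq_change, dc_change]))
# [False, True]
```

Only four of the sixteen coefficients in each block are examined here, which is where the saving in computation comes from.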
A motion detection system comporting with the invention includes a computing device and a source of digital image data coupled to the computing device. The image data corresponds to images of a scene in which the motion is to be detected. The computing device includes a computer program which comprises program code to produce an array of luminance data based on a portion of the pixel array comprising an image. The program further includes code to produce a set of discrete cosine transform (DCT) data blocks based on the luminance data. First and second DCT blocks are thereby produced for previous and current images respectively. There is program code to compare corresponding data between two of the DCT blocks, wherein only some of said data between the two DCT blocks is compared and a counter is incremented when the comparison satisfies a first criterion. This code is executed for pairs of previous and current DCT data blocks. There is program code to indicate the occurrence of motion in the first and second images when the counter exceeds a first threshold value.
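The luminance extraction and the counter-and-threshold decision can be sketched as follows. The luminance weights are the common ITU-R BT.601 values, used here as an assumption since the text does not name a formula; the compared coefficient positions, the per-coefficient tolerance standing in for the first criterion, and the threshold value are likewise hypothetical.

```python
def luminance_plane(rgb_rows):
    """Convert rows of (R, G, B) tuples to luminance (BT.601 weights assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_rows]


def motion_detected(previous_blocks, current_blocks, positions,
                    tolerance, threshold):
    """Count block pairs that differ at the chosen positions.

    positions lists the (u, v) coefficients to compare, so only some of
    the data in each block is examined. Motion is indicated when the
    counter exceeds threshold.
    """
    counter = 0
    for prev, cur in zip(previous_blocks, current_blocks):
        if any(abs(cur[u][v] - prev[u][v]) > tolerance for u, v in positions):
            counter += 1
    return counter > threshold


print(luminance_plane([[(255, 0, 0), (0, 0, 0)]]))  # red and black pixels

still = [[80, 0], [0, 0]]
moved = [[120, 0], [0, 0]]
positions = [(0, 0), (0, 1), (1, 0)]   # hypothetical low-frequency subset
print(motion_detected([still] * 4, [still, moved, moved, still],
                      positions, tolerance=4, threshold=1))   # True
print(motion_detected([still] * 4, [still, moved, still, still],
                      positions, tolerance=4, threshold=1))   # False
```

Requiring the counter to exceed a threshold, rather than reacting to a single changed block, makes the decision less sensitive to an isolated block that changes because of noise.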