1. Field of the Invention
The invention relates to a method and an apparatus to detect high level white noise in a sequence of video frames to be encoded.
2. Background Art
In digital video or video-audio systems such as video-telephone, teleconference and digital television systems, a large amount of digital data is needed to define each video frame signal. As the available frequency bandwidth of a conventional transmission line is limited, it is necessary to reduce and compress the volume of data, in order to transmit the data through the channel.
Several methods and techniques for reducing and compressing the amount of data are known in the state of the art. Each of these techniques aims to provide the best possible image quality while reducing and compressing the amount of digital data.
One of these techniques for encoding video signals in a low bit-rate encoding system is an object-oriented analysis-synthesis coding technique, wherein an input video image is divided into objects, each described by three sets of parameters. One parameter set defines the pixel data of each object, one defines its contours and one defines the motion of each object between the images. The parameter sets are processed through different encoding channels.
One example of such an object-oriented scheme is the so-called MPEG (Moving Picture Experts Group) phase 4 (MPEG-4), which is designed to provide an audio-visual coding standard for allowing content-based interactivity, improved coding efficiency and/or universal accessibility in such applications as low-bit rate communications, interactive multimedia (e.g. games, interactive TV and the like) and surveillance (see, for instance, MPEG-4 Video Verification Model Version 2.0, International Organization for Standardization, ISO/IEC JTC/SC29/WG11 N1260, March 1996).
According to MPEG-4, an input video image is divided into a plurality of video object planes (VOPs), which correspond to entities in a bit stream that a user can access and manipulate. A VOP can be referred to as an object and can be represented by a bounding rectangle surrounding the object, whose width and height may be chosen to be the smallest multiples of 16 pixels (a macroblock size) that enclose it, so that the encoder processes the input video image on a VOP-by-VOP basis, i.e., an object-by-object basis. The VOP includes color information consisting of the luminance component (Y) and the chrominance components (Cr, Cb), and contour information represented by, e.g., a binary mask.
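The sizing rule for the bounding rectangle can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the MPEG-4 Verification Model; the only facts assumed from the text are the macroblock size of 16 pixels and the "smallest multiple" rule.

```python
MACROBLOCK = 16  # macroblock size in pixels, as stated in the text


def round_up_to_multiple(value: int, multiple: int = MACROBLOCK) -> int:
    """Return the smallest multiple of `multiple` that is >= value."""
    if value <= 0:
        raise ValueError("value must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def vop_bounding_size(obj_width: int, obj_height: int) -> tuple[int, int]:
    """Hypothetical helper: width and height of the bounding rectangle
    for an object whose tight extent is obj_width x obj_height pixels."""
    return round_up_to_multiple(obj_width), round_up_to_multiple(obj_height)
```

For example, an object with a tight extent of 37×16 pixels would receive a 48×16 bounding rectangle under this rule.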
Also, among various video compression techniques, the so-called hybrid coding technique is known, which combines temporal and spatial compression techniques together with a statistical coding technique.
Most hybrid coding techniques employ a motion compensated DPCM (Differential Pulse Code Modulation), two-dimensional DCT (Discrete Cosine Transform), quantization of DCT coefficients, and VLC (Variable Length Coding). The motion compensated DPCM is a process of estimating the movement of an object between a current frame and its previous frame, and predicting the current frame according to the motion flow of the object to produce a differential signal representing the difference between the current frame and its prediction.
Specifically, in the motion compensated DPCM, current frame data is predicted from the corresponding previous frame data based on an estimation of the motion between the current and the previous frames. Such estimated motion may be described in terms of two-dimensional motion vectors representing the displacements of pixels between the previous and the current frames.
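The prediction and differential signal described above can be sketched for a single block as follows. This is an illustration under stated assumptions, not the method of any particular codec: frames are plain lists of rows of integer samples, the function names are hypothetical, and the motion vector `(dy, dx)` is assumed to point from the block position in the current frame to the matching block in the previous frame.

```python
def predict_block(prev_frame, top, left, mv, size):
    """Fetch the block of the previous frame displaced by the motion
    vector mv = (dy, dx) from the position (top, left)."""
    dy, dx = mv
    return [
        [prev_frame[top + dy + r][left + dx + c] for c in range(size)]
        for r in range(size)
    ]


def residual_block(cur_frame, prev_frame, top, left, mv, size):
    """Differential signal: current block minus its motion compensated
    prediction. This is what would be passed on to the DCT stage."""
    pred = predict_block(prev_frame, top, left, mv, size)
    return [
        [cur_frame[top + r][left + c] - pred[r][c] for c in range(size)]
        for r in range(size)
    ]
```

When the motion vector matches the true displacement of the content, the residual is close to zero everywhere, which is what makes the differential signal cheap to encode.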
There are two basic approaches to estimating the displacements of pixels in an object: block-by-block estimation and pixel-by-pixel estimation.
In the pixel-by-pixel approach the displacement is determined for each and every pixel. This technique allows a more exact estimation of the pixel value and can easily handle non-translational movements, e.g., scale changes and rotations of the object. However, since a motion vector is estimated for each and every pixel, the pixel-by-pixel approach produces a huge number of motion vectors, and it is virtually impossible to transmit all of them to a receiver. In addition, these vectors must be processed at the receiving end when calculating the next frame or picture, which places a heavy load on the processor of the receiving system.
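A small calculation gives a feel for the difference in the number of motion vectors per frame. The helper name is hypothetical, and the 352×288 frame size used in the example is an assumption chosen for illustration; it does not appear in the text above.

```python
def motion_vector_counts(width: int, height: int, block_size: int = 16):
    """Return (vectors per frame with one vector per pixel,
    vectors per frame with one vector per block)."""
    if width % block_size or height % block_size:
        raise ValueError("frame dimensions must be multiples of block_size")
    per_pixel = width * height
    per_block = (width // block_size) * (height // block_size)
    return per_pixel, per_block
```

For an assumed 352×288 frame this yields 101376 vectors with one vector per pixel and 396 vectors with one vector per 16×16 block, a factor of 256.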
Using the block-by-block motion estimation, on the other hand, a current frame is divided into a plurality of search blocks, a search block being a block of, for instance, 16×16 adjacent pixels. To determine a motion vector for a search block in the current frame, a similarity calculation is performed between the search block in the current frame and each of a plurality of equal-sized reference blocks included in a generally larger search region within the previous frame.
An error function such as the mean absolute error or mean square error is used to carry out a similarity measurement between the search block in the current frame and the respective reference blocks in the search region of the previous frame. The motion vector, by definition, represents the displacement between the search block and the reference block that yields the minimum value of the error function. A method wherein a motion vector is determined using a current macroblock (MB) (16×16 pixels) and at least one preceding frame as reference is referred to as intermode encoding. Intermode encoding first removes temporal redundancy by subtracting the best-match reference information from the current MB information, and then removes any remaining spatial redundancy by means of the DCT.
As a search region, for example, a relatively large fixed-sized region around the search block might be used (the search block being in the center of the search region).
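Block matching over a fixed search region centered on the search block can be sketched as an exhaustive search that minimizes the mean absolute error. This is a minimal illustration, not an optimized or authoritative implementation: the function names, the `(dy, dx)` vector convention and the default `search_range` of 7 pixels are assumptions, and frames are plain lists of rows of integer samples.

```python
def mean_absolute_error(cur, prev, top, left, dy, dx, size):
    """Mean absolute error between the search block at (top, left) in the
    current frame and the reference block displaced by (dy, dx) in the
    previous frame."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - prev[top + dy + r][left + dx + c])
    return total / (size * size)


def full_search(cur, prev, top, left, size=16, search_range=7):
    """Test every displacement within +/- search_range pixels and return
    (motion_vector, error) for the reference block with the minimum error."""
    height, width = len(prev), len(prev[0])
    best_mv, best_err = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip reference blocks that fall outside the previous frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            err = mean_absolute_error(cur, prev, top, left, dy, dx, size)
            if err < best_err:  # strict '<' keeps the first minimum found
                best_mv, best_err = (dy, dx), err
    return best_mv, best_err
```

The cost grows with the square of `search_range`, which is one reason a smaller search region around a predicted vector can be attractive.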
Another option is to make a preliminary prediction of the motion vector for a search block on the basis of one or several final motion vectors already determined for surrounding search blocks, and to use as a search region, for example, a relatively small region centered on the tip of the preliminarily predicted motion vector. By contrast, a method that uses only the current MB's information for coding the MB, i.e. without a reference and a motion vector, is referred to as intramode encoding (intramode encoding removes only the MB's spatial redundancy, by means of the DCT).
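The predicted-vector search region can be sketched as follows. The text only says that the prediction is based on one or several neighbouring motion vectors; the component-wise median used here is an assumption chosen for illustration, and the function names and the `radius` parameter are hypothetical.

```python
from statistics import median_low


def predict_mv(neighbour_mvs):
    """Preliminary motion vector prediction: component-wise median of the
    already determined neighbouring vectors (assumed predictor).
    Falls back to the zero vector when no neighbour is available."""
    if not neighbour_mvs:
        return (0, 0)
    return (
        median_low([mv[0] for mv in neighbour_mvs]),
        median_low([mv[1] for mv in neighbour_mvs]),
    )


def search_window(predicted_mv, radius):
    """Candidate displacements in a small square region centered on the
    tip of the predicted motion vector."""
    py, px = predicted_mv
    return [
        (py + dy, px + dx)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
    ]
```

With a radius of 1, only 9 candidate displacements are evaluated, compared with 225 for a fixed region of ±7 pixels around the search block.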