The spread of multimedia is increasingly driving modern communications to include video as part of the exchanged information. For example, new communications network architectures and services have been, and are constantly being, deployed to add video content to conventional voice telephone calls, both in wired and in wireless, mobile communications. This is the case, for example, of third-generation (3G) cellular telephony networks, but efforts are also being made to make videotelephony the standard in fixed, wired telephony networks (the so-called Public Switched Telephone Networks, or PSTNs).
Multimedia content, and video content in particular, is also increasingly being exploited in applications for personal computers, e.g. in video games, and in consumer electronics, such as cameras.
The diffusion of video in all these applications (and in many others, including broadcast television) has been made possible by the introduction of digital video, according to which a generic picture or image (throughout the present description these two terms will be regarded as synonyms and are to be understood as interchangeable), for example a frame/field of a video sequence, is subdivided into a finite number of “picture elements” (or “pixels”), each one corresponding to a respective small area of the image and having associated therewith (in terms of bits, i.e., strings of logic “1”s and “0”s) information about, e.g., the luminance and the chrominance of that image area.
Huge amounts of data may be required to properly describe an image (even a still one, let alone a motion sequence) in terms of pixels; thus, coding schemes have been proposed that reduce the otherwise practically unmanageable amount of information to be transferred or stored in order to communicate or save a video sequence.
Reducing as much as possible the amount of data to be handled (transferred over a transmission channel or stored in a storage device) is of paramount importance, because the capacity of the transmission channel and/or of the storage device used for delivering/storing the video is usually limited and has an intrinsic cost. However, the reduction of the number of bits to be transmitted/stored should affect as little as possible the quality (both measured and perceived by the users) of the video once it is reconstructed (decoded) for viewing.
In particular, a class of digital-video coding schemes (sometimes referred to as “block-coding schemes”) that keeps the amount of data to be transferred reasonably low calls for dividing a generic image frame or field (like a picture of a video sequence) into a plurality of image blocks, each image block including a subset of the pixels of the overall image (for example, an 8×8 or 16×16 pixel matrix); each image block is then processed using compression techniques. An exemplary coding scheme of this kind is known under the names H.264, MPEG-4 Part 10 or AVC (Advanced Video Coding).
In the above-cited video coding algorithms, a substantial reduction of the amount of data to be handled is achieved by means of the so-called “differential coding” technique: the generic image block under consideration is compared with other image blocks (referred to as “prediction image blocks”), which, using suitable prediction techniques, may be derived from the same image frame/field (so-called “intra” coding mode) or from previously transmitted image frames/fields, modified by motion-compensation techniques (so-called “inter” coding mode). The difference between the image block under consideration and the generic prediction image block is then typically transformed into the spatial-frequency domain (using transform functions like the Discrete Cosine Transform, or DCT), scaled and quantized, and finally converted into binary data (a string of bits) by means of so-called “entropy coding”. Among the different prediction image blocks, the one offering the best trade-off between the amount of data to be transmitted/stored and the quality of the reconstructed image is selected; the binary data corresponding to the transformed, scaled, quantized and encoded difference between the selected prediction image block and the image block under consideration are transmitted or stored.
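The differential-coding steps just described can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the processing defined by any particular standard: the function names `residual`, `dct2` and `quantize`, the naive orthonormal DCT-II, and the single uniform quantization step `step` are all hypothetical choices, and the entropy-coding stage is omitted.

```python
import math

def residual(block, pred):
    """Pixel-by-pixel difference between the current block and a prediction block."""
    return [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(block, pred)]

def dct2(block):
    """Naive orthonormal 2-D DCT-II of a square block (illustrative, not optimized)."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * acc
    return out

def quantize(coeffs, step):
    """Uniform scalar quantization with a single step size (an assumption)."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

# A flat 2x2 block predicted with a constant offset of 4:
current = [[14, 14], [14, 14]]
prediction = [[10, 10], [10, 10]]
levels = quantize(dct2(residual(current, prediction)), step=2)
```

In this toy case only the DC level is non-zero, and a perfect prediction yields all-zero levels, which is the situation in which an entropy coder has the least to encode.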
The compression gain, in terms of reduction of the number of bits to be handled, is achieved thanks to the entropy coding; the gain is higher the better the prediction (i.e., the more the prediction image block resembles the image block under consideration), and the more the scaling and quantization phases reduce the entropy of the image data.
The maximum gain is achieved when the prediction is so good that (at the predetermined level of quality) the image block under consideration is substantially indistinguishable from one of the prediction image blocks. In such a case, it is not necessary to transmit data indicative of the difference between the current image block and a selected prediction image block: it may be sufficient to encode and transmit the (auxiliary) information needed to determine which is the selected prediction image block. Under the assumption that the correct prediction image block can be identified automatically in the decoding phase, it may not even be necessary to transmit the above-mentioned auxiliary information: in this case, referred to as “skip-mode coding” or, simply, “skip mode”, the coding of the image block is achieved free of transmission/storage cost. Skip-mode coding is also referred to as “zero-bit coding”, since no decoding information needs to be transmitted/stored to enable the image decoder to reconstruct the image block; at most, an indication of how many consecutive image blocks have been “skipped” is sufficient for the image decoder.
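As a rough illustration of the last point, the following Python sketch turns a sequence of per-block skip decisions into counts of consecutive skipped blocks. The function name `skip_runs` and the output layout are hypothetical and do not reproduce the bitstream syntax of any standard.

```python
def skip_runs(skip_flags):
    """Convert per-block skip decisions into run counts.

    Returns (runs, trailing), where runs[i] is the number of skipped blocks
    immediately preceding the i-th coded block, and trailing is the number
    of skipped blocks after the last coded block.
    """
    runs = []
    run = 0
    for skipped in skip_flags:
        if skipped:
            run += 1          # extend the current run of skipped blocks
        else:
            runs.append(run)  # a coded block closes the current run
            run = 0
    return runs, run

# Two skipped blocks, a coded one, one skipped, two coded, one skipped:
runs, trailing = skip_runs([True, True, False, True, False, False, True])
```

With such a representation, the skipped blocks themselves carry no data; only the run counts accompany the coded blocks.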
Unfortunately, even when sophisticated motion-compensation techniques are adopted, it is not always possible to find, for any generic image block being processed, a corresponding prediction image block that allows relying on the skip mode. Nevertheless, a good encoder should be capable of detecting and applying the skip mode to as many image blocks as possible. This allows reserving the capacity of the transmission channel/storage medium for the transmission/storage of the other image blocks, for which the skip mode does not ensure acceptable results.
Ideally, a video encoder should perform the following actions for every image block into which the image (e.g., a frame of a video sequence) is subdivided. First, it should determine all the possible prediction image blocks for the image block under processing. Second, for each prediction image block, it should calculate the difference with the image block under processing, apply the transform function, scale and quantize the transformed data, and calculate the entropy encoding thereof, so as to establish the cost, in number of bits, associated with that prediction image block. Third, based on the costs evaluated for the different prediction image blocks, it should determine the best prediction image block, which is the one involving the minimum cost in terms of number of bits or, in more recent encoders, the one that optimizes the so-called “rate/distortion” factor (in which case the quality, i.e. the signal/noise factor, of the reconstructed image is also taken into consideration when evaluating a generic prediction image block). Finally, the possibility of relying on the skip mode should be evaluated: the best prediction image block should be compared with the result achievable by adopting the skip mode (in terms of the trade-off between the number of bits to be handled and the quality of the decoded image), so as to establish whether the skip mode implies a worsening of the image quality and, if so, whether that worsening is acceptable both at a local image level (the quality worsening may, for example, cause unacceptable artifacts in the reconstruction of the video sequence) and at a global image level (tolerating a quality reduction for the image block under processing may, for example, free bits that can be used for transmitting/storing other image blocks, so that the overall quality of the reconstructed image might improve).
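The selection of the best prediction image block on a rate/distortion basis can be sketched as follows. The text above does not specify how the rate/distortion factor is computed; the Lagrangian form cost = distortion + λ · bits used here is a common formulation and is an assumption, as are the names `rd_cost`, `best_candidate` and the multiplier `lam`. The (distortion, bits) pairs are taken as already measured for each candidate.

```python
def rd_cost(distortion, bits, lam):
    """Assumed Lagrangian rate/distortion cost: distortion plus lam times bit count."""
    return distortion + lam * bits

def best_candidate(measurements, lam):
    """Return the index of the (distortion, bits) pair with the lowest cost.

    measurements must be non-empty; ties go to the first candidate.
    """
    return min(range(len(measurements)),
               key=lambda i: rd_cost(measurements[i][0], measurements[i][1], lam))

# Hypothetical measurements for three candidate prediction blocks:
measurements = [(100, 10), (40, 50), (0, 200)]
choice = best_candidate(measurements, lam=1)
```

With `lam=0` only the distortion counts, so the most expensive but most accurate candidate wins; a larger `lam` shifts the choice toward candidates that need fewer bits.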
All these operations are extremely heavy in terms of processing power, and the processing time may easily become unacceptably long. In practical applications, however, the time available for searching for the best prediction is usually limited, especially in real-time applications. Sometimes, typically in the case of portable, battery-powered devices, the processing also has to be limited for reasons related to the power consumption of the hardware devices implementing the encoder.
A substantial part of the processing time needed for identifying the best prediction is spent on calculating the spatial transform of the difference between the current image block and each prediction image block, scaling/quantizing the result, and performing the entropy coding.
In view of this, a technique has been proposed and adopted that allows estimating the encoding cost associated with each prediction image block without first performing the transformation into the spatial-frequency domain of the difference between the current image block and that prediction image block, the scaling/quantization and the entropy coding. Such a technique, described for example in U.S. Pat. No. 6,473,529, calls for evaluating the difference, pixel by pixel, between the image block under consideration and the generic prediction image block, and then adding the calculated pixel-by-pixel differences over all the pixels of the image block under processing. In particular, the sum is calculated on the absolute values (so-called Sum of Absolute Differences, or SAD) or on the squares (Sum of Squared Differences, or SSD, or Mean Square Error, or MSE) of the pixel-by-pixel differences, so that positive and negative differences do not cancel each other out.
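A minimal Python sketch of these measures is given below. Blocks are represented as lists of rows of integer pixel values, which is an assumption made for illustration; the helper `signed_sum` is hypothetical and is included only to show the cancellation that the absolute value or the square is meant to avoid.

```python
def signed_sum(block, pred):
    """Plain sum of pixel differences; positive and negative terms can cancel."""
    return sum(a - b for row_a, row_b in zip(block, pred) for a, b in zip(row_a, row_b))

def sad(block, pred):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block, pred) for a, b in zip(row_a, row_b))

def ssd(block, pred):
    """Sum of Squared Differences between two equally sized blocks."""
    return sum((a - b) ** 2 for row_a, row_b in zip(block, pred) for a, b in zip(row_a, row_b))

current = [[10, 12], [14, 16]]
prediction = [[12, 10], [16, 14]]

# The pixel differences are -2, +2, -2, +2.
plain = signed_sum(current, prediction)   # 0: the mismatch is hidden
absolute = sad(current, prediction)       # 8
squared = ssd(current, prediction)        # 16
```

Dividing the SSD by the number of pixels in the block gives the mean square error.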
The above-mentioned technique is based on the assumption that a smaller value of the calculated SAD for a generic image block corresponds to a smaller number of bits required for encoding (the difference between the selected prediction image block and) the image block under processing.
By adopting the SAD technique, the search for the best prediction image block involves the following actions: performing the SAD calculation on every prediction image block; based on the calculated SADs, determining the candidate prediction image block as the one whose SAD is the lowest among the SADs calculated for all the possible prediction image blocks; calculating the SAD between the image block under processing and the prediction image block corresponding to the skip-mode coding; comparing the SAD of the candidate prediction image block with the SAD of the prediction image block corresponding to the skip-mode coding; and, depending on the outcome of this comparison (possibly weighted so as to favor the skip-mode coding, which certainly saves bits), establishing whether to transmit/store the (encoded difference between the selected prediction image block and the) image block under processing, or to skip it.
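The sequence of actions listed above can be sketched in Python as follows. The way the comparison is weighted in favor of the skip mode is not specified in the text; the additive allowance `skip_bias`, like the function name `choose_mode` and its return convention, is a hypothetical choice made only for illustration.

```python
def sad(block, pred):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block, pred) for a, b in zip(row_a, row_b))

def choose_mode(block, candidates, skip_pred, skip_bias=0):
    """Pick between skip mode and the lowest-SAD candidate prediction block.

    candidates must be non-empty. Returns ("skip", None) or ("coded", index),
    where index identifies the candidate with the lowest SAD.
    """
    sads = [sad(block, cand) for cand in candidates]
    best_index = min(range(len(sads)), key=sads.__getitem__)
    skip_sad = sad(block, skip_pred)
    # Assumed weighting: the skip predictor may be worse by up to skip_bias.
    if skip_sad - skip_bias <= sads[best_index]:
        return ("skip", None)
    return ("coded", best_index)

block = [[10, 10], [10, 10]]
candidates = [[[10, 10], [10, 11]], [[0, 0], [0, 0]]]   # SADs: 1 and 40
skip_pred = [[10, 10], [11, 12]]                        # SAD: 3

unbiased = choose_mode(block, candidates, skip_pred)               # ("coded", 0)
biased = choose_mode(block, candidates, skip_pred, skip_bias=4)    # ("skip", None)
```

Only after this decision would the transform, scaling/quantization and entropy coding be applied, and only to the selected candidate when the block is not skipped.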
The adoption of the SAD technique reduces the required processing power, because the calculation of the spatial transform, the scaling/quantizing and the entropy coding are performed only once, after the best prediction image block has been determined (and not on every prediction image block, as in the first approach discussed above). Nevertheless, it does not per se avoid the need to determine all the candidate prediction image blocks for the image block under consideration, and to calculate the SAD for each of them. This operation may become rather heavy in terms of processing power: video encoders implementing this technique may spend most of their time calculating the SADs for the several candidate prediction image blocks associated with a generic image block.
A different approach is adopted in WO 2004/056125, in which the authors propose to reduce the computational complexity of video encoding by taking the decision whether to encode a region of a video frame or to skip the encoding prior to calculating whether any motion has occurred in respect of the same region in the previous frame. In one embodiment, the decision on whether to skip the encoding of a region is based on an estimate of the energy of pixel values in the region and/or on an estimate of discrete cosine transform coefficients. In a further embodiment, the decision is based on an estimate of the distortion likely to occur if the region is not encoded. The Applicant observes that the algorithms disclosed in the WO '125 patent application are based on SAD calculations.