In recent decades, several image and video compression formats based on international standards have been widely adopted by the television and cinema production industry, such as MPEG-2 (ITU-T H.262), MPEG-4 Part 2 (ISO/IEC 14496-2), and MPEG-4 AVC (ISO/IEC 14496-10 and ITU-T H.264), as well as the latest standard, known as HEVC (High Efficiency Video Coding), approved as MPEG-H Part 2 and ITU-T H.265. Furthermore, the audio-visual industry has adopted proprietary compression formats that have emerged as variations of these standards. Such proprietary formats either limit some syntactic elements or constrain some functionalities of the coding algorithm. Among these are the format families known as Betacam, DVCPRO, AVC-Intra (SMPTE RP 2027-2007), DNxHD (SMPTE VC-3), and ProRes.
The video compression formats used in production environments are characterized by their ability to achieve high video quality free of visual losses, without blocking artifacts or any other perceptual artifacts inherent to lossy coding architectures. They usually share some parameters, such as 4:2:2 chroma subsampling and an Intra-Frame coding scheme, but some formats also use an Inter-Frame scheme based on motion estimation and motion compensation techniques.
Formats used in the professional production field usually have high bitrates. For instance, Standard Definition (SD) formats usually use bitrates between 25 Mbps and 50 Mbps, High Definition (HD) formats use bitrates in the range of 50 Mbps to 400 Mbps, and Ultra-High Definition (UHD) formats use bitrates over 400 Mbps and up to 1 Gbps.
Content re-encoding is inevitable in many stages of the production workflow, wherever content stored in a particular compression format must be decoded in order to engage in editing or pre-processing, tasks that occur exclusively in the pixel domain. The content is later re-encoded for storage or delivery. FIG. 1 depicts an example of three steps of re-encoding or multi-generation, where the content is initially captured using a compression format 100, obtaining the first generation (1G) 101, after which content is re-encoded in the editing stage 102, becoming a second generation (2G) 103. Lastly, it is newly re-encoded in a third generation (3G) 104 for delivery to the end user 105.
As is well known, the re-encoding processes (encoding stage followed by decoding stage), which are typical in the audio-visual production workflow, incur significant quality losses in each successive generation. Such losses are inherent to the transform coding scheme, but they can be reduced if all the encoders in the workflow use the same encoding parameters, such as quantization parameters, Group Of Pictures (GOP) structure, and motion parameters, among others. Even so, the re-encoding process using traditional encoding schemes can cause irreversible losses because the transform kernel cannot be represented with finite arithmetic precision. Further losses are due to the rounding to integer arithmetic used in the inverse transformation stage of the decoding process; this effect is known as “IDCT-drift” (Inverse Discrete Cosine Transform drift). The new generation of video coding standards, such as H.264/AVC and HEVC (ITU-T H.265), has minimized these losses by using a new integer transform kernel based on an approximation to the DCT (Discrete Cosine Transform).
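The benefit of reusing the same quantization parameters across generations can be illustrated with a minimal sketch. It assumes a plain uniform scalar quantizer applied directly to a handful of illustrative values; the helper names (`quantize`, `generation`, `sse`) and the step sizes are hypothetical and do not correspond to any particular codec.

```python
import math

def quantize(value, step):
    """Uniform scalar quantization followed by reconstruction (round half up)."""
    return step * math.floor(value / step + 0.5)

def generation(values, step):
    """One encode/decode cycle, modeled as quantizing every value."""
    return [quantize(v, step) for v in values]

def sse(ref, test):
    """Sum of squared errors between two sequences."""
    return sum((r - t) ** 2 for r, t in zip(ref, test))

original = [3, 17, 22, 41]            # illustrative coefficient values
gen1 = generation(original, 8)        # first generation, step 8
gen2_same = generation(gen1, 8)       # second generation, same step
gen2_diff = generation(gen1, 5)       # second generation, different step

print(sse(original, gen1), sse(original, gen2_same), sse(original, gen2_diff))
```

In this toy model, the second generation adds no error when the step is unchanged, because the reconstructed values already lie on the quantizer grid. Changing the step moves the grid and adds error on top of the first-generation loss.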
With the aim of reducing encoding losses, the audio-visual industry, through the SMPTE (Society of Motion Picture and Television Engineers), approved the “SMPTE 327M-2000. MPEG-2 Video Recording Data Set” standard, which defines the set of encoding parameters that an MPEG-2 compatible video decoder (ISO/IEC 13818-2) can extract from the encoded video stream (ISO/IEC 13818-1) and send to the next encoder, allowing that encoder to repeat the same encoding conditions applied by the previous encoder. These parameters must be sent by the decoder to the next encoder as ancillary data or “metadata,” so that they can be applied in the new encoding.
The standards “SMPTE 319M-2000. Transporting MPEG-2 Recording Information through 4:2:2 Component digital Interfaces” and “SMPTE 351M-2000. Transporting MPEG-2 Recording Information through High Definition Digital Interfaces” determine the mechanism to transport such metadata on the two least significant bits of the chrominance components through the SDI and HD-SDI interfaces. This mechanism for transmitting encoding parameters from the decoder to the next encoder in the chain has many drawbacks, which hindered industry-wide adoption. These drawbacks include:
The standard is constrained to the encoding of non-scalable profiles of the MPEG-2 standard, meaning it is inapplicable to encoders in the chain that support standards of higher efficiency than MPEG-2, such as the H.264 and HEVC standards.
Its effectiveness is significantly reduced if the new coding cannot use the previous coding parameters, for example, if modifications to the bitrate or the GOP structure become necessary.
Metadata transport in the least significant bits of the interfaces compromises its integrity along the chain, because many digital video devices process all 10 bits of the chrominance components, including the two least significant bits. Therefore, the metadata stored in those bits is lost.
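The last drawback can be illustrated with a minimal sketch. It assumes 10-bit chrominance samples held as plain integers and a hypothetical 2-bit payload per sample; the helper names and the gain operation are illustrative only, and the sketch does not reproduce the actual packing defined by SMPTE 319M or SMPTE 351M.

```python
LSB_MASK = 0b11  # the two least significant bits of a 10-bit sample

def embed(samples, payload):
    """Overwrite the two LSBs of each sample with two payload bits."""
    return [(s & ~LSB_MASK) | (p & LSB_MASK) for s, p in zip(samples, payload)]

def extract(samples):
    """Read back the two LSBs of each sample."""
    return [s & LSB_MASK for s in samples]

def apply_gain(samples, num=21, den=20):
    """Hypothetical full 10-bit processing step: a 5% gain with clipping."""
    return [min(1023, s * num // den) for s in samples]

chroma = [512, 600, 448, 700]        # illustrative 10-bit chroma samples
payload = [0b01, 0b10, 0b11, 0b00]   # illustrative metadata bits

carried = embed(chroma, payload)
print(extract(carried))              # payload survives a transparent link
print(extract(apply_gain(carried)))  # payload is scrambled after processing
```

Any device that operates on the full 10-bit word, as the gain step does here, rewrites the bits that carry the metadata.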
For this reason, the coding formats used in professional production environments use high bitrates. On one hand, high bitrates achieve maximum quality, so that video processing operations (rotations, filters, or the insertion of other visual elements) do not disturb the perceptual quality. On the other hand, they provide high robustness against the distortions accumulated across re-encoding cycles, so that the final encoding cycle shows neither distortions nor visible artifacts.
Recommendation ITU-R BT.1872 proposes that, for high-definition formats encoded with the H.264 standard, the contribution services between production centers use a bit rate of 21 Mbps for transmission systems with only one re-encoding cycle, but it recommends the bitrate be increased by 66% (up to 35 Mbps) if up to three encoding cycles may occur in the transmission chain.
The “Apple ProRes, White Paper, October 2012” document collects the results of simulations performed with the ProRes 422 and ProRes 422 HQ compression formats. The objective quality losses as measured by the Peak Signal to Noise Ratio (PSNR) are close to 3 dB after the third generation. Although the multi-generation process leads to moderate quality losses in each encoding cycle, other factors in the workflow can significantly increase these losses. The paper “G. Sullivan; S. Sun; S. Regunathan; D. Schonberg; C. Tu; S. Srinivasan, Image coding design considerations for cascaded encoding-decoding cycles and image editing: JPEG analysis, JPEG 2000, and JPEG XR/HD Photo, Applications of Digital Image Processing XXXI, SPIE Proceedings Vol. 7073” analyzes the parameters that can increase such losses in a re-encoding scenario. Several image processes that may be applied in the pixel domain between the decoding and new encoding stages are described, including the conversion of color space (from YCbCr to RGB), color subsampling conversion (from 4:2:2 to 4:2:0), the use of different quantization parameters (variation in QP), the use of a compression format different from that previously used, and spatial shifting of the picture, known as “Spatial Shifting” or “Pixel Shifting.”
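For reference, the PSNR figures quoted in this section follow the usual definition based on the mean squared error (MSE). The sketch below assumes 8-bit samples (peak value 255) held in flat lists; the function names are illustrative.

```python
import math

def mse(ref, test):
    """Mean squared error between two equally sized sample sequences."""
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

def psnr_from_mse(error, peak=255):
    """PSNR in dB for a given MSE; infinite when the signals are identical."""
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)

def psnr(ref, test, peak=255):
    return psnr_from_mse(mse(ref, test), peak)

# Doubling the MSE lowers the PSNR by 10*log10(2), about 3.01 dB.
loss_db = psnr_from_mse(1.0) - psnr_from_mse(2.0)
print(round(loss_db, 2))
```

A loss close to 3 dB, as reported after the third generation, therefore corresponds to roughly twice the mean squared error of the reference condition.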
Spatial shifts are frequently introduced by video-editing processes, such as the image scaling or image filtering used in some digital effects, and can even result from synchronization mismatches in the production equipment, such as video switchers or digital video effects inserters. FIG. 2 shows a workflow in which a spatial shift effect typically occurs between two consecutive re-encoding generations k 200 and k+1 202 due to a video editing or post-production stage 201.
FIG. 3 depicts another spatial shift effect that can happen in the recoding of an image or video sequence 300, which has been encoded and decoded 301, and which is inserted as a PiP (“Picture In Picture”) 302. The new image includes the PiP with lower resolution (W×H) 303 located on <x,y> coordinates that are not a multiple of the N×N block size of the transform employed.
An example of a spatial shift effect between two consecutive stages of compression or coding 400 and 405 is shown in FIG. 4. Video compression schemes based on transform coding with square blocks of size N×N apply the transform to the whole image, taking the upper-left corner of the image 401 as the origin of the transform. The editing and post-production processes in which the spatial displacement 403 can happen are necessarily applied to the decoded video sequence 402. FIG. 4 shows that after editing and post-production 403 there is a horizontal spatial shift to the right in image 404. The transformed block positions after the new encoding 405, marked 406 with solid lines, do not match the transformed block positions in the previous encoding 400, marked with dashed lines 406. This misalignment between the positions of the transformed blocks in the two coding stages increases image-quality losses.
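The effect of block misalignment can be reproduced with a minimal one-dimensional sketch. It assumes a 2-sample sum/difference transform with uniform quantization as a stand-in for the N×N transform of a real codec, and models the spatial shift as an offset of the block grid. The signal values, the step size, and the function names are all illustrative.

```python
import math

def quantize(value, step):
    """Uniform quantization with reconstruction (round half up)."""
    return step * math.floor(value / step + 0.5)

def code_blocks(samples, step, offset=0):
    """Encode and decode 2-sample blocks whose grid starts at `offset`.
    Samples not covered by a full block are passed through unchanged."""
    out = list(samples)
    for i in range(offset, len(samples) - 1, 2):
        a, b = samples[i], samples[i + 1]
        s = quantize(a + b, step)      # quantized "low-pass" coefficient
        d = quantize(a - b, step)      # quantized "high-pass" coefficient
        out[i], out[i + 1] = (s + d) / 2, (s - d) / 2
    return out

def mse(ref, test):
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

source = [10, 13, 20, 27, 31, 30, 22, 15]
gen1 = code_blocks(source, step=8)                    # first generation
gen2_aligned = code_blocks(gen1, step=8)              # same block grid
gen2_shifted = code_blocks(gen1, step=8, offset=1)    # grid shifted by one sample

print(mse(source, gen1), mse(source, gen2_aligned), mse(source, gen2_shifted))
```

With the grid aligned, the second generation reproduces the first exactly, since every coefficient already sits on a quantization level. With the grid shifted by one sample, each block straddles two blocks of the previous generation and the coefficients are quantized anew, so the error grows.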
The European Broadcasting Union (EBU), aware of the problem of quality degradation resulting from spatial shifting in re-encoding scenarios, introduced spatial shifting between generations in its video coding quality assessment methodology. “Massimo Visca and Hans Hoffmann, HDTV Production Codec Test” describes this methodology, which is called “Standalone chain with processing.” The method evaluates the robustness of different image and video coding formats against the effect of spatial shifting.
“P. Lopez and P. Guillotel, HDTV video compression for professional studio applications, Picture Coding Symposium, PCS 2009, pp. 1-4, 6-8 May 2009” demonstrates that objective quality losses, measured in terms of PSNR, are close to 2 dB after eight encoding cycles. However, the same report shows that under the same test conditions, if spatial shifting is introduced between each of the eight encoding stages, objective quality losses increase noticeably by around 8 dB, revealing the significant quality loss resulting from this type of processing in multi-generation scenarios.
Another well-known effect that increases multi-generational losses is misalignment between the positions of the GOP used in consecutive generations of encoding and decoding. As is widely known, GOP structures achieve the highest compression efficiency by applying a structure which alternates pictures of types I (Intra-coded picture), B (Bi-predictive picture), and P (Predicted picture). Depending on the compression standard, B pictures can be used as reference pictures by other B pictures, composing a hierarchical structure with different layers of prediction. Likewise, such GOP structures commonly use different quantization steps for each of the I, P, and B picture types. The I pictures, which are used as references for the remaining P and B pictures within the same GOP, are usually quantized with a smaller quantization step, with the aim of achieving high quality and good prediction for the rest of the pictures that compose the GOP.
The P pictures, which are used as a reference for the B pictures and the following P pictures, are usually quantized with a step slightly larger than that used for the I pictures. In general, the quality of the P pictures is slightly below that of the I pictures.
Finally, the B pictures are usually quantized with a step larger than that used for the I and P pictures belonging to the same GOP, since their quality does not affect the prediction of other pictures within the GOP. The B pictures used as a reference for other B pictures use a quantization step intermediate between that used for the P pictures and that used for the B pictures belonging to the lower reference picture layer. Consequently, the B pictures usually obtain lower quality than the I and P pictures belonging to the same GOP.
Generally, the second encoder, which receives the video sequence in baseband format, does not know the structure, length, and start position of the GOP previously used. Consequently, it uses a GOP whose configuration may not coincide with that of the previous encoder, causing misalignments in the GOP structures between successive encoding generations. “Pierre Larbier, Keeping Video Quality Pristine throughout the Production Process: 4:2:2 10-bit H.264 encoding, SMPTE technical conference, Hollywood, October 2009,” shows that GOP misalignment increases quality losses by 2 dB after seven generations.
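GOP misalignment can be made concrete with a minimal sketch that assigns a picture type to each frame for two encoders and counts the frames whose type differs. The 12-picture pattern, the start offsets, and the function names are assumptions chosen for illustration and are not taken from any of the cited documents.

```python
def picture_types(num_frames, pattern, start=0):
    """Picture type of each frame for a repeating GOP pattern that
    begins (with its I picture) at frame index `start`."""
    return [pattern[(i - start) % len(pattern)] for i in range(num_frames)]

def mismatched_frames(first, second):
    """Indices of frames coded with different picture types in two generations."""
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

PATTERN = "IBBPBBPBBPBB"                             # illustrative 12-picture GOP

gen1 = picture_types(24, PATTERN, start=0)           # first encoder
gen2_aligned = picture_types(24, PATTERN, start=0)   # same GOP position
gen2_shifted = picture_types(24, PATTERN, start=5)   # GOP starts 5 frames later

print(len(mismatched_frames(gen1, gen2_aligned)))
print(len(mismatched_frames(gen1, gen2_shifted)))
```

Every mismatched frame is coded with a picture type, and therefore typically a quantization step, different from the one used in the previous generation. For example, a frame coded as a coarsely quantized B picture in the first generation may have to serve as an I or P reference in the second.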
Therefore, a method is needed to improve the quality of an image or video sequence subjected to various re-encoding processes. For example, it would be desirable to have a method that improves the quality of an image or video sequence by avoiding the quality losses experienced as a result of spatial shifting and/or GOP misalignment.