The present invention relates to digital video signal processing, and more particularly to film mode and bad edit detection as is useful in de-interlacing video fields.
For moving picture systems, interlaced video format is widely used to reduce data rate. That is, each image frame consists of two fields, each of which contains samples of either the even numbered (top field) or the odd numbered (bottom field) lines of the image. In interlaced scan, fields are scanned and displayed sequentially, as shown for a 5×5 pixel portion in FIG. 7A. By taking advantage of the time it takes for an image to fade on a CRT, interlaced video gives the impression of double the actual refresh rate, which helps to prevent the flicker that occurs when the monitor's CRT is driven at a low refresh rate, and allows the screen's phosphors to lose their excitation between sweeps of the electron gun. Interlaced scan achieves a good tradeoff between frame rate and transmission bandwidth requirements. However, when displaying video on a display that can support a refresh rate high enough that flicker is not perceivable, progressive scanning is preferable, since interlacing reduces the vertical display resolution and causes twitter effects when displaying pictures with high vertical frequency. In progressive scan, each frame is scanned as a whole and frames are displayed continuously, as shown in FIG. 7B. Note that one frame, as shown in FIG. 7B, consists of the two fields shown in FIG. 7A.
De-Interlacing
Due to the increased popularity of progressive displays, such as high-performance CRT/LCD/DLP/LCOS projectors, the new HDTV-ready TVs, and PC monitors, which show progressive scanned images as opposed to interlaced images, there is a need to display interlaced video on progressive displays. Thus, the function of converting interlaced video to progressive video, which is called de-interlacing, is very desirable. The task of de-interlacing is to convert the interlaced fields into progressive frames, which represent the same image as the corresponding input field but contain the samples of the missing lines as well. This process is illustrated in FIG. 7C, where the dashed lines represent the missing lines in the interlaced video.
Mathematically, for given interlaced input pixel values F(j,i,n), the output pixel values from de-interlacing, Fo(j,i,n), can be defined as
$$
F_o(j,i,n) =
\begin{cases}
F(j,i,n), & \operatorname{mod}(j,2) = \operatorname{mod}(n,2) \\
\hat{F}(j,i,n), & \text{otherwise}
\end{cases}
$$
where j, i, and n are the vertical, horizontal, and temporal index, respectively, $\hat{F}(j,i,n)$ is the estimation of the missing lines generated by the de-interlacing method, and F(j,i,n) is the pixel value from the original interlaced field. That is, the existing, even or odd, lines in the original fields are directly transferred to the output frame.
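The definition above can be illustrated with a minimal sketch. The function name `deinterlace_line_average` and the choice of vertical line averaging as the estimator for the missing lines are assumptions made only for illustration; the text does not prescribe any particular estimator. Existing lines are copied unchanged, and each missing line is estimated from whichever vertical neighbours are present in the field.

```python
def deinterlace_line_average(field_lines, height, n):
    """Build one progressive frame from the field at temporal index n.

    field_lines maps a line index j to a list of pixel values and holds only
    the lines with j % 2 == n % 2 (the lines present in this field).
    Line averaging is an illustrative estimator, not a prescribed one.
    """
    frame = []
    for j in range(height):
        if j % 2 == n % 2:
            # Existing line: transferred to the output frame unchanged.
            frame.append(list(field_lines[j]))
        else:
            # Missing line: average the vertical neighbours that exist.
            neighbours = [field_lines[k] for k in (j - 1, j + 1) if 0 <= k < height]
            frame.append([sum(column) / len(neighbours) for column in zip(*neighbours)])
    return frame
```

For a top field (n even) of a four-line image, lines 0 and 2 are copied, line 1 is the average of lines 0 and 2, and line 3 repeats line 2 because it has only one neighbour.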
Thus de-interlacing is a line-rate up-conversion process. From the signal processing point of view, the objective of de-interlacing is to preserve the baseband spectrum and suppress the “alias” introduced during up conversion as much as possible. This is not a straightforward linear upsampling problem, however, since TV signals do not fulfill the sampling theorem constraints (vertical prefiltering usually is not employed when the sensors in the camera sample the scene).
There are various ways to calculate the missing pixel $\hat{F}(j,i,n)$. Generally speaking, spatial (intra-frame), temporal (inter-frame), and spatial-temporal de-interlacing algorithms are simple but usually lead to poor conversion performance. Motion adaptive techniques generally perform better but are much more complex to implement. Nevertheless, none of these techniques can fully recover the information lost during interlacing, because interlacing is a non-reversible procedure.
But the task of de-interlacing will be simple if the sources are progressive in nature. For example, most movies stored on DVD have an original source in the form of progressive frames. However, in order to be displayed on an interlaced scanned CRT TV, the sources are encoded as interlaced fields and then stored on DVD. This process to convert progressive frames into interlaced fields is called Telecine. During Telecine, the original progressive frames are divided into halves, thus no information is lost.
Unlike native NTSC interlaced video material, where each field represents a unique snapshot in time, the two fields generated by Telecine are snapshots obtained at the same time instance. If the two fields that belong to one frame can be correctly identified, the original film can be recovered without any loss and without introducing any artifacts.
NTSC Telecine (Conversion of 24 fps Film to 60 Hz NTSC TV)
Motion picture photography is based on 24 fps (frames per second). As the NTSC TV standard runs at 60 interlaced fields per second, Telecine uses a process known as 3-2 pulldown to create 10 video fields from every 4 film frames (24/4*10=60). This form of Telecine alternates between creating 3 fields from one film frame and 2 fields from the next film frame, as shown in FIG. 7D.
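The field count of 3-2 pulldown can be checked with a short sketch. The function name `pulldown_32`, the `(frame_id, parity)` representation of a field, and the choice to start the cadence with a three-field frame are assumptions for illustration; an actual Telecine may start at either point of the cadence.

```python
def pulldown_32(frames, counts=(3, 2)):
    """Return a list of (frame_id, parity) fields for 3-2 pulldown.

    counts gives how many fields alternate frames contribute; starting
    with three fields is an assumption of this sketch.
    """
    fields = []
    parity = 0  # 0 = top field (even lines), 1 = bottom field (odd lines)
    for index, frame_id in enumerate(frames):
        for _ in range(counts[index % 2]):
            fields.append((frame_id, parity))
            parity ^= 1  # field parity alternates across the whole sequence
    return fields
```

Four film frames yield ten fields, so 24 frames yield 60 fields. In each three-field group the third field has the same source frame and parity as the first, which is why it is a duplicate.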
Re-Interleaving 24 fps Film
For every film frame that had three fields made from it, the third field is a duplicate of the first, as shown in FIG. 7E. As discussed above, the objective of de-interlacing for a film source is to correctly identify which two fields originated from one film picture and assemble them into one progressive frame. If the goal is to recover the film source and display it at its original rate of 24 fps, the job is done. However, if the reconstructed progressive frames need to be displayed at 60 frames per second, the progressive output should assemble 2 fields from each film frame and create a complete progressive frame that looks just like the original film frame; that is, 5 frames need to be constructed from 5 fields, which were created from 2 film pictures during the Telecine process. FIG. 7E illustrates the re-interleaving procedure, which alternates between tripling and doubling each frame (1, 1, 1, 2, 2, 3, 3, 3, 4, 4). This interleaving pattern gives rise to the name “inverse 3-2 pull down” for this procedure.
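The re-interleaving step can be sketched as follows, assuming the cadence is already known. The helper names `weave` and `reinterleave`, the `(parity, lines)` field representation, and the `partner` index list (which would come from a pull down detector) are hypothetical and serve only to show that each input field produces one output frame.

```python
def weave(top_lines, bottom_lines):
    """Interleave a top field (even lines) and a bottom field (odd lines)."""
    frame = []
    for top, bottom in zip(top_lines, bottom_lines):
        frame.append(top)
        frame.append(bottom)
    return frame


def reinterleave(fields, partner):
    """Produce one progressive frame per input field.

    fields is a list of (parity, lines) with parity 0 for top and 1 for
    bottom; partner[k] is the index of the opposite-parity field that came
    from the same film frame (assumed known from cadence detection).
    """
    frames = []
    for k, (parity, lines) in enumerate(fields):
        other_parity, other_lines = fields[partner[k]]
        if other_parity == parity:
            raise ValueError("partner field must have the opposite parity")
        if parity == 0:
            frames.append(weave(lines, other_lines))
        else:
            frames.append(weave(other_lines, lines))
    return frames
```

With five fields made from two film pictures A and B (A top, A bottom, A top, B bottom, B top), the output is the frame sequence A, A, A, B, B.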
De-Interlacing Other Sources
The above discussions cover how to generate 60 fields per second video sequences from 24 frames per second film source (Telecine) and how to recreate the 60 frames per second progressive video from it (de-interlacing). The two processes are also named 3-2 pull down and inverse 3-2 pull down, respectively, due to the represented fields/frame pattern during the conversions.
Besides the 24 Hz film source, we also discuss another type of film source: true 30 frames per second material. For true 30 frames per second material, as for 24 Hz film material, interlaced fields are generated from the progressive film source and then stored. The two fields that originated from one frame represent snapshots at the same time instance. To recreate the original film frames from the interlaced video, we need to detect which two fields belong to the same progressive frame. As de-interlacing for this type of material converts a 60 fields per second sequence into a 60 frames per second sequence, the field pattern is 2-2-2-2 as opposed to 3-2-3-2 for the 24 Hz film source. For this reason, we name this type of de-interlacing 2-2 pulldown, where pairs of fields need to be woven together and each resulting progressive frame is displayed twice.
Based on the above discussions, we can see that the key in de-interlacing for both types of film materials is to detect which two fields belong to the same progressive frame. These techniques are called 3-2 pull down detection and 2-2 pull down detection, for the 24 Hz and 30 Hz film materials, respectively.
General Techniques for 3-2 and 2-2 Pull Down Detection
Based on the above discussion, different from regular interlaced sequences, where all fields are snapshots taken at different time instances, the two fields that originated from one film frame represent snapshots at the same time instance. This difference will be used to distinguish film source from regular interlaced source.
It is easy to understand that two fields representing the same time instance are more correlated (similar) than two fields representing different time instances. Hence, for a 2-2 pull down film source, as shown in FIG. 7F, if we measure the correlation (or differences) between the neighboring fields, the resulting correlation should follow the pattern of “strong, weak, strong, weak, . . . ”, where strong correlation is associated with the two fields that originated from one progressive film frame. As shown in FIG. 7F, we can use the differences of two fields to denote the level of correlation and compare those field differences with some threshold; the comparison results will then be in the pattern of “1, 0, 1, 0, . . . ”, if the source is 2-2 pull down.
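The thresholding step just described can be sketched as below. The sum of absolute pixel differences as the field difference measure, the function names, and the polarity of the flag (1 when the difference exceeds the threshold) are assumptions for illustration; the text fixes only that the results alternate for a 2-2 pull down source.

```python
def field_difference(field_a, field_b):
    """Sum of absolute pixel differences between two equally sized fields."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(field_a, field_b)
        for a, b in zip(row_a, row_b)
    )


def neighbour_flags(fields, threshold):
    """Flag each neighbouring field pair: 1 if the difference exceeds threshold."""
    return [
        1 if field_difference(fields[k], fields[k + 1]) > threshold else 0
        for k in range(len(fields) - 1)
    ]


def is_alternating(flags):
    """True when consecutive flags always differ (the 2-2 pull down pattern)."""
    return len(flags) >= 2 and all(a != b for a, b in zip(flags, flags[1:]))
```

Fields woven from the same film frame give small differences and fields from different frames give large ones, so the flags alternate for film, while fields that are all distinct snapshots give a constant run of flags.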
As for a 3-2 pull down source, as mentioned above, one field from every other progressive film frame needs to be repeated during the telecine procedure, in order to meet the required field rate for the resulting interlaced video. For example, as shown in FIG. 7F, Field 3 and Field 5 should be the same field originated from film frame 2. Note that the odd numbered fields in FIG. 7F have the same field parity, that is, they are all odd fields or all even fields. So if we measure the field differences of two neighboring fields with the same parity, the difference between field 3 and field 5 should be very small, as should the difference between field 8 and field 10, and between field 13 and field 15. The other differences should be much larger than those small differences. Thus, for 3-2 pull down detection, we usually calculate the differences of two neighboring fields with the same parity and compare these differences with a threshold. If the source is a 3-2 pull down film source, the comparison results should follow the pattern of “1, 1, 1, 1, 0, 1, 1, 1, 1, 0, . . . ”.
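A minimal sketch of this same-parity comparison follows. The function names, the sum of absolute differences as the measure, and the requirement that every flag fit the cadence exactly are assumptions for illustration; a practical detector would tolerate occasional mismatches.

```python
def same_parity_flags(fields, threshold):
    """Compare each field with the next field of the same parity (two fields later).

    A flag of 1 means the difference exceeds the threshold; 0 marks a
    repeated field.
    """
    flags = []
    for k in range(len(fields) - 2):
        diff = sum(
            abs(a - b)
            for row_a, row_b in zip(fields[k], fields[k + 2])
            for a, b in zip(row_a, row_b)
        )
        flags.append(1 if diff > threshold else 0)
    return flags


def find_32_phase(flags):
    """Return the position (0 to 4) of the 0 within each group of five flags,
    or None if the flags do not follow the 3-2 pull down pattern."""
    if len(flags) < 5:
        return None
    for phase in range(5):
        if all((flag == 0) == (k % 5 == phase) for k, flag in enumerate(flags)):
            return phase
    return None
```

With fields numbered from 1 as in the text, a phase of 2 corresponds to the zero falling on the comparison of field 3 with field 5, then field 8 with field 10, and so on.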
Challenges
So far, the task of 2-2 pull down and 3-2 pull down detection seems rather simple. It is not quite true, however, due to a few challenges as set forth below.
(1) The two neighboring fields used for comparison in 2-2 pull down detection have different field parity (that is, one is an odd field and the other one is an even field); thus, they always correspond to snapshots at different spatial locations. For this reason, although the two fields that originated from one film frame represent snapshots at the same time instance, their difference may not be as small as expected. This makes it harder for the comparison results to follow the “1, 0, 1, 0, . . . ” pattern, even though the source is indeed 2-2 pull down, since the small differences and the large differences may not be clearly distinguishable.
(2) Even if the comparison results of neighboring fields with the same parity follow the “1, 0, 1, 0, . . . ” pattern, it is still not guaranteed that the detected video is a true 2-2 pull down source, because in theory, an interlaced video sequence may also have the same pattern.
(3) As for 3-2 pull down, as mentioned above, the repeated fields (e.g., field 3 and field 5 in FIG. 7F) should be exactly the same in theory, as they are the same field originated from film frame 2. This is true if field 5 is not stored during the Telecine procedure. The MPEG-2 standard defines a flag called “repeat_first_field” to handle this. If such a flag is detected during decoding, field 5 can be repeated using field 3 at the receiver, so these two fields will be exactly the same. However, if field 5 is encoded and stored, although the original sources for field 3 and field 5 are the same, the two reconstructed fields are different because they undergo different lossy compression (e.g., the rate control may assign different quantizers to these two fields, or they may have been assigned different reference frames during motion compensation). In addition, if the video is decoded and transmitted to the receiver through an analog channel, the introduced transmission noise will make these two fields quite different. All these possibilities make it difficult to identify the two fields that are supposed to be the same in theory.
(4) The techniques for 3-2 and 2-2 pull down detection discussed in the previous section are for ideal sources. In reality, however, there are many mixed sequences that consist of both film source and interlaced video source due to video editing. For example, when a movie is transferred to video for broadcasting or distribution on DVD, an entirely new electronic end title sequence may be created. Or, when the movie is displayed on TV, an added weather alert broadcast is usually 30 fps interlaced video. In such cases, the film mode detector may be confused when it tries to detect and hold a 3-2 or 2-2 sequence.
(5) During video editing, film can be concatenated with any other source, such as a video source or another film source, which may cause the original cadence to break. The result may be a 2-2, 3-3, or 4-1 cadence, to name just a few of the possibilities. Errors occur during the transition from one source to another if the same cadence is still used for re-interleaving. These errors show up as artifacts on screen. The most common artifact, a comb, happens when the video processor combines two fields of video that come from two different frames of film. FIG. 9 shows an example of what a comb would look like on screen. The functionality to detect such a transition is called bad edit detection. If such a field transition is detected, the processor switches to the real video de-interlacing method instead of using re-interleaving. All de-interlacing methods switch between film and video, but their strength lies in how quickly they can detect the error and switch. Many de-interlacers switch after it is too late. The goal is to switch to video mode before an artifact is observable and to switch back to film mode as quickly as possible.
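A comb can be measured on a woven frame with a simple heuristic, sketched below. The function name `comb_score`, the rule itself (a pixel counts when it is brighter or darker than both of its vertical neighbours by more than a threshold), and the threshold value are assumptions for illustration and are not taken from any of the detectors cited in this document.

```python
def comb_score(frame, threshold):
    """Count pixels that differ from both vertical neighbours in the same direction.

    frame is a list of lines, each a list of pixel values. A high score
    suggests the two woven fields came from different moments in time.
    """
    score = 0
    for j in range(1, len(frame) - 1):
        for above, pixel, below in zip(frame[j - 1], frame[j], frame[j + 1]):
            d_up = pixel - above
            d_down = pixel - below
            # Same sign means the pixel is a local extremum along the column.
            if d_up * d_down > 0 and min(abs(d_up), abs(d_down)) > threshold:
                score += 1
    return score
```

A frame whose lines alternate between two very different levels scores high, while a smooth vertical gradient scores zero; a detector could fall back to video mode when the score of a woven frame exceeds some limit.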
Challenges 1 and 3 mainly relate to film mode detection itself, and challenges 2, 4, and 5 directly relate to bad edit or mixed content detection. Next we briefly discuss some conventional 3-2 pull down and 2-2 pull down detection algorithms and implementations.
3-2 Pull Down Detection
As discussed above, the cadence of field differences between two successive fields of the same parity follows a particular pattern, if the source is 3-2 pull down. The field differences can be directly calculated as shown in FIGS. 8A-8B, where the schemes search for the particular pattern of the resulting field difference cadence. Field differences can also be indirectly measured using other characteristics such as motion vectors if they are available (e.g. from the MPEG bitstream).
As these implementations solely rely on the detected cadence of field differences, they are incapable of handling the aforementioned challenges such as mixed content and bad edit, even though they usually can handle pure and clean 3-2 pull down source very well.
As for 2-2 pull down detection, FIG. 8C shows one example implemented in U.S. Pat. No. 6,859,237. The field difference (the difference between two neighboring fields with different parity) is put into a block called the field rate accumulator, which accumulates the field differences of one field. Its output, A, and its one-field delayed output, B, are then compared. For robustness, A and B are not directly compared. Instead, the relative difference, i.e., their difference divided by their average, is compared with a threshold, the minimum ratio. If the relative difference is greater than the chosen minimum ratio, the field difference comparison result, i.e., the output of the AND gate, will be 1 or 0. The sequence of field difference comparison results is sent to a state machine, which searches for the “01” pattern discussed above to decide whether the sequence is a 2-2 pull down film source and, if it is, the phase of each field (i.e., which two fields belong to one progressive frame).
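The relative comparison and the pattern search can be sketched as below. This is not the implementation of U.S. Pat. No. 6,859,237: the names `compare_relative` and `Cadence22Detector`, the second condition A > B combined with the ratio test, and the `lock_count` parameter are all assumptions made to obtain a small runnable example.

```python
def compare_relative(a, b, min_ratio):
    """Return 1 if a exceeds b and their relative difference exceeds min_ratio.

    The relative difference is |a - b| divided by the average of a and b.
    Requiring a > b as a second condition is an assumption of this sketch.
    """
    average = (a + b) / 2
    if average == 0:
        return 0
    relative = abs(a - b) / average
    return 1 if (relative > min_ratio and a > b) else 0


class Cadence22Detector:
    """Tiny state machine that locks after lock_count consecutive flag alternations."""

    def __init__(self, min_ratio=0.5, lock_count=4):
        self.min_ratio = min_ratio
        self.lock_count = lock_count
        self.previous_diff = None
        self.previous_flag = None
        self.run = 0

    def update(self, accumulated_diff):
        """Feed one accumulated field difference; return True while locked."""
        if self.previous_diff is None:
            self.previous_diff = accumulated_diff
            return False
        flag = compare_relative(accumulated_diff, self.previous_diff, self.min_ratio)
        self.previous_diff = accumulated_diff
        if self.previous_flag is not None and flag != self.previous_flag:
            self.run += 1  # the "01" alternation continues
        else:
            self.run = 0   # alternation broken, or first flag seen
        self.previous_flag = flag
        return self.run >= self.lock_count
```

Using a ratio rather than an absolute threshold makes the decision independent of overall picture activity: differences of 2 and 78 alternate strongly, whereas differences hovering around 40 never produce a flag of 1 and the detector stays unlocked.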
Film Mode Detection Employing Combing Artifacts Detection
Combing artifacts detection has been employed in film mode detection with the goal to identify bad edit or mixed content edit. For example, in U.S. Pat. No. 6,859,237, a sawtooth artifact detector is employed to detect bad edits. The goal there is to detect the mixed content (e.g., 60i video overlaid on 24 Hz film), but the technique can be directly used for bad edit detection as well.