For many decades the film industry has used a picture or frame rate of 24 images per second (i.e. 24 Hz) to capture moving material. When television technology was emerging in the 1930s, different frame rates were chosen despite the existence of this early film standard. Indeed, whilst the film standard of 24 Hz was effectively a worldwide standard, television picture rates polarised into two standards based on the frequencies of the power distribution systems.
In Europe, and many other regions of the world, the power frequency is 50 Hz, and hence in these regions the television frame rate is usually 25 Hz. This is because television is primarily an interlaced display format, i.e. one that displays alternating fields (TOP and BOTTOM) to produce the moving picture, as shown in FIG. 1. Because the interlaced field display rate is locked to the power system frequency of 50 Hz, and two fields make up one frame, the resulting frame rate is half that, namely 25 Hz.
The difference between the two frame frequencies (24 Hz versus 25 Hz) is small, so film can easily be presented on European televisions at 25 Hz by running it slightly faster than normal (about 4.17% faster). Although the sound pitch is then raised by roughly 0.7 of a semitone, viewers still find this acceptable. Thus there is no significant problem associated with presenting film in the 50 Hz television environment.
However, in the United States of America, and in some other regions that have adopted the USA's power frequency, the power frequency is 60 Hz, so the television field rate in these regions is 60 Hz and the frame rate is 30 Hz. This frame frequency was modified to 29.97 Hz (exactly 30000/1001 Hz) when the colour television system defined by the National Television System Committee (NTSC) was adopted in the 1950s, and it has been used ever since.
The discrepancy between the original 24 Hz film rate and the current NTSC frame rate of 29.97 Hz is approximately 25%. Because of this large discrepancy, presenting films on NTSC television (or other 60 Hz television formats) is not straightforward, and cannot be done simply by speeding up film playback.
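The rate arithmetic above can be checked with a short calculation. This is a minimal sketch. The helper names are hypothetical. It assumes equal-tempered semitones, so the pitch change is 12·log2 of the speed ratio, and it uses the exact NTSC frame rate of 30000/1001 Hz.

```python
import math
from fractions import Fraction

FILM_RATE = Fraction(24)
PAL_FRAME_RATE = Fraction(25)
NTSC_FRAME_RATE = Fraction(30000, 1001)  # the "29.97 Hz" colour NTSC frame rate

def speedup_percent(source, target):
    """Percentage by which playback must run faster to go from the source rate to the target rate."""
    return float((target / source - 1) * 100)

def pitch_shift_semitones(source, target):
    """Pitch rise, in equal-tempered semitones, when audio is sped up by target/source."""
    return 12 * math.log2(float(target / source))

print(f"24 -> 25 Hz: {speedup_percent(FILM_RATE, PAL_FRAME_RATE):.2f}% faster, "
      f"{pitch_shift_semitones(FILM_RATE, PAL_FRAME_RATE):.2f} semitones sharp")
# 24 -> 25 Hz: 4.17% faster, 0.71 semitones sharp
print(f"24 -> 29.97 Hz: {speedup_percent(FILM_RATE, NTSC_FRAME_RATE):.1f}% discrepancy")
# 24 -> 29.97 Hz: 24.9% discrepancy
```

A 25% speed-up would raise the pitch by almost four semitones. This is why simple speed-up is not an option for 60 Hz systems.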
As mentioned briefly above, all current standard definition television systems use interlace, where each complete television frame is transmitted in two parts called fields. FIG. 1 shows how a video frame 10 is formed of two fields: a TOP field 20 comprising the odd lines, and a BOTTOM field 30 comprising the even lines.
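The frame/field relationship of FIG. 1 can be illustrated with a minimal sketch. It assumes a frame is held as a list of raster lines, top line first. The names `split_fields` and `weave_fields` are hypothetical helpers, not part of any standard API.

```python
def split_fields(frame):
    """Split an interlaced frame (list of raster lines, top line first) into (TOP, BOTTOM).

    Lines are numbered from 1 in the raster scan, so the odd-numbered lines
    (1, 3, 5, ...) sit at even list indices (0, 2, 4, ...).
    """
    top = frame[0::2]     # odd-numbered lines 1, 3, 5, ... -> TOP field
    bottom = frame[1::2]  # even-numbered lines 2, 4, 6, ... -> BOTTOM field
    return top, bottom

def weave_fields(top, bottom):
    """Re-interleave a TOP field and a BOTTOM field into a full frame."""
    frame = []
    for i in range(max(len(top), len(bottom))):
        if i < len(top):
            frame.append(top[i])
        if i < len(bottom):
            frame.append(bottom[i])
    return frame

frame = ["line1", "line2", "line3", "line4", "line5", "line6"]
top, bottom = split_fields(frame)
print(top)     # ['line1', 'line3', 'line5']
print(bottom)  # ['line2', 'line4', 'line6']
assert weave_fields(top, bottom) == frame
```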
The first field to be transmitted comprises the odd numbered lines of the raster scan and the second field to be transmitted comprises the even numbered lines of the raster scan.
FIG. 2 shows four frames of a film sequence as eight fields, i.e. the normal sequence of fields in an exemplary four-frame portion of video.
The method of mapping the 24 frames per second that are captured during film production onto the 29.97 frames per second of an NTSC TV system is called “3:2 pull down”. This involves the exact repetition of some of the fields of some television frames in a pattern that ensures that 24 film frames occupy the same time as the 30 corresponding television frames, i.e. 4 film frames to each set of 5 television frames. The small difference between 29.97 and 30 Hz is not material to this process.
In the 3:2 pull down process, the 24 film frames per second are distributed over 30 video frames per second, arranged as 60 separate television or video fields. At the start of each 3:2 sequence, a film frame is scanned and placed on 3 fields of the video output, where the third field is a repeat of the first and forms the first field of a new television frame. The next film frame is then scanned and placed on the next two television fields, but in reverse order: the bottom field goes first to complete the second television frame, and the top field goes last to form the first field of a third television frame. The third film frame is scanned and placed (i.e. mapped) onto the next three fields, again with a repeated field (bottom, top, bottom). The fourth film frame then occupies the final two fields, top field first, forming the fifth television frame, and the cycle repeats. The result is that every 4 film frames are converted into 5 television frames (or 10 fields of NTSC interlaced video) in a pattern of television field distribution like that shown in FIG. 3.
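The field distribution described above can be reproduced with a short sketch. It assumes the sequence starts with a 3-field film frame on a TOP field. Fields are represented only by (film frame, parity) labels rather than pixel data. The function names are hypothetical.

```python
def pulldown_32(film_frames):
    """Map film frames to an interlaced field sequence using a 3:2 cadence.

    Film frames alternately occupy 3 and 2 fields. Field parity simply alternates
    TOP, BOTTOM, TOP, ..., so a 3-field film frame ends on the parity it started
    with (the repeated field) and flips the starting parity of the next film frame.
    Returns a list of (film_frame, parity) tuples.
    """
    fields = []
    parity = "T"  # the first transmitted field is the TOP field
    for i, film_frame in enumerate(film_frames):
        count = 3 if i % 2 == 0 else 2
        for _ in range(count):
            fields.append((film_frame, parity))
            parity = "B" if parity == "T" else "T"
    return fields

def to_tv_frames(fields):
    """Pair consecutive fields into television frames (first field, second field)."""
    return [tuple(fields[i:i + 2]) for i in range(0, len(fields) - 1, 2)]

fields = pulldown_32(["A", "B", "C", "D"])
tv_frames = to_tv_frames(fields)
for first, second in tv_frames:
    print(f"{first[0]}{first[1].lower()} {second[0]}{second[1].lower()}")
# At Ab / At Bb / Bt Cb / Ct Cb / Dt Db  (one line each)

mixed = sum(1 for first, second in tv_frames if first[0] != second[0])
print(len(fields), "fields,", len(tv_frames), "TV frames,", mixed, "mixing two film frames")
# 10 fields, 5 TV frames, 2 mixing two film frames
```

In this cadence two of the five television frames (the second and third) combine fields from different film frames.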
The application of video compression to such a sequence of television fields presents specific problems. For example, in sequences which have had a 3:2 pull down process applied, if frame-based coding is used, then two in every five television frames consist of two fields from different source film frames, leading to poor compression and unwanted visible artifacts. Meanwhile, if field-based coding is used, the two redundant fields created by field repetition are unnecessarily re-coded in every 10 fields, resulting in inefficient compression.
Consequently, the syntax of MPEG-2 digital video provides a means of avoiding this inefficiency using flags, so that such sequences are coded at 24 frames per second and the field repetition is performed at the decoder. These flags are Repeat First Field (RFF) and Top Field First (TFF). Similarly, Supplemental Enhancement Information (SEI) messages are used in H.264 to indicate repeated fields. It is therefore highly desirable to be able to detect repeated fields in video sequences reliably, both to improve the compression achieved and to improve the perceived video quality.
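The way these flags absorb the cadence can be sketched as follows. In MPEG-2 the flags are carried as the `top_field_first` and `repeat_first_field` syntax elements of the picture coding extension. The helpers below are hypothetical illustrations under the assumption of a known 3:2 field count per film frame. They only compute flag values and do not produce a bitstream.

```python
def pulldown_flags(field_counts, first_parity="T"):
    """Derive (top_field_first, repeat_first_field) for each coded film frame.

    field_counts: number of display fields (2 or 3) that each film frame occupies.
    """
    flags = []
    parity = first_parity
    for count in field_counts:
        if count not in (2, 3):
            raise ValueError("each film frame must occupy 2 or 3 fields")
        tff = 1 if parity == "T" else 0
        rff = 1 if count == 3 else 0
        flags.append((tff, rff))
        if count == 3:  # an odd number of fields flips the parity of the next frame
            parity = "B" if parity == "T" else "T"
    return flags

def decode_fields(flags):
    """Expand (TFF, RFF) flags into displayed field parities, as a decoder would."""
    out = []
    for tff, rff in flags:
        first, second = ("T", "B") if tff else ("B", "T")
        out += [first, second] + ([first] if rff else [])
    return out

flags = pulldown_flags([3, 2, 3, 2])
print(flags)                          # [(1, 1), (0, 0), (0, 1), (1, 0)]
print("".join(decode_fields(flags)))  # TBTBTBTBTB
```

Four coded frames thus yield ten displayed fields with strictly alternating parity. Correct flags depend on knowing where the repeated fields are, which is why reliable detection matters.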
Some 3:2 detection schemes exist that require prior indication of the format of the input video, and assume that all the video is consistently in the same format and quality. Thus, in known practical hardware dedicated to detecting the 3:2 sequence, the input is known in advance to be in that form.
However, such advance indication of the input video type is of little use in the pre-processing stage of a video compression encoder, where the formats of the sources can be arbitrarily mixed. The signal may contain inserts from several different sources edited together: some material is derived from film stock and has been through 3:2 conversion, other portions are native, naturally interlaced television video, and still further portions may be progressive in origin. The pre-processor must therefore be able to handle all of these formats with no prior knowledge of the signal format or behaviour, and without degrading the picture quality.
For sequences without noise, repetitive fields can easily be detected in interlaced video by computing the temporal difference between fields of the same parity (that is, both Top or both Bottom fields) in adjacent frames. For example, as shown in FIG. 4, comparing repetitive fields produces a zero temporal difference (40a), whereas comparing non-repetitive fields produces a non-zero temporal difference (40b). The temporal difference values therefore show a definite repeating pattern over each 10-field period, from which the 3:2 pull down sequence can be identified.
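The noise-free detection just described can be sketched as follows. This is a minimal illustration with several assumptions. Fields are flat lists of pixel values in strictly alternating TOP/BOTTOM order, so the previous same-parity field is two positions back. The function names and the synthetic test pattern are hypothetical. The exact-zero test is only meaningful when there is no noise.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized fields (flat pixel lists)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def same_parity_differences(fields):
    """Temporal difference between each field and the previous field of the same parity.

    In an alternating T/B/T/B... sequence that is the field two positions back,
    so entry i of the result corresponds to fields[i + 2].
    """
    return [sad(fields[i], fields[i - 2]) for i in range(2, len(fields))]

def repeated_field_positions(fields):
    """Noise-free case only: a repeated field gives an exactly zero difference."""
    return [i + 2 for i, d in enumerate(same_parity_differences(fields)) if d == 0]

def film_field(frame_index, parity):
    """Hypothetical test content: distinct TOP and BOTTOM pixels for each film frame."""
    base = 10 * frame_index + (0 if parity == "T" else 5)
    return [base, base + 1, base + 2]

# 3:2 cadence over film frames 0..3: fields T,B,T | B,T | B,T,B | T,B
cadence = [(0, "T"), (0, "B"), (0, "T"), (1, "B"), (1, "T"),
           (2, "B"), (2, "T"), (2, "B"), (3, "T"), (3, "B")]
fields = [film_field(f, p) for f, p in cadence]
print(repeated_field_positions(fields))  # [2, 7]
```

The two zero differences fall five fields apart, giving the characteristic two-per-ten-field pattern.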
However, in the presence of noise, as well as other impairments to the video sequence such as film weave, the temporal difference between repetitive fields will not be zero. The difference value will then depend on the variance of these accumulated impairments, which is dominated by the noise.
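To see why the difference no longer vanishes, consider a simple model. This is an assumption for illustration, not taken from the text above. A field of N pixels is repeated exactly, but each copy carries independent zero-mean Gaussian noise of variance σ². The per-pixel difference is then Gaussian with variance 2σ², so the expected SAD of an exact repeat grows linearly with σ:

```latex
d_i = n_i^{(1)} - n_i^{(2)} \sim \mathcal{N}(0,\, 2\sigma^2), \qquad
\mathbb{E}\,|d_i| = \sqrt{2}\,\sigma\,\sqrt{\tfrac{2}{\pi}} = \frac{2\sigma}{\sqrt{\pi}} \approx 1.13\,\sigma,
\qquad
\mathbb{E}[\mathrm{SAD}] = \sum_{i=1}^{N} \mathbb{E}\,|d_i| = \frac{2N\sigma}{\sqrt{\pi}}
```

For example, a 720 × 240 field (N = 172 800) with σ = 2 grey levels gives E[SAD] ≈ 3.9 × 10^5 even for an exact repeat.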
For sequences with slow motion activity and noise, merely observing the values of temporal difference can lead to false or missed detections of repetitive fields. This is because non-repetitive fields can also produce small temporal difference values in such slow motion sequences. Existing solutions that operate in the time domain by analysing differences between repeating fields are therefore degraded by noise. It is highly desirable that any new repetitive field detection method be highly robust against noise for slow motion sequences.
Although there exist methods to detect the 3:2 pull down sequences, no method efficiently detects the repetitive fields in the presence of strong noise levels typically present in the source video sequences.
For example, U.S. Pat. No. 6,058,140 uses the combination of the magnitude of, and the correlation between, motion vectors of respective blocks in adjacent fields to identify the 3:2 pull down pattern.
U.S. Pat. No. 5,852,473 uses circuitry in which current and prior top and bottom fields are input to respective Sum of Absolute Difference (SAD) circuits, whose outputs are then subtracted to produce a difference signal. The difference signal is input to a spike detector, which detects and filters out spikes caused by scene changes and detects the maximum and minimum values of the difference signal over the last five frames. These maximum and minimum values are further correlated and processed to generate a flag signal indicative of 3:2 pull down material.
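The signal flow described for U.S. Pat. No. 5,852,473 can be approximated in software as below. This is a loose, hypothetical sketch for illustration only, not the patented circuitry. Fields are flat pixel lists, and the spike detector and the final correlation stage are omitted. The five-frame window is the only parameter taken from the description.

```python
from collections import deque

def sad(a, b):
    """Sum of absolute differences between two equally sized fields."""
    return sum(abs(x - y) for x, y in zip(a, b))

def difference_signal(frames):
    """frames: list of (top_field, bottom_field); one value per frame after the first.

    Each value is SAD(current top, prior top) - SAD(current bottom, prior bottom).
    """
    return [sad(frames[i][0], frames[i - 1][0]) - sad(frames[i][1], frames[i - 1][1])
            for i in range(1, len(frames))]

def window_extremes(signal, window=5):
    """Running (max, min) of the difference signal over the last `window` values."""
    recent = deque(maxlen=window)
    extremes = []
    for value in signal:
        recent.append(value)
        extremes.append((max(recent), min(recent)))
    return extremes

def film_field(k, parity):
    """Hypothetical pixel content: distinct fields for every film frame k."""
    base = 10 * k + (0 if parity == "T" else 5)
    return [base, base + 1, base + 2]

# Five 3:2 television frames from film frames 0..3: (0T,0B) (0T,1B) (1T,2B) (2T,2B) (3T,3B)
tv = [(0, 0), (0, 1), (1, 2), (2, 2), (3, 3)]
frames = [(film_field(t, "T"), film_field(b, "B")) for t, b in tv]
signal = difference_signal(frames)
print(signal)                   # [-30, 0, 30, 0]
print(window_extremes(signal))  # [(-30, -30), (0, -30), (30, -30), (30, -30)]
```

In this noise-free example the repeated top field and the repeated bottom field appear as the negative and positive extremes of the difference signal.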
U.S. Pat. No. 7,203,238 uses field motion vectors determined by a motion estimator and compares the field motion vectors to a threshold to determine whether a repeated field exists.
U.S. Pat. No. 7,180,570 uses fuzzy logic to detect 3:2 pull down sequences, applied to field or frame differences computed with a SAD-based method. The fuzzy logic examines the field patterns expected of 3:2 pull down sequences over the five frames of each cycle.
None of the above-mentioned methods properly takes into account the noise levels present in the input video sequences, and hence they can produce incorrect results in the presence of noise. For example, they may estimate the magnitude of the noise component of the television signal from the average value of the luminance component. This approach assumes that the noise levels are higher for dark scenes than for bright scenes. Accordingly, the threshold is set to a low value if the average luminance is high, and to a high value if the average luminance is low.
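The luminance-based thresholding criticised above can be sketched as follows to make its assumption explicit. All thresholds and the dark-level boundary are hypothetical values chosen for illustration, not values from any of the cited patents. Fields are flat lists of 8-bit luma samples.

```python
def mean_luma(field):
    """Average luminance of a field given as a flat list of 8-bit luma samples."""
    return sum(field) / len(field)

def luma_adaptive_threshold(field, dark_level=64, low_threshold=2.0, high_threshold=6.0):
    """Per-pixel difference threshold chosen from average luminance (hypothetical values).

    Assumes darker scenes are noisier, so a dark field gets the higher threshold.
    """
    return high_threshold if mean_luma(field) < dark_level else low_threshold

def looks_repeated(field, previous_same_parity):
    """Declare a repeat if the mean absolute difference is below the luma-based threshold."""
    mad = sum(abs(x - y) for x, y in zip(field, previous_same_parity)) / len(field)
    return mad < luma_adaptive_threshold(field)

print(looks_repeated([30, 34, 28, 32], [33, 30, 31, 29]))          # True  (MAD 3.25, threshold 6.0)
print(looks_repeated([200, 210, 190, 205], [196, 202, 198, 199]))  # False (MAD 6.5, threshold 2.0)
```

The bright field is rejected as a repeat even though its differences could come from noise alone. The threshold depends only on brightness, not on any measured noise level.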
It is therefore an object of the present invention to address these problems and improve the practical performance of MPEG-2 coding.