Recently, the entertainment industry has developed a new genre of movie/television entertainment called “Reality TV,” or “unscripted programming.” In this genre, untrained actors are placed in various settings with general direction and rules to guide their interactions, but without a specific script for actions or dialog. Beforehand, the production staff has a general idea of the storyline for the production, but the final storyline will depend upon the interactions that take place. Several video cameras are located within the settings and record the interactions among the actors for long periods of time. Various stimuli may be introduced into the settings by the production staff to provoke unpredictable interactions among the actors. After several settings have been videotaped for several hours over several days by several cameras, the production staff reviews hundreds to thousands of hours of videotape and constructs a final storyline for the show.
The mechanics of collecting and cataloging such a huge volume of videotape have given rise to the need to digitize the tapes and to automatically detect the beginnings and endings of scenes, and especially to detect the ending of the last scene on a videotape. The latter need is particularly important to address, because many tapes are recorded on only a portion of their full length due to the somewhat haphazard nature of the filming for this genre, and because the expense and time needed for digitization and cataloging should not be wasted on blank tape.
This task of automatically detecting the end of a scene is complicated by several factors. First, a scene often contains several minutes with no activity, so attempting to detect the end of a scene by detecting a lack of motion in the video image would not work. Second, the cameras are often placed in close proximity to the actors, who frequently walk in front of the camera lens during a scene, causing a considerable change in the brightness of the video image; attempting to detect the end of a scene by detecting a large change in brightness would therefore not work either. A further complicating factor is that video can be stored and conveyed in several different analog and digital formats, some of which have no data fields for time stamps. And even when a video medium does have data fields for time stamps, there remains the likelihood that the camera will be set to record without the time-stamp function being activated. Relying on time-stamp codes to determine the end of a scene is therefore not practical when hundreds to thousands of hours of video recordings need to be digitized and cataloged. The problem thus remains unsolved.
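The failure mode of the brightness-change heuristic can be illustrated with a minimal sketch. The function names and threshold below are hypothetical, not taken from the source; the point is only that an actor momentarily blocking the lens produces exactly the large brightness change the heuristic treats as a scene boundary, yielding a false positive while the scene continues.

```python
# Hypothetical sketch of the naive brightness-change heuristic described
# above. Frames are modeled as flat lists of 0-255 pixel values.

def mean_brightness(frame):
    """Average pixel data value of one frame."""
    return sum(frame) / len(frame)

def naive_scene_end(frames, threshold=60.0):
    """Return the index of the first frame whose mean brightness differs
    from the previous frame's by more than `threshold`, or None if no
    such jump occurs. (The inadequate heuristic, not a real detector.)"""
    prev = mean_brightness(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = mean_brightness(frame)
        if abs(cur - prev) > threshold:
            return i
        prev = cur
    return None

# Ten bright frames of an ongoing scene; at frame 5 an actor walks in
# front of the lens and the image darkens abruptly.
frames = [[200] * 16 for _ in range(10)]
frames[5] = [20] * 16           # lens momentarily blocked; scene continues
print(naive_scene_end(frames))  # flags frame 5: a false positive
```

The same toy harness shows the converse failure as well: a scene with several motionless minutes produces no brightness change at all, so neither motion nor brightness alone suffices.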
The accurate detection of the beginning of a scene also has important applications. One such application is in the field of television program distribution where local commercial programming has to be interleaved between the segments of a television program. The television program may be provided to the local broadcaster in either live feed form, or on videotape. In either case, the time periods for the local commercials are marked out on the video medium. A human operator views the medium as it is fed or played to local broadcast, and then switches a commercial segment onto the local broadcast at the end of each program segment, during the marked out periods. After the commercial programming segment ends, the human operator places a local-station identifier (or other filler) on the local broadcast until the next program segment starts. Here, the human operator is needed to watch for the start of the new program segment on the live feed or the videotape in order to remove the local-station identifier from the local broadcast and feed the next program segment onto the local broadcast. This is a tedious task for the human operator, and is prone to errors.
Before describing the present invention and how it solves the above problems related to scene detection, we provide a brief description of video media. Moving pictures can be electronically conveyed and stored on a variety of video media. A video medium comprises a sequence of video frames to be displayed in sequential order at a specific frame rate, with each video frame being one complete picture. The video medium further comprises a format for defining the aspect ratio of the display area, the resolution of the display area, and the frame rate of the video frames. There are analog formats such as NTSC, PAL, VHS, and BETAMAX, and digital formats such as MPEG-2, CCIR-601, AVI, and DV. Each format, and hence each video medium, defines an array of pixels for the video frames, and hence defines an array of pixels for displaying the visual images that are to be conveyed or stored by the medium. The pixel array is dimensioned to span an area having the same aspect ratio as that defined by the format, and to have a density of pixels along the dimensions of the area in accordance with the resolution defined by the format. The aspect ratio is usually square or rectangular, and the pixels are usually arranged in the form of rows and columns, although other patterns are possible (such as hexagonal close packed). Each video frame comprises a set of pixel data values for representing an image on the array of pixels, each pixel data value being associated with a pixel in the array.
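The relationships just described — a format defining resolution, aspect ratio, and frame rate, and each frame being one complete picture stored as an array of pixel data values — can be sketched as a minimal data model. All class and function names here are hypothetical illustrations, not part of the described invention; the 720×480 dimensions are merely one of the standardized sizes mentioned below.

```python
# Hypothetical data model of the video-medium concepts described above:
# a format defines the pixel-array dimensions and frame rate, and a
# frame is a rows-by-columns array of pixel data values.

from dataclasses import dataclass

@dataclass(frozen=True)
class VideoFormat:
    rows: int          # vertical resolution (rows of pixels)
    cols: int          # horizontal resolution (columns of pixels)
    frame_rate: float  # frames displayed per second

    @property
    def aspect_ratio(self) -> float:
        """Width-to-height ratio of the pixel array."""
        return self.cols / self.rows

def make_frame(fmt: VideoFormat, value: int = 0):
    """One complete picture: each pixel data value is associated with
    one pixel in the rows x cols array defined by the format."""
    return [[value] * fmt.cols for _ in range(fmt.rows)]

# Example: a 720x480 format at roughly NTSC frame rate.
fmt = VideoFormat(rows=480, cols=720, frame_rate=29.97)
frame = make_frame(fmt)
print(len(frame), len(frame[0]))  # 480 720
```

A video medium is then simply a sequence of such frames, displayed in order at `fmt.frame_rate`.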
In analog video formats, the number of rows of pixels is fixed at a predetermined number, while the analog signal theoretically enables an infinite number of columns of pixels. Conventional television screens, however, limit the number of columns so that, as a practical matter, the density of the columns is within 100% to 200% of the density of the rows. Some digital video formats have a preset number of pixels along each dimension of the display area, whereas others have user-defined values. For example, the CCIR-601 video format has 525 rows of pixels and 720 columns of pixels, whereas the MPEG-2 format permits the dimensions to be user-defined. However, there are several standardized dimensions for MPEG-2, ranging from 352×288 (columns×rows) to 720×480. Most video formats have set frame rates, whereas some (e.g., MPEG-2) have user-defined frame rates. For those formats that have user-defined parameters, the format sets aside digital bits within the video medium to convey or store the user-defined parameters so that the images stored by the video medium can be properly rendered. In addition, some video formats, such as NTSC, divide each frame into two interlaced fields (A-field and B-field) such that the two fields are shown in successive order at twice the frame rate. As indicated above, this wide variety of video formats is a complicating factor for the task of detecting scenes in video produced in the unscripted environment.
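The interlacing just described can be sketched in a few lines. The helper name is hypothetical; the sketch only shows the mechanics of splitting one frame into two fields, one holding the even-numbered rows and one the odd-numbered rows, to be displayed in succession at twice the frame rate.

```python
# Hypothetical sketch of NTSC-style interlacing: one frame (a list of
# pixel rows) is divided into two fields that together cover every row.

def split_fields(frame):
    """Split a frame into its two interlaced fields."""
    a_field = frame[0::2]  # rows 0, 2, 4, ...
    b_field = frame[1::2]  # rows 1, 3, 5, ...
    return a_field, b_field

# A 6-row toy frame in which each pixel stores its own row index.
frame = [[r] * 4 for r in range(6)]
a, b = split_fields(frame)
print([row[0] for row in a])  # [0, 2, 4]
print([row[0] for row in b])  # [1, 3, 5]
```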