Akin to textual documents, multimedia documents, and especially audio-visual video content, have both syntactic and semantic structures. The semantic content, often referred to in terms of scenes, episodes, story-lines and, at a finer level, events, is high-level knowledge conveyed by a video programme (comparable with sections and paragraphs in a textual document). This contrasts with the low-level content description units of shots and frames (equivalent to sentences, words and letters in a textual document). With the advent of the digital era and the ubiquity of faster Internet connections, digital video content for both professional and domestic consumer environments is becoming available at an ever-increasing pace. However, these huge, mostly unstructured digital archives make it difficult, if not impossible, to access and search for desired information without time-consuming and laborious effort. Assistance from automatic image and multimedia handling tools for analysing, indexing and retrieving these documents would therefore be most welcome. This is especially true if the tools can interpret the semantic meaning of each document in addition to performing analysis at the syntactic level. Such tools would greatly benefit the content management industry, from content production and processing to asset reuse, synthesis and personalised delivery.
For further background, various concepts regarding the hierarchical organisation of a video structure are described below, including a summary of the definitions used herein, and in the art, regarding, for example, computable ‘logical story units’ and video editing techniques.
A number of references are listed at the end of the description and are referred to in the description by means of numerals appearing in square brackets.
The hierarchical model of a movie structure can usually be organised on a three-level basis, comprising (from low to high level) the shot level, event level, and episode (or scene) level.
A shot is a segment of audio-visual data filmed in a single camera take. Most multimedia content analysis tasks start with the decomposition of the entire video into elementary shots, which is necessary for the extraction of audio-visual content descriptors.
An event is the smallest semantic unit of a movie. It can be a dialogue, an action scene or, in general, a set of contiguous shots which share location and time. Two or more events may also alternate with one another, representing events that take place in parallel.
An episode (or scene) is normally defined to be a sequence of shots that share a common semantic thread and can contain one or more events.
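By way of illustration only, the three-level hierarchy described above may be sketched as a simple data model. The class and field names below are assumptions chosen for this sketch, not terms taken from the cited literature.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    """A segment of audio-visual data filmed in a single camera take."""
    start_frame: int
    end_frame: int

@dataclass
class Event:
    """The smallest semantic unit: contiguous shots sharing location and time."""
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Episode:
    """A scene: one or more events sharing a common semantic thread."""
    events: List[Event] = field(default_factory=list)

    def shots(self) -> List[Shot]:
        # All shots in the episode, in temporal order.
        return [s for e in self.events for s in e.shots]

# A dialogue event of two shots, forming a one-event episode.
dialogue = Event(shots=[Shot(0, 120), Shot(121, 250)])
episode = Episode(events=[dialogue])
print(len(episode.shots()))  # → 2
```

Such a model makes explicit that shot boundaries are computable from the signal, while the grouping of shots into events and episodes carries the semantic structure.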
Commonly, episode boundary detection is performed using only automatically detected low-level features without any prior knowledge. It is often the case, therefore, that the detected scene boundaries do not correspond precisely to those of an actual scene. To address this issue, researchers have introduced the so-called computable scene [6], or logical story unit (LSU) [1], which provides the best approximation to real movie episodes. Whereas actual scenes are defined by their semantic content, LSUs are defined in terms of specific spatio-temporal features that are characteristic of the scene under analysis.
Assuming that an event is related to a specific location (called ‘scenery’) occurring within a defined time interval in which certain movie characters are present, we can state that in general a scene can be characterised by global temporal consistency in its visual content. A definition of Logical Story Unit (LSU), taken from [1], is therefore:
“a series of temporally contiguous shots characterised by overlapping links that connect shots with similar visual content elements.”
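The overlapping-link criterion in this definition can be illustrated with a minimal sketch: each shot is linked to later shots (within a look-ahead window) whose visual descriptors are sufficiently similar, and an LSU boundary is declared wherever no link reaches past a shot. The function name, the window size and the form of the similarity measure are assumptions made for this sketch, not the specific method of [1].

```python
def segment_lsus(features, sim, threshold, window=4):
    """Group temporally contiguous shots into LSUs via overlapping links.

    features: a per-shot visual descriptor for each shot, in temporal order.
    sim(a, b): similarity between two descriptors, higher = more similar.
    A shot i is linked to a later shot j (j - i <= window) when
    sim(features[i], features[j]) >= threshold; an LSU ends at shot i
    when no link from any earlier shot spans beyond i.
    """
    n = len(features)
    boundaries = []
    reach = 0  # furthest shot index linked from any shot seen so far
    for i in range(n):
        for j in range(i + 1, min(i + 1 + window, n)):
            if sim(features[i], features[j]) >= threshold:
                reach = max(reach, j)
        if i >= reach:  # no link spans beyond shot i: LSU boundary here
            boundaries.append(i)
            reach = i + 1
    # Convert boundary indices into (start_shot, end_shot) LSU ranges.
    lsus, start = [], 0
    for b in boundaries:
        lsus.append((start, b))
        start = b + 1
    return lsus
```

For example, with five shots whose descriptors are `[0, 0, 0, 5, 5]` and an exact-match similarity, the sketch yields two LSUs, `(0, 2)` and `(3, 4)`, reflecting the change in visual content at the fourth shot.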
Turning now to movie editing techniques, we discuss below techniques that are useful for the discussion of example embodiments of the present invention. A reference listed at the end of the description provides a more thorough analysis of certain common conventions and techniques used in audio-visual media creation, focusing on different types of shots and scenes and their various uses in different contexts of a movie.
A shot can either be part of an event or serve for its ‘description’ [1]. This means that a shot can show a particular aspect of an event which is taking place (such as a human face during dialogue) or can show the scenery where the succeeding event takes place. In the following, these two kinds of shots are respectively referred to as ‘event’ shots and ‘descriptive’ shots.
Usually, the presence of a ‘descriptive’ shot at the beginning of an episode works as an introduction to the scenery for the following ‘event’ shots. For example, in the popular comedy film “Notting Hill”, a shot showing a bookshop from the outside appears many times, while the succeeding shots elaborate on what is happening inside the bookshop. It is clear that the episode comprises all the shots (the one outside the shop and those inside), but automated analysis may result in the first shot not being included as part of the bookshop LSU. In this case, the LSU boundaries do not correspond exactly to the actual scene boundaries, but provide the best possible approximation.
With respect to scenes, these are normally classified into two main categories [6] namely:
N-type: these scenes (normal scenes) are characterised by a long-term consistency of chromatic composition, lighting condition and sound; and
M-type: these scenes (montage scenes) are characterised by widely differing visual content (e.g., different locations, timing, lighting conditions, characters, etc.) but often exhibit long-term consistency in the audio content.
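The two-way classification above suggests a simple decision rule: measure long-term consistency separately in the visual and audio streams and label the scene accordingly. The sketch below assumes consistency scores normalised to [0, 1] and a single shared threshold; both are illustrative assumptions, not part of the classification in [6].

```python
def classify_scene(visual_consistency, audio_consistency, thresh=0.5):
    """Label a scene N-type or M-type from consistency scores in [0, 1].

    N-type (normal): long-term consistency of chromatic composition,
    lighting and sound, i.e. high visual consistency.
    M-type (montage): widely differing visual content but often
    consistent audio, i.e. low visual but high audio consistency.
    """
    if visual_consistency >= thresh:
        return "N"
    if audio_consistency >= thresh:
        return "M"
    return "unknown"  # neither pattern applies
```

In practice the scores could be derived from, for example, colour-histogram similarity across shots (visual) and spectral similarity of the soundtrack (audio); those feature choices are likewise assumptions of this sketch.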
Many post-production video programme genres, such as movies, documentaries and sitcoms, have underlying story-lines and semantic structures in addition to their syntactic compositions. Automatic detection of these logical video segments can enable interactive and personalised multimedia delivery and consumption by end users in an age of broadband connections or other fast network media access. As a consequence of these potential benefits, research has been conducted into such automatic detection techniques, as outlined below.