1. Technical Field
The present invention relates to a method of analysing a sequence of video frames, the method being used particularly but not exclusively for selection of video recordings according to user preferences and for providing highlights from a video sequence.
2. Related Art
The number of digital video databases in both professional and consumer sectors is growing rapidly. These databases are characterised by a steadily increasing capacity and content variety. Since searching manually through terabytes of unorganised data is tedious and time-consuming, transferring search and retrieval tasks to automated systems is essential for handling stored video efficiently.
Such automated systems rely upon algorithms for video content analysis, using models that relate certain signal properties of a video recording to the actual video content.
Because a video recording can be analysed in many ways, its content can be perceived in many different ways. Three levels of video content perception are defined, corresponding to three techniques for analysing a video recording: the feature level, the cognitive level and the affective level.
Video analysis algorithms generally start at the feature level. Examples of features are how much red is in the image, or whether objects are moving within a sequence of images. Specifying a search task at this level is usually the simplest option (e.g. “Find me a video clip featuring a stationary camera and a red blob moving from left to right!”).
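By way of illustration only, the two example features mentioned above (the amount of red in an image and the presence of motion between images) can be sketched as follows. The frame representation, the function names red_fraction and motion_score, and the normalisation to the range 0 to 1 are assumptions made for this sketch and are not taken from any particular known algorithm.

```python
from typing import List, Tuple

# Assumed representation: a frame is a list of rows of (R, G, B) pixels, each 0..255.
Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]


def red_fraction(frame: Frame) -> float:
    """Share of the total pixel intensity that is carried by the red channel."""
    red = 0
    total = 0
    for row in frame:
        for r, g, b in row:
            red += r
            total += r + g + b
    return red / total if total else 0.0


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute per-channel difference between two equally sized frames, scaled to 0..1."""
    diff = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            diff += sum(abs(a - b) for a, b in zip(p, c))
            count += 3  # three channels compared per pixel
    return diff / (255 * count) if count else 0.0
```

A search task at the feature level could then be expressed as thresholds on such values, for example a high red_fraction combined with a non-zero motion_score; any such thresholds would be a further assumption.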
At the cognitive level a user is searching for “facts”. These facts can be, for example, a panorama of San Francisco, an outdoor or an indoor scene, a broadcast news report on a defined topic, a movie dialogue between particular actors or the parts of a basketball game showing fast breaks, steals and scores.
Specifying a search task at the cognitive level implies that a video analysis algorithm is capable of establishing complex relations among features and recognising, for instance, real objects, persons, scenery and story contexts. Video analysis and retrieval at the cognitive level can be provided using advanced techniques in computer vision, artificial intelligence and speech recognition.
Most of the current worldwide research efforts in the field of video retrieval have so far been invested in improving analysis at the cognitive level.
Owing to the rapidly growing technological awareness of users, the availability of automated systems that can optimally prepare video data for easy access is important for commercial success of consumer-oriented multimedia databases. A user is likely to require more and more from his electronic infrastructure at home, for example personalised video delivery. Since video storage is likely to become a buffer for hundreds of channels reaching a home, an automated system could take into account the preferences of the user and filter the data accordingly. Consequently, developing reliable algorithms for matching user preferences to a particular video recording is desirable in order to enable such personalised video delivery.
In this description we define the affective content of a video recording as the type and amount of feeling or emotion contained in a video recording which is conveyed to a user. Video analysis at the affective level could provide, for example, shots with happy people, a romantic film or the most exciting part of a video recording.
While cognitive level searching is one of the main requirements of professional applications (journalism, education, politics etc.), other users at home are likely to be interested in searching for affective content rather than for “all the clips where a red aeroplane appears”. For example, finding photographs having a particular “mood” was the most frequent request of advertising customers in a study of image retrieval made with Kodak Picture Exchange. A user may want to search for the “funniest” or “most sentimental” fragments of a video recording, as well as for the “most exciting” segments of a video recording depicting a sporting event. Also, in the case of a complex and large TV broadcast such as the Olympic Games, the user is not able to watch everything, so it is desirable to be able to extract highlights.
Extraction of the “most interesting” video clips and their concatenation into a “trailer” is a particularly challenging task in the field of video content analysis. Movie producers hope to achieve enormous financial profits by advertising their products (movies) using movie excerpts that last only several tens of seconds but are capable of commanding the attention of a large number of potential cinemagoers. Similarly, other categories of broadcasts, especially sporting events, advertise themselves among TV viewers using the “most touching scenes in the sport arena” with the objective of selling their commercial blocks as profitably as possible. When creating the trailer, affective analysis of the video recording to be abstracted can provide the most important clues about which parts of the recording are most suitable for inclusion. Such a trailer can also be created remotely, directly at the user's home.
However, known algorithms do not address video analysis at the third, affective level. Assuming that a “cognitive” analysis algorithm has been used to find all video clips in a database that show San Francisco, additional manual effort is required to filter the extracted set of clips and isolate those that radiate a specific feeling (e.g. “romantic”) or those that the user simply “likes most”.