The present invention relates to the statistical analysis of digital video signals, and in particular to the statistical analysis of digital video signals for automated content interpretation in terms of semantic labels. The labels can be subsequently used as a basis for tasks such as content-based retrieval and video abstract generation.
Digital video is generally assumed to be a signal representing the time evolution of a visual scene. This signal is typically encoded along with associated audio information (e.g., in the MPEG-2 audiovisual coding format). In some cases information about the scene, or the capture of the scene, might also be encoded with the video and audio signals. The digital video is typically represented by a sequence of still digital images, or frames, where each digital image usually consists of a set of pixel intensities for a multiplicity of colour channels (e.g., R, G, B). This representation is due, in large part, to the grid-based manner in which visual scenes are sensed.
The visual signal and any associated audio signals are often mutually correlated in the sense that information about the content of the visual signal can be found in the audio signal and vice-versa. This correlation is explicitly recognised in more recent digital audiovisual coding formats, such as MPEG-4, where the units of coding are audiovisual objects having spatial and temporal localisation in a scene. Although this representation of audiovisual information is more attuned to the usage of the digital material, the visual component of natural scenes is still typically captured using grid-based sensing techniques (i.e., digital images are sensed at a frame rate defined by the capture device). Thus the process of digital video interpretation remains typically based on that of digital image interpretation and is usually considered in isolation from the associated audio information.
Digital image signal interpretation is the process of understanding the content of an image through the identification of significant objects or regions in the image and the analysis of their spatial arrangement. Traditionally the task of image interpretation required human analysis. This is expensive and time consuming; consequently, considerable research has been directed towards constructing automated image interpretation systems.
Most existing image interpretation systems involve low-level and high-level processing. Typically, low-level processing involves the transformation of an image from an array of pixel intensities to a set of spatially related image primitives, such as edges and regions. Various features can then be extracted from the primitives (e.g., average pixel intensities). In high-level processing, image domain knowledge and feature measurements are used to assign object or region labels, or interpretations, to the primitives and hence construct a description as to "what is present in the image".
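The two stages described above can be illustrated with a minimal sketch. It is not taken from any particular system: the region-growing rule, the tolerance `tol`, the labels "sky" and "ground" and the threshold of 128 are all assumptions chosen only to make the example concrete.

```python
from collections import deque


def segment_regions(image, tol=10):
    """Low-level step: grow 4-connected regions of pixels whose intensity
    is within `tol` of the region's seed pixel (an assumed, simple rule)."""
    h, w = len(image), len(image[0])
    assigned = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if assigned[y][x]:
                continue
            seed = image[y][x]
            assigned[y][x] = True
            pixels = []
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not assigned[ny][nx]
                            and abs(image[ny][nx] - seed) <= tol):
                        assigned[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def mean_intensity(image, pixels):
    """Feature measurement: average pixel intensity of one region."""
    return sum(image[y][x] for y, x in pixels) / len(pixels)


def label_region(mean):
    """High-level step: hypothetical domain knowledge that bright regions
    are sky and dark regions are ground."""
    return "sky" if mean >= 128 else "ground"


def interpret(image):
    """Return a (pixel count, label) pair for every region in the image."""
    return [(len(p), label_region(mean_intensity(image, p)))
            for p in segment_regions(image)]


image = [
    [200, 200, 205],
    [200, 198, 202],
    [50, 52, 49],
]
print(interpret(image))  # [(6, 'sky'), (3, 'ground')]
```

Each region is labelled here in isolation from its neighbours, using only its own feature measurement.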
Early attempts at image interpretation were based on classifying isolated primitives into a finite number of object classes according to their feature measurements. The success of this approach was limited by the erroneous or incomplete results that low-level processing often produces and by feature measurement errors caused by noise in the image. More recent techniques incorporate spatial constraints in the high-level processing, so that ambiguous regions or objects can often be recognised as a result of the successful recognition of neighbouring regions or objects.
More recently, the spatial dependence of region labels for an image has been modelled using statistical methods, such as Markov Random Fields (MRFs). The main advantage of the MRF model is that it provides a general and natural model for the interaction between spatially related random variables, and there are relatively flexible optimisation algorithms that can be used to find the (globally) optimal realisation of the field. Typically the MRF is defined on a graph of segmented regions, commonly called a Region Adjacency Graph (RAG). The segmented regions can be generated by one of many available region-based image segmentation methods. The MRF model provides a powerful mechanism for incorporating knowledge about the spatial dependence of semantic labels with the dependence of the labels on measurements (low-level features) from the image.
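As a rough illustration of labelling on a Region Adjacency Graph, the sketch below combines a per-region measurement cost with a penalty for disagreeing with neighbouring regions. It uses iterated conditional modes (ICM), a simple greedy scheme that reaches only a local optimum, rather than one of the globally optimal methods referred to above. The function name, the cost values and the Potts-style penalty are assumptions made for the example.

```python
def icm_labeling(adjacency, unary, pairwise_penalty=1.0, max_sweeps=10):
    """Greedy MRF labelling on a region adjacency graph.

    adjacency: dict mapping each region to the set of its neighbours.
    unary:     dict mapping each region to {label: cost}, where the cost
               stands in for a negative log-likelihood of the region's
               feature measurements under that label (assumed values).
    """
    # Start from the label each region prefers when considered in isolation.
    labels = {r: min(sorted(costs), key=costs.get) for r, costs in unary.items()}
    for _ in range(max_sweeps):
        changed = False
        for r in sorted(adjacency):
            def local_energy(label):
                # Measurement cost plus one penalty per disagreeing neighbour.
                disagree = sum(1 for n in adjacency[r] if labels[n] != label)
                return unary[r][label] + pairwise_penalty * disagree

            best = min(sorted(unary[r]), key=local_energy)
            if best != labels[r]:
                labels[r] = best
                changed = True
        if not changed:
            break
    return labels


# Three regions in a chain A - B - C; region B is ambiguous on its own.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
unary = {
    "A": {"sky": 0.1, "water": 2.0},
    "B": {"sky": 1.0, "water": 0.9},
    "C": {"sky": 0.2, "water": 2.0},
}
print(icm_labeling(adjacency, unary))  # {'A': 'sky', 'B': 'sky', 'C': 'sky'}
```

Taken alone, region B slightly prefers "water", but the spatial term lets its confidently labelled neighbours resolve the ambiguity.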
Digital audio signal interpretation is the process of understanding the content of an audio signal through the identification of words/phrases, or key sounds, and the analysis of their temporal arrangement. In general, investigations into digital audio analysis have concentrated on speech recognition because of the large number of potential applications for the resultant technology, e.g., natural language interfaces for computers and other electronic devices.
Hidden Markov Models are widely used for continuous speech recognition because of their inherent ability to incorporate the sequential and statistical character of a digital speech signal. They provide a probabilistic framework for the modelling of a time-varying process in which units of speech (phonemes, or in some cases words) are represented as a time sequence through a set of states. Estimation of the transition probabilities between the states requires the analysis of a set of example audio signals for the unit of speech (i.e., a training set). If the recognition process is required to be speaker independent then the training set must contain example audio signals from a range of speakers.
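The estimation of transition probabilities from a training set can be sketched as follows, under a strong simplifying assumption: the state sequence for each training example is taken to be already known, so the probabilities reduce to normalised transition counts. In practice the states are hidden and an iterative procedure such as Baum-Welch re-estimation would be used instead. The state names and the two example sequences are hypothetical.

```python
def estimate_transitions(state_sequences, states):
    """Estimate P(next state | current state) by counting transitions
    in training sequences whose states are assumed to be known."""
    counts = {s: {t: 0 for t in states} for s in states}
    for seq in state_sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values())
        # A state with no observed outgoing transitions gets all zeros.
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in states}
    return probs


# Hypothetical state alignments for one unit of speech from two speakers.
states = ["onset", "steady", "release"]
training_set = [
    ["onset", "steady", "steady", "release"],
    ["onset", "onset", "steady", "release"],
]
probs = estimate_transitions(training_set, states)
print(probs["onset"])   # onset -> onset 1/3, onset -> steady 2/3
print(probs["steady"])  # steady -> steady 1/3, steady -> release 2/3
```

Adding sequences from further speakers to `training_set` changes the counts, which is how a speaker-independent model would come to reflect a range of speakers.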
According to one aspect of the present invention there is provided a method of interpreting a digital video signal, wherein said digital video signal has contextual data, said method comprising the steps of:
segmenting said digital video signal into one or more video segments, each segment having a corresponding portion of said contextual data; and
analysing each video segment to provide a graph at one or more temporal instances in the respective video segment dependent upon said corresponding portion of said contextual data.
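The two steps of this aspect can be pictured with the following minimal sketch, which is offered only as an illustration and not as the claimed method. It assumes greyscale frames held as nested lists, segmentation at large frame-to-frame differences, contextual data held as one dictionary per frame, and a hypothetical context key `split` that controls how each frame is divided into bright and dark regions before a region adjacency graph is built.

```python
def frame_difference(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def segment_video(frames, context, threshold=50):
    """Split the video where consecutive frames differ strongly; each
    segment keeps the portion of contextual data for its own frames."""
    segments, start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frame_difference(frames[i - 1], frames[i]) > threshold:
            segments.append({"frames": frames[start:i], "context": context[start:i]})
            start = i
    return segments


def region_adjacency_graph(frame, split):
    """Nodes are 4-connected regions of bright (>= split) or dark pixels;
    edges join regions that share a pixel boundary."""
    h, w = len(frame), len(frame[0])
    region = [[None] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if region[y][x] is not None:
                continue
            bright = frame[y][x] >= split
            region[y][x] = count
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and region[ny][nx] is None
                            and (frame[ny][nx] >= split) == bright):
                        region[ny][nx] = count
                        stack.append((ny, nx))
            count += 1
    edges = set()
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < h and nx < w and region[y][x] != region[ny][nx]:
                    edges.add(tuple(sorted((region[y][x], region[ny][nx]))))
    return {"nodes": list(range(count)), "edges": sorted(edges)}


def analyse_segment(segment, instants=(0,)):
    """Provide one graph per chosen temporal instance, built using that
    instance's contextual data (the hypothetical 'split' value)."""
    return [
        {"instant": t,
         "graph": region_adjacency_graph(segment["frames"][t],
                                         segment["context"][t].get("split", 128))}
        for t in instants if t < len(segment["frames"])
    ]


half = [[0, 0], [255, 255]]
full = [[255, 255], [255, 255]]
frames = [half, half, full, full]
context = [{"split": 128} for _ in frames]
for seg in segment_video(frames, context):
    print(analyse_segment(seg))
```

In this toy input the first segment yields a graph with two adjacent regions and the second a graph with a single region and no edges.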
According to another aspect of the present invention there is provided an apparatus for interpreting a digital video signal, wherein said digital video signal has contextual data, said apparatus comprising:
means for segmenting said digital video signal into one or more video segments, each segment having a corresponding portion of said contextual data; and
means for analysing each video segment to provide an analysis token for one or more regions contained in the respective video segment dependent upon said corresponding portion of said contextual data.
According to still another aspect of the present invention there is provided a computer program product comprising a computer readable medium having recorded thereon a computer program for interpreting a digital video signal, wherein said digital video signal has contextual data, said computer program product comprising:
means for segmenting said digital video signal into one or more video segments, each segment having a corresponding portion of said contextual data; and
means for analysing each video segment to provide an analysis token for one or more regions contained in the respective video segment dependent upon said corresponding portion of said contextual data.