A ‘multimedia’ document or object essentially comprises a plurality of modalities. For example, a multimedia object may consist of an image accompanied by textual information, which may be designated as ‘tags’. A multimedia object may also consist of a web page comprising one or more images and textual content. A multimedia object may also consist of a scanned document divided into a plurality of channels, e.g. one channel including textual information obtained from an optical character recognition (OCR) process, and one channel including the illustrations and photographs identified in the document. A multimedia object may also consist of a video sequence separated into a plurality of channels, e.g. a visual channel including the images of the video sequence, a sound channel including the soundtrack of the sequence, a textual channel including subtitles or textual information originating from a speech-to-text transcription process, and a channel including metadata relating to the video sequence, such as its date, author, title or format.
It is understood that the present invention applies to any type of multimedia object, and is not necessarily limited to the aforementioned types of multimedia objects.
In practice, it may be desirable to establish a description of multimedia objects, e.g. for classification applications or for applications that search for multimedia objects in one or more databases. In a search application, a query may take the form of a multimedia document of the same kind as the object sought, or may be limited to one of the modalities of the object sought; e.g. where the multimedia object sought is an image associated with textual tags, a query may include only visual information, or only textual information. The search then consists in finding the multimedia documents in the database that best match the query, e.g. in order to present them in order of relevance.
The description of a multimedia document is difficult, owing to the heterogeneous nature of the modalities defining it. For example, as part of the classification of images associated with textual content, the visual modality may be transformed into feature vectors forming a low-level visual description, while the textual modality may be mapped onto a dictionary reflecting a language or a particular subdomain thereof. For the purposes of classifying a visual document or a textual document, use may be made of known supervised classification techniques, described below with reference to FIG. 1, and more particularly of ‘bag of words’ classification techniques. According to one supervised classification technique, features are extracted from a plurality of objects and fed, together with labels, to a learning system that produces a model, this processing being carried out offline. In a ‘test’ phase, a ‘test’ object undergoes feature extraction in a similar way, and the extracted features are compared with the model produced offline in order to enable a prediction, these steps being performed online.
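By way of illustration only, the offline and online phases of such a supervised ‘bag of words’ classification may be sketched as follows for the textual modality. The vocabulary, the training samples, the nearest-centroid model and the function names (`extract_features`, `train`, `predict`) are assumptions chosen for this sketch; the technique described with reference to FIG. 1 is not limited to them.

```python
from collections import Counter
import math

def extract_features(text, vocabulary):
    """Map a text onto a fixed vocabulary as a bag-of-words count vector."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def train(samples, vocabulary):
    """Offline phase: build one mean feature vector (centroid) per label."""
    sums, sizes = {}, Counter()
    for text, label in samples:
        vec = extract_features(text, vocabulary)
        acc = sums.setdefault(label, [0.0] * len(vocabulary))
        for i, value in enumerate(vec):
            acc[i] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(text, model, vocabulary):
    """Online phase: extract features the same way, then compare with the model."""
    vec = extract_features(text, vocabulary)
    return max(model, key=lambda label: cosine(vec, model[label]))

# Hypothetical vocabulary and labelled training objects (assumptions).
VOCABULARY = ["cat", "dog", "pet", "car", "road", "engine"]
TRAINING = [
    ("cat dog pet", "animal"),
    ("pet cat cat", "animal"),
    ("car road engine", "vehicle"),
    ("engine car car", "vehicle"),
]
MODEL = train(TRAINING, VOCABULARY)
```

In this sketch, a call such as `predict("the car engine", MODEL, VOCABULARY)` performs the online step: the test text is mapped onto the same vocabulary and assigned the label whose centroid is most similar.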
In order to remedy the problem related to the heterogeneity of the modalities, it is possible, according to a first technique known as late fusion, to describe and classify multimedia objects separately for each of the modalities according to which they are defined, and then to merge the results obtained for the different modalities at a late stage. The late fusion technique is described in detail below with reference to FIG. 2.
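By way of illustration only, the merging step of a late fusion may be sketched as a weighted average of the scores produced separately for each modality. The weighted-average rule, the function name `late_fusion`, the weights and the example scores are assumptions made for this sketch; other merging rules may equally be used.

```python
def late_fusion(scores_by_modality, weights):
    """Merge per-modality class scores into one score per label.

    scores_by_modality: {modality: {label: score}}, one entry per classifier.
    weights: {modality: weight}; only the modalities present are used, so a
    query limited to a single modality is handled as well.
    """
    total = sum(weights[m] for m in scores_by_modality)
    labels = set().union(*(s.keys() for s in scores_by_modality.values()))
    return {
        label: sum(
            weights[m] * scores.get(label, 0.0)
            for m, scores in scores_by_modality.items()
        ) / total
        for label in labels
    }

# Hypothetical outputs of two independent classifiers (assumptions).
visual_scores = {"beach": 0.7, "city": 0.3}
textual_scores = {"beach": 0.4, "city": 0.6}

fused = late_fusion(
    {"visual": visual_scores, "textual": textual_scores},
    {"visual": 0.5, "textual": 0.5},
)
best_label = max(fused, key=fused.get)
```

Each modality keeps its own feature extraction and its own model here; only the final scores are combined.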
According to an alternative method, known as early fusion, the modalities are merged at the feature extraction level. The early fusion technique is described in detail below with reference to FIG. 3.
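By way of illustration only, one simple form of early fusion consists in normalizing the feature vector of each modality and concatenating the results into a single vector, which is then supplied to a single learning system. The choice of L2 normalization followed by concatenation, the function names and the example vectors are assumptions made for this sketch and are not taken from the technique described with reference to FIG. 3.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def early_fusion(visual_features, textual_features):
    """Merge two modalities at the feature level by concatenation.

    Each modality is normalized first so that neither dominates the fused
    vector merely because of its scale.
    """
    return l2_normalize(visual_features) + l2_normalize(textual_features)

# Hypothetical low-level visual descriptor and bag-of-words vector (assumptions).
fused_vector = early_fusion([3.0, 4.0], [0.0, 2.0, 0.0])
```

The fused vector has as many components as the two input vectors combined, and a single model is trained on it, in contrast with late fusion, where one model is trained per modality.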