Multi-modal sharing and organization of information between users has been an area of research and study in the HCI community. A considerable body of work exists on annotating media with implicit tags drawn from natural interactions, and the resulting systems are sometimes called observant systems. High-quality tagging is critical for organizing media and provides a better experience during sharing and consumption. The lack of explicit meta-data tagging by users is a well-known problem, for the obvious reasons of tediousness and lack of motivation. Nevertheless, users commonly share media with their friends and family on personal devices, such as PCs, mobiles, PDAs, and the like.
The conversations that happen in the context of a shared media consumption scenario are rich with content related to the media. Conversations around photos, for example, include who is in the photo, who took it, where and when it was taken, what happened around it, and so on.
Existing methods for implicit tagging use information from only one modality, such as speech or text, to tag the media. These methods use speech information for tagging media in either a recognized speech mode or an un-recognized speech mode. The speech information may be captured from discussions during sharing of the media in a multi-user scenario, or from speech utterances by a user describing the media. In the recognized speech mode, a speech recognizer converts the speech information to text, and the media is tagged with that text. In the un-recognized speech mode, the speech information is used to tell a story about the media: the speech of a person describing a photograph is attached to the photograph, and whenever the photograph is viewed asynchronously, the speech can be played back to learn more about the photograph.
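The recognized speech mode described above can be sketched as follows. This is a minimal illustrative example, not any particular system's implementation: the recognizer is replaced by a stub transcript, and the function and field names (`tag_media`, `extract_tags`, `"transcript"`) are assumptions for illustration only.

```python
# Sketch of "recognized speech mode" tagging: convert speech to text
# (a stub transcript stands in for a real speech recognizer), then
# attach keyword tags to the media item. All names are hypothetical.

STOP_WORDS = {"the", "a", "an", "in", "at", "is", "was", "this", "that",
              "and", "of", "to", "it", "we", "i", "on"}

def recognize_speech(audio):
    # Placeholder for an actual speech recognizer; returns a transcript.
    return audio["transcript"]

def extract_tags(transcript):
    # Naive keyword extraction: lowercase, strip punctuation, drop stop words.
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    return sorted({w for w in words if w and w not in STOP_WORDS})

def tag_media(media, audio):
    # Tag the media item with keywords from the recognized speech.
    media.setdefault("tags", []).extend(extract_tags(recognize_speech(audio)))
    return media

photo = {"id": "photo_042"}
clip = {"transcript": "This was taken at the Grand Canyon in June"}
print(tag_media(photo, clip)["tags"])
# -> ['canyon', 'grand', 'june', 'taken']
```

Note that this naive keyword extraction tags everything said, whatever its relevance to the media; the limitation discussed next follows directly from that.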
However, the above existing methods also capture a lot of conversation that may not be related to the media; extracting tags from such conversation can produce irrelevant tags and hence a dissatisfying user experience. The brute-force approach would be to recognize and interpret all of the speech and then extract tag information, which is computationally expensive and may still yield many irrelevant tags.
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present subject matter in any way.