Streaming audio-video technology has grown at such a rapid rate in recent years that there is now a constant influx of vast amounts of audio-video content onto the Internet, available for access at any time from anywhere in the world. However, this abundance of audio-video data also gives rise to a significant challenge for content providers: how to allow users to examine such large amounts of audio-video data efficiently and receive concise representations of desired content. As a result, research on audio-video summarization has received increasing attention.
Much work on audio-video summarization to date has been carried out separately in two research communities, each concentrating on a different area. One is the speech and natural language processing community. Systems in this area rely almost exclusively on the text stream associated with an audio-video segment. The text stream is usually obtained through either closed captioning or transcribed speech, although limited non-text-related audio features are sometimes also used. Various techniques have been developed to analyze the text stream and perform story boundary detection, topic detection and topic tracking in broadcast news. The tracked stories then provide the foundation for text-based summarization. An exemplary article in this area is Jin et al., “Topic Tracking for Radio, TV Broadcast and Newswire,” Proc. of the DARPA Broadcast News Workshop, 199-204 (1999), the disclosure of which is hereby incorporated by reference.
The image and video processing community has also vigorously pursued audio-video summarization. Here, the emphasis has been on analyzing the image sequences in an audio-video segment and segmenting or clustering images based on various measures of visual similarity. An exemplary article in this area is Yeung et al., “Time-Constrained Clustering for Segmentation of Video into Story Units,” Proc. of the 13th Int'l Conf. on Pattern Recognition, 357-380 (1996), the disclosure of which is hereby incorporated by reference.
While such text and video processing techniques have helped in audio-video summarization, each draws largely on a single stream of the audio-video segment, either the text or the image sequence, and both leave room for improvement. A need therefore exists for techniques that improve upon the text and video processing techniques currently in use.