The present invention relates generally to multimedia meeting recordings and more specifically to access of multimedia meeting recordings.
Progress in the business world typically begins with a series of meetings. Meetings are usually drawn-out, unstructured affairs consisting mostly of irrelevant information. However, there usually are one or two defining moments during a meeting that can propel the enterprise toward success or, if missed, can result in yet another failed business venture.
Many businesspersons visit remote locations and participate in meetings with different people on a regular basis. A common task that must be performed at some time after the meeting is the creation of a summary of what happened during the meeting. The summary may include reports of who said what, the ideas that were conceived, the events that occurred, and the conclusions that were reached. Oftentimes, it is not just the specific conclusions but also the reasons they were reached and the points of view expressed by the meeting participants that are important.
Producing an accurate meeting summary is a time-consuming and error-prone process, especially if the only record available is one's own memory, perhaps supplemented with hastily handwritten notes. A commonly used portable memory aid is the audiocassette recorder. It can be effective, but lacks the ability to capture important events that could be helpful later, such as gestures, images of participants, body language, drawings, and so on. An easy-to-use method for incorporating video data would help solve this problem.
Meeting recordings can help. Capturing the content of meetings is useful in many respects. Recordings of a meeting can capture the meeting activity and then later be reviewed as needed to refresh one's memory. Knowing that a meeting is being recorded allows the participants to more effectively concentrate on the meeting discussion since the details can be later reviewed in an offline manner. Audio-visual recordings of meetings provide the capability for reviewing and sharing meetings, clarifying miscommunications, and thus increase efficiency.
Recognizing this need, several meeting recorder systems have been developed in recent years. A multimodal approach to creating meeting records based on speech recognition, face detection and people tracking has been reported in CMU's Meeting Room System described by Foote, J. and Kimber, D., FlyCam: Practical Panoramic Video and Automatic Camera Control, Proceedings of International Conference on Multimedia & Expo, vol. 3, pp. 1419-1422, 2000. A meeting recorder system is also described by Gross, R., Bett, M., Yu, H., Zhu, X., Pan, Y., Yang, J., Waibel, A., Towards a Multimodal Meeting Record, Proceedings of International Conference on Multimedia and Expo, pp. 1593-1596, New York, 2000.
However, people generally do not want to watch a recorded meeting from beginning to end. Like the meeting from which the recording was produced, recorded meetings are not amenable to a hit-or-miss search strategy. After fast-forwarding a few times in a meeting video while looking for something, most people will give up unless what they are seeking is important enough to justify the tedium.
More often than not, people are only interested in an overview of the meeting or just the interesting parts. Enabling efficient access to captured meeting recordings is essential in order to benefit from this content. Searching and browsing audiovisual information can be a time-consuming task. Two common approaches for overcoming this problem are key-frame based representations and summarization using video skims. Key-frame based representations have proven to be very useful for video browsing because they give a quick overview of the multimedia content. A key-frame based technique is described by S. Uchihashi, J. Foote, A. Girgensohn, and J. Boreczky, Video Manga: Generating Semantically Meaningful Video Summaries, ACM Multimedia, (Orlando, Fla.) ACM Press, pp. 383-392, 1999.
On the other hand, video skims are content-rich summaries that contain both audio and video. Efficiently constructed video skims can be used like movie trailers to communicate the essential content of a video sequence. For example, A. Waibel, M. Bett, et al., Advances in Automatic Meeting Record Creation and Access, Proceedings of ICASSP, 2001 propose summarization of meeting content using video skims with a user-determined length. The skims are generated based on relevance ranking and topic segmentation using speech transcripts.
A summarization technique for educational videos based on shot boundary detection, followed by word frequency analysis of speech transcripts, is suggested by C. Taskiran, A. Amir, D. Ponceleon, and E. J. Delp, Automated Video Summarization Using Speech Transcripts, SPIE Conf. on St. and Ret. for Media Databases, pp. 371-382, 2002.
A method for summarizing audio-video presentations using slide transitions and/or pitch activity is presented by He, L., Sanocki, E., Gupta, A., and Grudin, J., Auto-summarization of audio-video presentations, In Proc. ACM Multimedia, 1999. The authors suggest a method of producing presentation summaries using video channel, audio channel, speaker's time spent on a slide, and end user's actions. The authors discuss the use of pitch information from the audio channel, but their studies showed the technique as being not very useful for summary purposes. Instead, they indicate that the timing of slide transitions can be used to produce the most useful summaries.
Motion content in video can be used for efficiently searching and browsing particular events in a video sequence. This is described, for example, by Pingali, G. S., Opalach, A., Carlbom, I., Multimedia Retrieval Through Spatio-temporal Activity Maps, ACM Multimedia, pp. 129-136, 2001 and by Divakaran, A., Vetro, A., Asai, K., Nishikawa, H., Video Browsing System Based on Compressed Domain Feature Extraction, IEEE Transactions on Consumer Electronics, vol. 46, pp. 637-644, 2000. As a part of the Informedia™ project, Christel, M., Smith, M., Taylor, C. R., and Winkler, D., Evolving Video Skims into Useful Multimedia Abstractions, Proc. of the ACM CHI, pp. 171-178, 1998 compared video skimming techniques using (1) audio analysis based on audio amplitude and term frequency-inverse document frequency (TF-IDF) analysis, (2) audio analysis combined with image analysis based on face/text detection and camera motion, and (3) uniform sampling of video sequences. They reported that audio analysis combined with visual analysis yields significantly better results than skims obtained purely by audio analysis or by uniform sampling.
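By way of illustration only, TF-IDF weighting of transcript terms may be sketched as follows. The sketch assumes one common formulation (term count divided by document length, multiplied by the logarithm of the number of documents over the document frequency); it is not asserted to be the formulation used in the cited work, and the function name `tf_idf` and the tokenized-list input format are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation (one of several in common use):
      tf  = occurrences of the term / number of tokens in the document
      idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical transcript segments, already split into words.
segments = [
    ["budget", "budget", "meeting"],
    ["meeting", "schedule"],
]
segment_weights = tf_idf(segments)
```

In this sketch a term that appears in every segment, such as "meeting", receives a weight of zero, whereas a term concentrated in one segment, such as "budget", receives a positive weight and may therefore be treated as more characteristic of that segment.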
In Sun, X., Foote, J., Kimber, D., and Manjunath, Panoramic Video Capturing and Compressed Domain Virtual Camera Control, ACM Multimedia, pp. 229-238, 2001, a user-oriented view is provided based on speaker motion. A perhaps more intuitive solution is to compute the speaker direction as suggested by Rui, Y., Gupta, A., and Cadiz, J., Viewing Meetings Captured by an Omni-directional Camera, ACM CHI 2001, pp. 450-457, Seattle, March 31-Apr. 4, 2001. Techniques such as summarization and dialog analysis aimed at providing a higher level of understanding of the meetings to facilitate searching and retrieval have been explored by Hauptmann, A. G., and Smith, M., Text Speech and Vision for Video Segmentation: The Informedia Project, Proceedings of the AAAI Fall Symposium on Computational Models for Integrating Language and Vision, 1995.
Analysis of the audio signal is useful in finding segments of recordings containing speaker transitions, emotional arguments, topic changes, and the like. For example, in S. Dagtas, M. Abdel-Mottaleb, Extraction of TV Highlights using Multimedia Features, Proc. of MMSP, pp. 91-96, 2001, the important segments of a sports video were determined based on audio magnitude.
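A minimal sketch of ranking segments by audio magnitude is given below for illustration. It assumes that magnitude is measured as the root-mean-square (RMS) value of fixed-size, non-overlapping windows of samples; the cited work may measure magnitude differently, and the names `window_rms` and `loudest_windows` are hypothetical.

```python
import math


def window_rms(samples, window_size):
    """Return the RMS magnitude of each consecutive, non-overlapping window."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    rms_values = []
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]  # last window may be shorter
        rms_values.append(math.sqrt(sum(s * s for s in window) / len(window)))
    return rms_values


def loudest_windows(samples, window_size, top_k):
    """Return the indices of the top_k windows, loudest first."""
    rms_values = window_rms(samples, window_size)
    ranked = sorted(range(len(rms_values)),
                    key=lambda i: rms_values[i],
                    reverse=True)
    return ranked[:top_k]


# Hypothetical signal: silence, then a loud passage, then a quieter one.
signal = [0.0] * 4 + [1.0] * 4 + [0.5] * 4
candidates = loudest_windows(signal, window_size=4, top_k=2)
```

The returned window indices can be mapped back to time offsets by multiplying by the window size and dividing by the sampling rate.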
In a paper by Bagga, J. Hu, J. Zhong and G. Ramesh, Multi-source Combined-Media Video Tracking for Summarization, The 18th International Conference on Pattern Recognition (ICPR'02) Quebec City, Canada, August 2002, the authors discuss the use of text analysis (from closed captioning data) and video analysis to summarize video sequences. Text analysis is performed to find topics by using a similarity measure based on Salton's Vector Space Model. Visual analysis is based on dominant color analysis. The feature vectors found by text and visual analysis are normalized and combined into one feature vector, and hierarchical clustering is then applied to the combined vectors. Clustering is performed across multiple videos to find the common news stories. These stories are then used in the summary. Their technique identifies similarities across different video tracks.
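The normalize-and-combine step described above, together with a vector-space similarity measure, may be sketched as follows. The sketch assumes L2 normalization, concatenation as the means of combination, and cosine similarity as the measure; these choices are illustrative assumptions rather than the cited authors' stated method, the hierarchical clustering step is omitted, and all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def combine_features(text_vector, color_vector):
    """Normalize each modality separately, then concatenate the results."""
    return l2_normalize(text_vector) + l2_normalize(color_vector)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical text and dominant-color feature vectors for two video stories.
story_a = combine_features([3.0, 4.0], [1.0, 0.0])
story_b = combine_features([4.0, -3.0], [0.0, 1.0])
similarity = cosine_similarity(story_a, story_b)
```

Normalizing each modality before concatenation keeps either the text features or the color features from dominating the similarity merely because of scale. A pairwise similarity of this kind could then serve as the input to a clustering step.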
Speech content and natural language analysis techniques are commonly used for meeting summarization. However, language analysis-based abstraction techniques may not be sufficient to capture significant visual and audio events in a meeting, such as a person entering the room to join the meeting or an emotionally charged discussion. Therefore, it can be appreciated that continued improvement in the area of processing meeting recordings is needed to further facilitate effective access and retrieval of meeting recording information.