It is becoming increasingly common to use computer technology to conduct and/or record conferences. A “conference”, as used herein, is any form of interaction that takes place between two or more participants during one or more specified periods of time. Conventional types of conferences include in-person meetings, phone conferences and video conferences. However, as new technology has become available, new forms of conferences have been developed. Thus, the interactions that constitute a conference can occur without using technology to facilitate the interaction (e.g., sitting around a conference table), using technology to facilitate the interaction (e.g., phone, video or web conferencing), and/or immersed within technology (e.g., interactions between avatars in a virtual world). Co-browsing the Internet, over-the-web multimedia presentations, interactions during a “quest” between players of an online game, and an interactive lecture in a distance learning management system are merely some examples of the myriad new forms of conferencing that have been developed in the relatively recent past.
New technology has not only resulted in new forms of conferencing, but has also eliminated the need for all participants to be physically present in the same location. For example, it is now common for business meetings to include one or more remote participants, or for all participants to be spread across the continent or around the world.
Conferences are often recorded for parties that are unable to attend the conferences. Even when all parties are able to attend, it is common for conferences to be recorded for later reference by the original participants or third parties. Once a set of conferences has been recorded, the usefulness of the conference recordings is largely dictated by how easily users are able to (a) locate the recordings of conferences in which they are interested, and then (b) locate content, within the conference recordings, in which the users are interested.
One way to locate the conference recordings in which one is interested is to search through the recordings based on metadata associated with the recordings. For example, audio conference recordings may take the form of digital audio files. Each digital audio file will typically have metadata such as a filename, file size, and file creation date. Using such metadata information, users may be able to identify the conference recording in which they are interested. For example, if a user knows that a particular presentation was given at a board meeting in March 2011, the user can search for the file that has “board” in the filename and has a file date that falls in the month of March 2011.
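The file-metadata search described above can be illustrated with a minimal sketch. The `RecordingFile` type, the `find_by_file_metadata` function, and the sample filenames, sizes, and dates are all hypothetical and introduced only for illustration; an actual system might read this metadata from a file system or catalog instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RecordingFile:
    """Hypothetical stand-in for the file metadata of one conference recording."""
    filename: str
    size_bytes: int
    created: date


def find_by_file_metadata(files, name_fragment, year, month):
    """Return the files whose name contains name_fragment (case-insensitive)
    and whose creation date falls in the given year and month."""
    fragment = name_fragment.lower()
    return [
        f for f in files
        if fragment in f.filename.lower()
        and f.created.year == year
        and f.created.month == month
    ]


# Illustrative sample data; not taken from any real system.
recordings = [
    RecordingFile("board_meeting_q1.mp3", 48_000_000, date(2011, 3, 14)),
    RecordingFile("board_meeting_q2.mp3", 51_000_000, date(2011, 6, 20)),
    RecordingFile("standup.mp3", 9_000_000, date(2011, 3, 2)),
]

# Search for "board" in the filename with a file date in March 2011.
matches = find_by_file_metadata(recordings, "board", 2011, 3)
```

In this sketch, the search succeeds only because the user already knows both a filename fragment and the approximate date of the conference.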
Unfortunately, file metadata is not always helpful in locating desired information in conference recordings. For example, a user may be interested in discussions about a particular topic, but not know when, or in which type of meetings, the topic was discussed. In such a situation, if file metadata were all the information available for locating the correct conference recordings, the user would have no practical way to find them.
Due to the limitations of finding desired information in conference recordings based on file metadata alone, technologies have been developed for associating additional metadata with the conference recordings. Such additional metadata may be added manually after a conference has been recorded. For example, a designated user may listen to the conference recordings and “tag” the conference recordings with keywords that describe the topics being discussed. However, after-the-fact manual generation of keyword metadata is so time-consuming as to be virtually infeasible for large sets of conference recordings.
To avoid the overhead of after-the-fact manual tagging, technologies have been developed for analyzing the media contained in conference recordings, such as audio, video, and streaming media, to supplement the file metadata of the conference recordings with automatically-generated metadata. Examples of technologies that may be used to automatically generate after-the-fact metadata for a conference include speech recognition, optical character recognition, natural language processing, information retrieval of captions, etc. The automatically-generated metadata that is produced by analyzing a particular conference recording is associated with the conference recording so that the automatically-generated metadata may be used as the basis for user searches.
For example, assume that a particular document was displayed during a video conference. An analysis tool may analyze the recording of the video conference, detect that a document is being displayed in the video, perform optical character recognition on a frame of the video that includes the document to obtain text from the document, determine which words within the text are keywords, and associate those keywords with the conference recording. After those keywords have been associated with the conference recording, a user could locate the conference recording by performing a search that involves one or more of those keywords.
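The final steps of the example above (determining keywords and associating them with a recording for later search) can be illustrated with a minimal sketch. It assumes that the text of the displayed document has already been obtained by optical character recognition; the detection and recognition steps are not shown. The stop-word list, the frequency-based keyword selection, the `extract_keywords` function, the `RecordingIndex` class, and the identifier "conf-001" are all hypothetical choices made for illustration, not a description of any particular analysis tool.

```python
import re
from collections import Counter, defaultdict

# Assumed, deliberately small stop-word list for illustration only.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "for", "to", "in", "is", "on", "with"}
)


def extract_keywords(text, max_keywords=5):
    """Pick the most frequent non-stop-words as keywords (an assumed heuristic)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_keywords)]


class RecordingIndex:
    """Hypothetical index that associates keywords with whole recordings."""

    def __init__(self):
        self._by_keyword = defaultdict(set)

    def tag(self, recording_id, keywords):
        """Associate each keyword with the given recording."""
        for keyword in keywords:
            self._by_keyword[keyword].add(recording_id)

    def search(self, *terms):
        """Return the recordings tagged with every one of the search terms."""
        sets = [self._by_keyword.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


# Text assumed to have been produced by OCR on one video frame.
ocr_text = (
    "Quarterly revenue forecast: revenue for the region "
    "and forecast of revenue growth"
)

keywords = extract_keywords(ocr_text)
index = RecordingIndex()
index.tag("conf-001", keywords)
results = index.search("revenue", "forecast")
```

Note that this sketch associates keywords with a recording as a whole, so a successful search identifies the recording but not the position within the recording at which the document was displayed.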
Unfortunately, systems that automatically generate after-the-fact metadata for conference recordings are computationally intensive and require tedious calibration to recognize the contents of the multimedia streams. Typically, such systems are neither robust nor scalable enough for large-scale deployments. Consequently, searches of conference recordings based on existing technologies are not effective. Searches often return the conference recordings whose titles, descriptions, and tags match the search strings. Even with intelligent systems as described above, a search result is often the entire video stream of a conference recording, which is coarse-grained. Specifically, even when users are lucky enough to locate, based on a comparison between their search terms and automatically-generated metadata, a conference recording that has information in which they are interested, the users are still typically required to scan through the video of the conference recording to identify the relevant sections. When the conference is long, scanning the video to locate the relevant sections may be extremely tedious and time-consuming.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.