Today's media production tasks, and especially the production of multimedia streams, benefit from text labeling of media and especially of media clips. Such labeling facilitates the organization and retrieval of media and media clips for playback and/or editing during production. This benefit is especially pronounced in the production of composite media streams, such as a news broadcast composed of multiple media clips, still frame images, and other media recordings.
In the past, such text labels, or tags, have been inserted by a technician examining captured media in a booth a considerable time after the media was captured with a portable media capture device, such as a video camera. This intermediate step between capture of media and production of a composite multimedia stream is both expensive and time consuming. Therefore, it would be advantageous to eliminate this step by using speech recognition, so that a user of a media capture device can insert tags by voice immediately before, during, and/or immediately after a media capture activity.
Voice-driven tagging of this kind has been addressed in part with respect to still cameras that employ speech recognition to tag still images. However, the limited speech recognition capabilities typically available on portable media devices prove problematic, such that high-quality, meaningful tags may not be reliably generated. Also, the tagging of relevant portions of multimedia streams has not been adequately addressed. As a result, the need remains for a solution that provides high-quality, meaningful tagging of captured media on board a media capture device with limited speech recognition capability, and that is suitable for use with multimedia streams. The present invention provides such a solution.