The present invention is related to the use of speech recognition in data capture, processing, editing, display, retrieval and playback. The invention is particularly useful for capture, authoring, and playback of synchronized audio and video data.
While speech recognition technology has been developed over several decades, there are few applications in which speech recognition is commonly used, except for voice-assisted operation of computers or other equipment, and for transcription of speech into text, for example, in word processors.
Use of speech recognition with synchronized audio and video has been primarily for developing searchable indexes of video databases. Such systems are shown in, for example: “Automatic Content Based Retrieval Of Broadcast News,” by M. G. Brown et al. in Proceedings of the ACM International Multimedia Conference and Exhibition 1995, pages 35-43; “Vision: A Digital Video Library,” by Wei Li et al., Proceedings of the ACM International Conference on Digital Libraries 1996, pages 19-27; “Speech For Multimedia Information Retrieval,” by A. G. Hauptmann et al. in Proceedings of the 8th ACM Symposium on User Interface and Software Technology, pages 79-80, 1995; “Keyword Spotting for Video Soundtrack Indexing,” by Philippe Gelin, in Proceedings of ICASSP '96, pages 299-302, May 1996; U.S. Pat. No. 5,649,060 (Ellozy et al.); U.S. Pat. No. 5,199,077 (Wilcox et al.); “Correlating Audio and Moving Image Tracks,” IBM Technical Disclosure Bulletin No. 10A, Mar. 1991, pages 295-296; U.S. Pat. No. 5,564,227 (Mauldin et al.); “Speech Recognition In The Informedia Digital Video Library: Uses And Limitations,” by A. G. Hauptmann in Proceedings of the 7th IEEE Int'l. Conference on Tools with Artificial Intelligence, pages 288-294, 1995; “A Procedure For Automatic Alignment Of Phonetic Transcriptions With Continuous Speech,” by H. C. Leung et al., Proceedings of ICASSP '84, pages 2.7.1-2.7.3, 1984; European Patent Application 0507743 (Stenograph Corporation); “Integrated Image And Speech Analysis For Content Based Video Indexing,” by Y-L. Chang et al., Proceedings of Multimedia 96, pages 306-313, 1996; and “Four Paradigms for Indexing Video Conferences,” by R. Kazman et al., in IEEE Multimedia, Vol. 3, No. 1, Spring 1996, pages 63-73, all of which are hereby incorporated by reference.
Current technology for editing multimedia programs, such as synchronized audio and video sequences, includes systems such as the Media Composer and Film Composer systems from Avid Technology, Inc. of Tewksbury, Massachusetts. Some of these systems use time lines to represent a video program. However, management of the available media data may involve a time-intensive manual logging process. This process may be difficult where a script, and notations from the script, are used, for example, on a system such as shown in U.S. Pat. No. 4,476,994 (Ettlinger). Beyond mere indexing, there are many other uses for speech recognition that may assist in the capture, authoring and playback of synchronized audio and video sequences using such tools for production of motion pictures, television programs and broadcast news.
Audio associated with a video program, such as an audio track or live or recorded commentary, may be analyzed to recognize or detect one or more predetermined sound patterns, such as words or sound effects. The recognized or detected sound patterns may be used to enhance video processing, by controlling video capture and/or delivery during editing, or to facilitate selection of clips or splice points during editing.
For example, sound pattern recognition may be used in combination with a script to automatically match video segments with portions of the script that they represent. The script may be presented on a computer user interface to allow an editor to select a portion of the script. Matching video segments, having the same sound patterns for either speech or sound effects, can be presented as options for selection by the editor. These options also may be considered to be equivalent media, although they may not come from the same original source or have the same duration.
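By way of illustration only, the matching of a selected script portion to video segments may be sketched as follows. The sketch assumes that a recognizer has already produced, for each clip, a list of recognized words with start and end times; the names RecognizedWord, Clip, normalize and find_matches are hypothetical and are not taken from any particular system. An exact word-sequence comparison is used here for simplicity, whereas a practical system would likely need to tolerate recognition errors.

```python
from dataclasses import dataclass
from typing import List, Tuple

PUNCTUATION = ".,;:!?\"'"


@dataclass
class RecognizedWord:
    # Hypothetical recognizer output: one word with its timing in the clip.
    text: str
    start: float  # seconds from the start of the clip
    end: float


@dataclass
class Clip:
    name: str
    words: List[RecognizedWord]


def normalize(text: str) -> List[str]:
    """Lower-case the text and strip surrounding punctuation from each word."""
    cleaned = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in cleaned if w]


def find_matches(script_portion: str, clips: List[Clip]) -> List[Tuple[str, float, float]]:
    """Return (clip name, start, end) for every place the selected script
    words occur as a contiguous run in a clip's recognized words."""
    target = normalize(script_portion)
    matches = []
    if not target:
        return matches
    n = len(target)
    for clip in clips:
        tokens = [w.text.lower() for w in clip.words]
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                matches.append((clip.name, clip.words[i].start, clip.words[i + n - 1].end))
    return matches
```

Each returned tuple identifies a candidate segment that may be offered to the editor. Because the segments come from different clips, their durations may differ even though they match the same portion of the script.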
Sound pattern recognition also may be used to identify possible splice points in the editing process. For example, an editor may look for a particular spoken word or sound, rather than the mere presence or absence of sound, in a sound track in order to identify an end or beginning of a desired video segment.
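As a minimal sketch of this use, assuming the sound track has already been analyzed into a list of (text, start, end) tuples, candidate splice points may be collected at each occurrence of a chosen word. The function name find_splice_points and the tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple


def find_splice_points(words: List[Tuple[str, float, float]],
                       target: str,
                       edge: str = "start") -> List[float]:
    """Return candidate splice times (seconds) at each occurrence of `target`.

    `words` holds (text, start, end) tuples, a hypothetical form of
    recognizer output. `edge` selects whether the splice falls where the
    word begins ("start") or where it finishes ("end").
    """
    if edge not in ("start", "end"):
        raise ValueError("edge must be 'start' or 'end'")
    target = target.lower()
    points = []
    for text, start, end in words:
        if text.lower() == target:
            points.append(start if edge == "start" else end)
    return points
```

An editor looking for the beginning of a desired segment might, for example, request the end time of each occurrence of a cue word, so that the segment starts immediately after the cue is spoken.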
The presence of a desired sound or word in an audio track also may be used in the capturing process to identify the beginning or end of a video segment to be captured or may be used to signify an event which triggers recording. The word or sound may be identified in the audio track using sound pattern recognition. The desired word or sound also may be identified in a live audio input from an individual providing commentary either for a video segment being viewed, perhaps during capture, or for a live event being recorded. The word or sound may be selected, for example, from the script, or based on one or more input keywords from an individual user. For example, a news editor may capture satellite feeds automatically when a particular segment includes one or more desired keywords. When natural breaks in the script are used, video may be divided automatically into segments or clips as it is captured.
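The triggering of capture by recognized words may be illustrated, under stated assumptions, by the following sketch. It assumes a time-ordered stream of (time, word) events from a recognizer and hypothetical sets of start and stop keywords; the function capture_segments is an illustrative name, and actual control of a recording device is outside the scope of the sketch.

```python
from typing import Iterable, List, Set, Tuple


def capture_segments(events: Iterable[Tuple[float, str]],
                     start_words: Set[str],
                     stop_words: Set[str]) -> List[Tuple[float, float]]:
    """Return (start, end) times of segments delimited by spoken keywords.

    `events` is a time-ordered stream of (time in seconds, recognized word).
    Recording opens on a start keyword and closes on the next stop keyword.
    A segment still open when the stream ends is not reported.
    """
    start_words = {w.lower() for w in start_words}
    stop_words = {w.lower() for w in stop_words}
    segments = []
    open_start = None  # time at which the current segment began, if any
    for time, word in events:
        w = word.lower()
        if open_start is None:
            if w in start_words:
                open_start = time
        elif w in stop_words:
            segments.append((open_start, time))
            open_start = None
    return segments
```

The same loop could be driven by keywords taken from a script or entered by a user, so that each pair of keywords divides the incoming video into a separate clip as it is captured.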
Speech recognition also may be used to provide for logging of material by an individual. For example, a live audio input from an individual providing commentary, either for a video segment being viewed or for a live event being recorded, may be recorded and analyzed for desired words. This commentary may be based on a small vocabulary, such as commonly used for logging of video material, and may be used to index the material in a database.
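For illustration, indexing by a small logging vocabulary may be sketched as below. The sketch assumes the commentary has been recognized into (time, word) pairs and uses an in-memory dictionary in place of a database; the function name build_log_index and the sample vocabulary are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_log_index(commentary: Iterable[Tuple[float, str]],
                    vocabulary: Set[str]) -> Dict[str, List[float]]:
    """Map each logging term to the times (seconds) at which it was spoken.

    Words outside the small logging vocabulary are ignored, so ordinary
    speech in the commentary does not clutter the index.
    """
    vocabulary = {term.lower() for term in vocabulary}
    index = defaultdict(list)
    for time, word in commentary:
        key = word.lower()
        if key in vocabulary:
            index[key].append(time)
    return dict(index)


def lookup(index: Dict[str, List[float]], term: str) -> List[float]:
    """Return the logged times for a term, or an empty list if none exist."""
    return index.get(term.lower(), [])
```

A later search for a logging term then returns the positions in the material at which the commentator spoke that term.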