This invention generally relates to indexing files, and more specifically, pertains to the use of speech recognition of MPEG/Audio encoded files to permit indexing the files in regard to content.
The Moving Picture Experts Group (MPEG) set of standards has become an enabling technology for many of today's multimedia applications. MPEG encoded video facilitates the transmission of movies and other multimedia works over networks and the efficient storage of video data in a digital library. As video libraries for storing multimedia works become common and grow in size, the need to automatically and consistently index the content of videos becomes increasingly important. A video that has been richly indexed can open a host of new applications. For example, knowing the location in a video where a favorite actor speaks or where a keyword is spoken can enable a variety of large scale implementations of hyperlinked video and facilitate new training and educational programs.
An audio track accompanying a video contains an enormous amount of information, some of which can be inferred by the human listener. For example, by listening to the audio portion independent of the video action, in addition to understanding what is being said, humans can infer who is doing the talking. Other types of information that can be inferred are gender of the speaker, language, emotion, and the style of the video material, e.g., dramatic or comedic. Therefore, a great opportunity lies in the automatic extraction of these audio elements so that they can be used as indices in framing queries and in effecting the retrieval of associated video segments.
There have been a great number of publications on work related to video indexing. The majority are focused on scene cuts and camera operations such as pan and zoom. Knowing only where a scene cut is or when a camera movement occurs in a video is not enough to develop a usable index of the video based on content. The practical application of indexing, particularly for non-professional applications, requires more content than has been provided by the prior art. Clearly, it is more likely that a home video user will employ an index to access a particular type of program in a video library rather than to locate a particular kind of camera shot.
Complete, general, and unrestricted automatic video indexing based on the analysis of the accompanying audio may not be possible for some time. Nevertheless, with restrictions, current speech recognition technology can be successfully applied to automatic video indexing for domain specific applications. For example, in a sports video of a football game, a small set of keywords can be used and recorded onto the audio track to describe plays and formations such as "fake punt" and "I-32 wide right." Being able to recognize these connected words in the accompanying audio can lead to a very useful index that allows a football coach to select desired plays and go directly to the associated video segment during a training session or analysis of the game recorded on the video.
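As an illustration of how recognized keywords could serve as an index into the video, the following minimal sketch maps each keyword to the times at which it was spoken. The function names and the `(start_seconds, keyword)` input format are assumptions made for this example and are not taken from the described system.

```python
from bisect import bisect_right
from collections import defaultdict


def build_keyword_index(recognized):
    """Map each keyword to the sorted start times (seconds) where it occurs.

    `recognized` is an iterable of (start_seconds, keyword) pairs, a
    hypothetical output format for a keyword recognizer.
    """
    index = defaultdict(list)
    for start, keyword in recognized:
        index[keyword.lower()].append(start)
    for times in index.values():
        times.sort()
    return dict(index)


def next_occurrence(index, keyword, after=0.0):
    """Return the first start time of `keyword` later than `after`, or None."""
    times = index.get(keyword.lower(), [])
    pos = bisect_right(times, after)
    return times[pos] if pos < len(times) else None
```

With such an index, a viewer could jump to the next "fake punt" after a given playback position by calling `next_occurrence(index, "fake punt", after=position)` and seeking the video to the returned time.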
From the foregoing discussion, it will be apparent that a technique capable of recognizing continuously uttered words in compressed MPEG/Audio for use as indices would be a valuable tool. Working in the compressed domain has a number of benefits. It saves computation time, since decompression is not necessary, and it saves disk space, since no intermediate file is created (e.g., as would be required if the audio track on the MPEG files had to be decompressed first).
There is not much published work on the direct analysis of the audio accompanying video material for speech recognition in video indexing. A system has been described to determine voiced, unvoiced, and silent classification directly on MPEG/Audio, and various time domain speech analysis techniques such as the short-time energy function for endpoint detection have been described in the prior art. These initial efforts contributed to simple video indexing based on audio, but the extracted information has been limited to the characteristics of the audio signal and not its content. However, just knowing where voicing and silence occur is not enough. Additional information is needed to build useful content-based video indices.
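For reference, the short-time energy function mentioned above can be sketched as follows. This is a generic time domain illustration that operates on decoded samples, not the cited prior art system; the frame length, hop size, and fixed threshold are assumptions chosen for the example.

```python
def short_time_energy(samples, frame_len=256, hop=128):
    """Return the energy (sum of squares) of each frame of `samples`."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def find_endpoints(energies, threshold):
    """Return (first, last) indices of frames whose energy exceeds `threshold`.

    Returns None when no frame exceeds the threshold.
    """
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

Such a detector reports only where sound is present, which is the limitation noted above: it describes the signal, not what was said.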
If the video contains a closed-captioned signal, an index can be readily made from the words extracted from the signal. A prior art system that makes use of this technique uses the closed-captioned signal to generate pictorial transcripts of the video. Unfortunately, only a limited amount of video material comes with closed captioning. In another prior art system, the Sphinx-II speech recognition system has been applied to a closed-captioned signal. However, when closed captioning was not available in this system, speech recognition was used exclusively. The reported word error rate ranged from ~8-12% for speech benchmark evaluation to ~85% for commercials.
Recently, an integrated approach of using image and speech analysis for video indexing has been described in the art. The video and audio are separated, with video encoded into AVI and audio into WAV files. A speech recognition system based on template matching was used in this system. However, using template matching for speech recognition is crude, as it has problems dealing with the variability that exists in speech.
Unlike the prior art described above, it would be preferable to derive speech features for recognition from the compressed MPEG/Audio domain, and to extract higher level indices that go beyond the characteristics of the audio signal by applying speech recognition. Accuracy of the process should be enhanced by using a robust technique and by restricting the scope for recognition to specific applications (i.e., not for general speech recognition).
(It will be understood that MPEG-1 was used in developing the present invention, and as used throughout this description, the term "MPEG/Audio" will mean MPEG-1 and MPEG-2/Audio.) The underlying speech recognizer employed in a preferred form of the invention is based on the hidden Markov model. Decompression is not required, as training and recognition are performed based on the extracted subbands of the MPEG/Audio files.
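By way of illustration only, one way to turn extracted subbands into a feature vector is to take the log energy of each subband in a frame. The choice of log energy as the feature is an assumption of this sketch and is not stated in the description above. MPEG-1 audio divides the signal into 32 subbands, with 12 samples per subband in a Layer I frame and 36 in a Layer II frame. Parsing the bitstream is not shown; the hypothetical input `frame` is assumed to already hold the dequantized subband samples.

```python
import math

NUM_SUBBANDS = 32  # MPEG-1 audio splits the signal into 32 subbands


def subband_log_energies(frame, floor=1e-10):
    """Return one log-energy value per subband for a single frame.

    `frame` is a list of 32 lists, each holding the dequantized samples of
    one subband (hypothetical input; bitstream parsing is not shown).
    `floor` keeps the logarithm finite for silent subbands.
    """
    if len(frame) != NUM_SUBBANDS:
        raise ValueError("expected %d subbands" % NUM_SUBBANDS)
    features = []
    for band in frame:
        mean_square = sum(s * s for s in band) / max(len(band), 1)
        features.append(math.log(mean_square + floor))
    return features
```

Because the subband samples are read from the encoded frames, no synthesis filterbank is run and no intermediate decoded audio file is written.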
In accord with the present invention, a method is defined for recognizing speech in an MPEG file. The method includes the step of determining speech features of the MPEG file without first decompressing the MPEG file. Using the speech features, a speech recognition model is trained to recognize words in a different MPEG file. The speech recognition model is then employed to recognize words spoken in an audio portion of the different MPEG file, also without first decompressing that MPEG file.
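The train-then-recognize flow of this method can be sketched with a deliberately simplified left-to-right hidden Markov model. The sketch is not the preferred implementation: it assumes unit-variance Gaussian emissions, fixed transition probabilities, and a flat-start training step that splits each example into equal segments, where a full trainer would refine the model with Baum-Welch re-estimation. All function names are hypothetical, and the feature sequences are assumed to have been extracted already.

```python
import math


def train_word_model(examples, num_states=3):
    """Estimate one mean vector per state from example feature sequences.

    Each example is split into `num_states` equal-length segments
    (flat-start initialization; no Baum-Welch refinement is performed).
    """
    dim = len(examples[0][0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for seq in examples:
        for t, vec in enumerate(seq):
            state = min(t * num_states // len(seq), num_states - 1)
            counts[state] += 1
            for d in range(dim):
                sums[state][d] += vec[d]
    return [[s / max(counts[k], 1) for s in sums[k]] for k in range(num_states)]


def log_emission(vec, mean):
    """Log density of a unit-variance Gaussian, up to an additive constant."""
    return -0.5 * sum((x - m) ** 2 for x, m in zip(vec, mean))


def viterbi_score(seq, means, log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log score of `seq` through a left-to-right model."""
    neg_inf = float("-inf")
    n = len(means)
    score = [neg_inf] * n
    score[0] = log_emission(seq[0], means[0])  # paths start in the first state
    for vec in seq[1:]:
        new = [neg_inf] * n
        for k in range(n):
            stay = score[k] + log_stay
            move = score[k - 1] + log_next if k > 0 else neg_inf
            new[k] = max(stay, move) + log_emission(vec, means[k])
        score = new
    return score[-1]  # paths must end in the final state


def recognize(seq, models):
    """Return the word whose model gives `seq` the highest Viterbi score."""
    return max(models, key=lambda word: viterbi_score(seq, models[word]))
```

In use, `models` would be a dictionary such as `{"punt": train_word_model(punt_examples), ...}` built from features of one file, and `recognize` would then be applied to feature sequences taken from a different file.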