The evolution of video recording devices over the past decades has brought various opportunities to record and store video. In the past, most video was recorded on video cassettes. The majority of recording media then shifted to optical discs such as CDs and DVDs. Currently, hard disk drives (HDDs) are favored for storing multimedia material because of their declining price. The price decline of HDDs has promoted the development of video recording devices capable of recording multiple broadcast TV programs simultaneously.
Thanks to the evolution of these video recording devices, today's consumers can record and store much more video material than before. This creates a scarcity of viewing time: the time available to today's consumers to play back the recorded material is limited. Thus there is a strong demand to watch video in a shorter time. Many studies have been made in this area, and there are two approaches to this problem. One approach accelerates the playback speed, which is straightforward. The other approach detects and extracts only the scenes of the video program that contain important events. Skipping unimportant scenes at playback saves time. The present invention relates to this second approach.
In this technique, every scene of the video needs to be evaluated correctly, and classification is essential for such an evaluation. Most conventional techniques use various audio and video characteristics of each scene. Video signal processing is believed to be much more complex than audio signal processing because of the number of samples processed per unit time. Among the various characteristics of an audio signal, the short-time energy is the simplest feature.
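By way of illustration only, one common textbook definition of the short-time energy may be written as follows. The symbols are assumptions introduced here and do not appear in the original text: x[m] denotes the audio sample at index m, N denotes the number of samples in the analysis frame, and E_n denotes the short-time energy of the frame ending at sample n, using a rectangular window.

```latex
E_n = \sum_{m = n - N + 1}^{n} x[m]^2
```

Other window shapes and normalizations (for example, dividing by N) are equally possible; the form above is merely the simplest.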
The prior art provides audio energy based scene classification. This prior art technique divides the frequency spectrum into several sub-bands and determines the short-time energy of each sub-band. Significant events in sports videos, such as scoring opportunities and fine plays, tend to be strongly correlated with the instantaneous audio signal energy, because cheers and applause from the audience and excited speech from announcers tend to occur during such events. Extracting the scenes with high audio energy therefore tends to yield an abridgment of the whole video material.
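The sub-band energy computation described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the prior art implementation: the function names, the frame length, the equal-width sub-bands, and the selection rule (a frame is flagged when its summed sub-band energy exceeds a multiple of the mean) are all hypothetical choices made for this example. A naive discrete Fourier transform is used only to keep the sketch free of external libraries.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop.

    A trailing partial frame is dropped.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def subband_energies(frame, num_bands):
    """Return the energy of one frame in each of num_bands sub-bands.

    Assumption: sub-bands are equal-width groups of DFT bins covering the
    lower half of the spectrum; bins left over when the half-spectrum is
    not divisible by num_bands are ignored.
    """
    n = len(frame)
    half = n // 2
    # Naive O(n^2) DFT over bins 0 .. half-1.
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(half)
    ]
    power = [abs(c) ** 2 / n for c in spectrum]
    width = half // num_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]


def high_energy_frames(samples, frame_len=64, hop=64, num_bands=4, ratio=2.0):
    """Return indices of frames whose summed sub-band energy is high.

    Hypothetical rule: a frame is flagged when its total energy exceeds
    ratio times the mean total energy over all frames.
    """
    frames = frame_signal(samples, frame_len, hop)
    if not frames:
        return []
    totals = [sum(subband_energies(f, num_bands)) for f in frames]
    mean_total = sum(totals) / len(totals)
    return [i for i, e in enumerate(totals) if e > ratio * mean_total]
```

In such a sketch, a quiet signal containing one loud burst would be expected to yield only the frame containing the burst. A practical system would likely weight the sub-bands differently and use a fast Fourier transform or a filter bank instead of the naive transform shown here.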
This audio energy based technique tends to over-detect non-significant scenes, such as short pre-game and post-game TV programs or noisy commercial messages. These portions of the video occasionally contain loud music or artificial sound effects, which are rarely found during play in an ordinary sports TV program. The prior art provides various audio-based and video-based algorithms to avoid detecting such unwanted scenes. Unfortunately, those algorithms require far more computational resources than the audio energy based technique. Such algorithms are therefore undesirable for portable devices having limited computational resources and battery energy.
It is therefore desirable to detect scenes with significant events with reasonable accuracy using an audio energy based technique while avoiding unwanted scenes.