Among television broadcasts, there is a genre called the “song program” or “music program”. In many cases, a music program consists of video of performers singing or playing instruments, or of music video streams (hereinafter referred to as a “music scene”; note that in the present specification a music scene denotes video as a whole whose audio contains music such as singing or instrumental performance), and of video other than music, such as introductions of music pieces by the host and others and talk (conversation) segments between the host and the performers (hereinafter referred to as a “non-music scene”).
In commercial broadcasting, a program may also include commercial messages advertising the program sponsors or the broadcaster itself (hereinafter referred to as a “CM broadcast”; a segment in which a CM broadcast is aired is referred to as a “CM broadcast segment”).
When playing back a recorded music program, a viewer who wants to concentrate on the music wants to efficiently skip everything other than music scenes, such as non-music scenes and CM broadcasts. Conversely, a viewer who is not interested in the music wants to skip music scenes and CM broadcasts and view only non-music scenes such as talk segments.
To meet such requests, a conventional method of detecting and recording music identifies music scenes by making use of the feature that peaks in the frequency spectrum of the sound information are temporally stable in frequency, and stores only the audio and video attributed to the music scenes (for example, refer to Patent Document 1).
Patent Document 1: Japanese Patent Application Publication No. H11-266435 (FIG. 1 on page 5)
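For illustration only, the feature mentioned above (spectral peaks that stay at the same frequency over time) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation disclosed in Patent Document 1: the function names `peak_bin`, `peak_stability`, and `looks_like_music`, the frame length of 64 samples, the tolerance of one frequency bin, and the threshold of 0.8 are all hypothetical choices made for this example, and a naive discrete Fourier transform is used so that the code needs only the Python standard library.

```python
import cmath
import math


def peak_bin(frame):
    """Return the index of the strongest non-DC bin of a naive DFT of one frame."""
    n = len(frame)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin


def peak_stability(samples, frame_len=64, tolerance=1):
    """Fraction of consecutive frame pairs whose peak bin moves by at most `tolerance` bins.

    frame_len and tolerance are illustrative values, not taken from any cited document.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    peaks = [peak_bin(f) for f in frames]
    if len(peaks) < 2:
        return 0.0
    stable = sum(abs(a - b) <= tolerance for a, b in zip(peaks, peaks[1:]))
    return stable / (len(peaks) - 1)


def looks_like_music(samples, threshold=0.8):
    """Single-feature decision: treat a segment as music if its peaks are stable enough."""
    return peak_stability(samples) >= threshold


if __name__ == "__main__":
    # A steady tone keeps its spectral peak in the same bin in every frame.
    tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(640)]
    # A tone that jumps between two distant frequencies every frame does not.
    hopping = [math.sin(2 * math.pi * (4 if (i // 64) % 2 == 0 else 20) * (i % 64) / 64)
               for i in range(640)]
    print(peak_stability(tone), peak_stability(hopping))
```

Because the decision rests on a single feature and a single threshold, the result depends heavily on how strongly tonal the audio is.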
However, in the method of detecting music disclosed in Patent Document 1, music scenes are discriminated with only a single technique, so it is difficult to ensure equal detection accuracy across the whole variety of music with its various tones, such as rock, popular ballads, and classical music.
The present invention has been made to resolve the problem described above, and provides a method and a device for efficiently detecting music scenes from data containing a video signal and an audio signal of a television broadcast or the like.