1. Field of the Invention
The present invention relates to signal processing methods and apparatuses, signal processing programs, and recording media, and in particular, to a signal processing method and apparatus for evaluating similarity between different sections of at least one audio signal which include identical or similar audio signal components, a signal processing program used therewith, and a recording medium containing the signal processing program.
2. Description of the Related Art
In general, an audio signal consists of a plurality of signal components which are simultaneously or consecutively superimposed on one another. Even different audio signals, however, may include identical or similar signal components. For example, in television and radio broadcasting, there are many cases in which common background music is superimposed on different conversations or narrations.
Also, in many cases, a common voice, piece of music, or sound effect is used at the start or end of each program in a broadcast series. Moreover, a single company may use common audio-signal components in commercials for different products so that customers can be informed that the products are produced by that company.
As described above, in many cases, common audio-signal components are used in the background of scenes (sections of a video/audio signal) that are related to each other. Therefore, if a partially identical or similar portion of the audio signal can be detected, it is possible to perform high-speed retrieval of a scene related to another, such as a similar audio-signal portion, a video-signal portion accompanying it, a related scene in a program, a scene in a series of programs, or a commercial of a single company.
Technologies that compare an input signal with a prerecorded signal and determine whether the signals are identical include, for example, a technology using correlation between an audio signal and its spectrum, and the technology disclosed in Japanese Unexamined Patent Application Publication No. 2000-312343.
In the above technology using correlation, two signals are correlated with each other while the time offset between them is shifted, and the correlation at the shift that gives the maximum value is used to evaluate similarity. This works when the components that differ between the audio signals, or between the spectra of the signals, are sufficiently weak. However, when the different components are not sufficiently weak, appropriate evaluation of similarity cannot be performed.
Also, in the technology disclosed in Japanese Unexamined Patent Application Publication No. 2000-312343, only an audio signal that is identical to the recorded signal, apart from differences caused by some noise, can be detected.
Accordingly, it is impossible for the above technologies to detect music which is used as background music of a program, or audio-signal components used in different commercials of a single company, as described above.
In order to determine whether an input signal is identical to a prerecorded signal by comparing both signals, a common method of the related art correlates the signals with each other while shifting them relative to each other in time, and evaluates similarity based on the correlation value at the shift at which the maximum correlation is obtained. This method has a problem in that it cannot evaluate similarity accurately when the components that differ between the audio signals, or between the spectra of the signals, are not sufficiently weak. The following is a specific description.
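The shifted-correlation method described above can be illustrated with a minimal sketch. It is not taken from any cited publication: the function names, the random stand-in for the shared component, and the two noise levels are assumptions chosen only for illustration. The sketch slides a template over a longer signal, keeps the shift with the highest normalized correlation, and compares a case with weak different components against a case with strong ones.

```python
import math
import random


def normalized_correlation(x, y):
    """Mean-removed, normalized correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den > 0 else 0.0


def max_shifted_correlation(template, signal):
    """Slide the template over the signal; return (best_shift, best_correlation)."""
    best_shift, best_corr = 0, -1.0
    for shift in range(len(signal) - len(template) + 1):
        window = signal[shift:shift + len(template)]
        corr = normalized_correlation(template, window)
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr


rng = random.Random(0)  # fixed seed so the toy data are reproducible

# Hypothetical shared component (a stand-in for, e.g., a common jingle).
common = [rng.gauss(0.0, 1.0) for _ in range(200)]
template = common[50:100]  # template cut from the shared component


def mix(level):
    """Shared component plus a different component of the given strength."""
    return [c + level * rng.gauss(0.0, 1.0) for c in common]


# Weak different component: the peak is sharp and close to 1.
shift_weak, corr_weak = max_shifted_correlation(template, mix(0.05))
# Strong different component: the peak value is pulled down.
shift_strong, corr_strong = max_shifted_correlation(template, mix(2.0))

print("weak:  ", shift_weak, round(corr_weak, 3))
print("strong:", shift_strong, round(corr_strong, 3))
```

In this toy setup the shared component is present in both mixtures at the same position, yet the maximum correlation is lower when the superimposed different component is strong, which is the situation in which the correlation value stops being a reliable measure of similarity.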
The short-time spectrum distributions (so-called “spectrograms”) of three audio signals (signal A, signal B, and signal C) are shown in FIGS. 1 to 3, respectively. These distributions are obtained from the last two seconds of actually broadcast commercials. Signal A and signal B represent commercials for different products of a single company, and signal C represents a commercial of another company.
As FIGS. 1 and 2 show, signal A and signal B include acoustically similar components which give an idea of the company, but signal C in FIG. 3 does not include such a component. From the comparison between the signal-A spectrum distribution and the signal-B spectrum distribution, similar components and superimposition of different components are observed, though both distributions have a temporal shift.
Regarding the three audio signals, the results of correlative calculation on the spectrum of signal B and the spectrum of signal C, which are obtained with sections of the spectrum of signal A used as templates, are shown in FIGS. 4A and 4B. The templates are 0.5-second sections that start at the 0-second position, 0.25-second position, 0.5-second position, 0.75-second position, 1-second position, 1.25-second position, and 1.5-second position of signal A.
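The template layout described above (0.5-second sections taken every 0.25 seconds from a 2-second signal) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the frame rate, the number of frequency bins, the random toy spectrogram, and the delay of the target are all hypothetical and do not correspond to the signals in the figures. Each template is a block of consecutive spectrogram frames, and it is slid along the time axis of a target spectrogram while a normalized correlation is computed at each offset.

```python
import math
import random


def template_starts(total_s, template_s, step_s):
    """Start times (seconds) of all templates that fit inside the signal."""
    count = int(round((total_s - template_s) / step_s)) + 1
    return [i * step_s for i in range(count)]


def correlation(a, b):
    """Mean-removed, normalized correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    den = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(p * q for p, q in zip(da, db)) / den if den > 0 else 0.0


def flatten(frames):
    """Turn a list of frames (each a list of bin values) into one flat list."""
    return [v for frame in frames for v in frame]


def best_match(template, target):
    """Slide the template along the target's time axis; return (offset, corr)."""
    flat_template = flatten(template)
    best_offset, best_corr = 0, -1.0
    for offset in range(len(target) - len(template) + 1):
        window = flatten(target[offset:offset + len(template)])
        corr = correlation(flat_template, window)
        if corr > best_corr:
            best_offset, best_corr = offset, corr
    return best_offset, best_corr


# All values below are hypothetical and chosen only for illustration.
FRAMES_PER_SECOND = 8   # assumed spectrogram frame rate
BINS = 4                # assumed number of frequency bins per frame
DELAY = 3               # target is the source delayed by this many frames

rng = random.Random(1)
source = [[rng.random() for _ in range(BINS)]
          for _ in range(2 * FRAMES_PER_SECOND)]          # 2-second toy spectrogram
target = [[0.0] * BINS for _ in range(DELAY)] + source    # delayed copy

starts = template_starts(2.0, 0.5, 0.25)                  # 0, 0.25, ..., 1.5
length = int(round(0.5 * FRAMES_PER_SECOND))              # frames per template

results = []
for start_s in starts:
    first = int(round(start_s * FRAMES_PER_SECOND))
    offset, corr = best_match(source[first:first + length], target)
    results.append((start_s, offset, corr))

for start_s, offset, corr in results:
    print(f"template at {start_s:.2f} s -> frame offset {offset}, corr {corr:.3f}")
```

Because the toy target is an exact delayed copy of the source, every template finds its own position. With real signals the different components superimposed on the shared ones lower the peak, and the highest peak over all templates is what is compared between candidate signals.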
FIG. 4A shows the result of correlative detection on the signal-B spectrum. This is obtained by using the template starting at the 1.5-second position of the signal-A spectrum. FIG. 4B shows the result of correlative detection on the signal-C spectrum. This is obtained by using the template starting at the 0.75-second position of the signal-A spectrum.
The maximum correlation between signal A and signal B is 0.657, while the maximum correlation between signal A and signal C is 0.642, as indicated by the arrows in FIGS. 4A and 4B, so that there is almost no difference between the two. This is not because signal A and signal C are similar to each other, but because the different components between signal A and signal B are not weak; both pairs reach a maximum correlation of only approximately 0.65.
As described above, the correlation method is not always suitable for detection, classification, and retrieval of similar scenes.