In broadcasting systems, multimedia systems, and the like, it is important to manage and classify large volumes of content such as images and speech efficiently so that the content can be retrieved easily. To do so, it is indispensable to recognize the information carried by each portion of the content.
Many multimedia and broadcast contents include an audio signal along with a video signal. The audio signal is very useful for classifying content and detecting scenes. In particular, detecting the speech portions and the music portions of the audio signal separately makes efficient information retrieval and information management possible.
Many technologies for discriminating between speech and music have been studied. Proposed techniques perform the discrimination using feature quantities such as the zero-crossing count, fluctuation of power, and fluctuation of the spectrum.
For example, in the literature J. Saunders, “Real-time discrimination of broadcast speech/music”, USA, Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1996, pp. 993-996, speech and music are discriminated using the zero-crossing count.
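As an illustration of the zero-crossing feature itself, the following is a minimal sketch, not the cited author's implementation. The function names, the non-overlapping framing, and the convention that a zero-valued sample counts as non-negative are all assumptions made for this example.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples of one frame."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        # A crossing occurs when adjacent samples lie on opposite sides of zero.
        # Assumption: a sample equal to zero is treated as non-negative.
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def frame_zero_crossings(samples, frame_len):
    """Split samples into non-overlapping frames and count crossings in each.

    Trailing samples that do not fill a whole frame are ignored.
    """
    return [
        zero_crossing_count(samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

A discriminator would then apply some decision rule to the per-frame counts or to statistics of them; that rule is outside the scope of this sketch.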
Moreover, in the literature E. Scheirer & M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator”, USA, Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1997, pp. 1331-1334, 13 feature quantities, including 4 Hz modulation energy, low-energy frame rate, spectral roll-off point, spectral centroid, spectral change (flux), and zero-crossing rate, are used to discriminate between speech and music, and their respective performances are compared and evaluated.
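Two of the spectral features named above, the spectral centroid and the spectral change (flux), can be sketched as follows. This is only an illustration under stated assumptions: the exact definitions used in the cited literature may differ, the function names are hypothetical, and a naive O(N^2) discrete Fourier transform is used in place of an FFT so that the example depends only on the standard library.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), for illustration only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mags, sample_rate, frame_len):
    """Magnitude-weighted mean frequency in Hz (0.0 for a silent frame)."""
    total = sum(mags)
    if total == 0:
        return 0.0
    bin_hz = sample_rate / frame_len  # frequency spacing between DFT bins
    return sum(k * bin_hz * m for k, m in enumerate(mags)) / total


def spectral_flux(prev_mags, mags):
    """Euclidean distance between two consecutive magnitude spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_mags, mags)))
```

For a frame containing a single sinusoid, the centroid lies at the frequency of that sinusoid, and the flux between two identical frames is zero.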
Further, in the literature M. J. Carey, E. S. Parris & H. Lloyd-Thomas, “A comparison of features for speech, music discrimination”, USA, Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1999, March, pp. 149-152, cepstrum coefficients, delta cepstrum coefficients, amplitude, delta amplitude, pitch, delta pitch, zero-crossing count, and delta zero-crossing count are used as feature quantities, and a mixed normal distribution model (Gaussian mixture model) is applied to each feature quantity to discriminate between speech and music.
In addition to the above, a detection technique has also been studied that is based on the property that the spectral peaks of music persist in the time direction while remaining stable at specific frequencies. The stability of a spectral peak can also be expressed as the presence or absence of a linear component in the time direction of the spectrogram. A spectrogram is a diagram in which frequency is taken on the ordinate and time on the abscissa, and in which spectrum components are arranged in the time direction so that the spectrum is represented as image information. Examples using this feature include the literature Minami, Akutsu, Hamada & Sotomura, “Image Indexing Using Sound Information and its Application”, Transactions of the Institute of Electronics, Information and Communication Engineers D-II, 1998, Vol. J81-D-II, No. 3, pp. 529-537, and Japanese Patent Application Laid Open No. H10-187182.
By applying such a technology, which discriminates and classifies audio as speech, music, etc. at every predetermined time interval, the start and end positions of a continuous time period of the same kind or category in audio data can be detected.
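The direct way of obtaining start and end positions from per-interval discrimination results can be sketched as follows. This is a minimal illustration assuming one label per fixed-length interval; the function name, the label strings, and the `frame_sec` parameter are hypothetical and do not come from any of the cited techniques.

```python
from itertools import groupby


def labels_to_segments(labels, frame_sec):
    """Merge runs of identical per-interval labels into segments.

    Returns a list of (label, start_sec, end_sec) tuples, where each
    interval is assumed to last frame_sec seconds.
    """
    segments = []
    index = 0  # index of the first interval in the current run
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, index * frame_sec, (index + length) * frame_sec))
        index += length
    return segments
```

For example, `labels_to_segments(["music"] * 3 + ["speech"] * 2, 0.5)` yields one music segment from 0.0 to 1.5 seconds followed by one speech segment from 1.5 to 2.5 seconds. Note that with this direct approach a single differing label inside a run splits that run into three separate segments.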
However, detecting a continuous time period of the same kind by directly using the above-described technology for discriminating and classifying kinds such as speech and music has the following problems.
For example, music often consists of many musical instruments, singing voice, sound effects, rhythm from percussion instruments, and so on. Accordingly, when audio data is discriminated at short intervals, a continuous musical time period frequently contains not only portions that can be reliably discriminated as music, but also portions that would be judged as speech when viewed over a short time range, and portions that would be classified as some other kind. Similarly, when a continuous time period of conversational speech is detected, silent portions and noise such as music are frequently inserted momentarily within the continuous conversational time period. In addition, even a portion that is clearly music or speech may be assigned the wrong kind through a discrimination error. The same applies to kinds other than speech and music.
Accordingly, a method that detects continuous time periods by directly using the short-interval discrimination results for speech, music, etc. has the problem that a portion which should be regarded as one continuous time period over a long time range may be interrupted partway, or conversely that a temporary noise portion which cannot be regarded as a continuous time period over a long time range may be treated as one.
On the other hand, if the analysis time used for discrimination is lengthened to avoid this problem, the time resolution of discrimination is lowered, so that the detection rate falls when music, speech, etc. are switched frequently.
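This trade-off can be illustrated with a small sketch. As a stand-in for a longer analysis time, it assumes that one label is assigned to each longer window by majority vote over the short-interval labels inside it; this voting scheme, the function name, and the labels "M" (music) and "S" (speech) are assumptions made only for illustration.

```python
from collections import Counter


def windowed_labels(frame_labels, window):
    """Assign one label per non-overlapping window by majority vote.

    window is the number of short intervals per analysis window. If two
    labels tie within a window, the one encountered first is chosen.
    """
    out = []
    for i in range(0, len(frame_labels), window):
        chunk = frame_labels[i:i + window]
        out.append(Counter(chunk).most_common(1)[0][0])
    return out
```

With a window of three intervals, the momentary error in `["M", "M", "S", "M", "M", "M"]` is suppressed and the result is `["M", "M"]`. The same window, however, also turns `["M", "M", "S", "S", "M", "M"]` into `["M", "M"]`, so a genuine but short speech portion is no longer detected.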