1. Field of the Invention
The present invention relates to a method and apparatus for detecting speech/non-speech sections in media contents where voice, music, sound effects, and noise are mixed.
2. Discussion of the Related Art
Various voice activity detection methods have been used to detect a speech section and a non-speech section in media contents.
For example, Korean Patent Publication No. 1999-0039422 (published on Jun. 5, 1999) “A method of measuring voice activity level for G.729 voice encoder” discloses dividing a voice frame into a speech section including voice information and a non-speech section, then dividing the speech section into voiced sounds and voiceless sounds so as to encode the sounds, and then measuring the activity level of the sounds by comparing the energy of the voice frame, obtained in the process of extracting LPC parameters, with a threshold.
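The threshold comparison described above can be illustrated with a minimal sketch. It is not the method of the cited publication: the energy here is computed directly from the samples rather than obtained during LPC parameter extraction, and the function names, the 80-sample frame length, the threshold of 0.01, and the synthetic test signal are all assumptions chosen for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=80, threshold=0.01):
    """Flag a frame as active when its energy exceeds a fixed threshold.

    frame_len and threshold are illustrative values, not taken from the
    cited publication.
    """
    return [e > threshold for e in frame_energies(samples, frame_len)]


# Synthetic signal (assumed 8 kHz sampling): two silent frames,
# two frames of a 200 Hz tone, then one silent frame.
rate = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(160)]
signal = silence + tone + silence[:80]

flags = energy_vad(signal)
print(flags)
```

Because the decision depends only on frame energy, a loud non-speech sound such as this pure tone is flagged as active in the same way speech would be.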
Furthermore, Korean Patent Publication No. 10-2013-0085731 (published on Jul. 30, 2013) “A method and apparatus for detecting voice area” discloses determining a speech section and a non-speech section within voice data by using an autocorrelation value between voice frames.
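A correlation-based decision of this general kind can be sketched as follows. The exact computation used in the cited publication is not described here, so this sketch assumes one common variant: the peak normalized autocorrelation within a single frame over a range of lags, compared with a threshold. The function names, lag range, threshold, and test frames are hypothetical.

```python
import math
import random


def normalized_autocorr(frame, lag):
    """Normalized correlation between a frame and itself shifted by lag."""
    n = len(frame) - lag
    num = sum(frame[i] * frame[i + lag] for i in range(n))
    e0 = sum(frame[i] ** 2 for i in range(n))
    e1 = sum(frame[i + lag] ** 2 for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0 else 0.0


def is_periodic(frame, min_lag=20, max_lag=100, threshold=0.5):
    """Treat a frame as voiced-like when its peak autocorrelation is high.

    The lag range and threshold are illustrative assumptions.
    """
    peak = max(normalized_autocorr(frame, lag)
               for lag in range(min_lag, max_lag + 1))
    return peak > threshold


# A strongly periodic frame (period of 40 samples) and a noise frame.
tone_frame = [math.sin(2 * math.pi * n / 40) for n in range(320)]
rng = random.Random(0)
noise_frame = [rng.gauss(0.0, 1.0) for _ in range(320)]

print(is_periodic(tone_frame), is_periodic(noise_frame))
```

As with the energy comparison, the outcome rests on a single fixed threshold, and a periodic non-speech signal such as a sustained musical note would also pass this test.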
However, such conventional methods detect a speech section simply by comparing a value with a threshold. Thus, when noise is mixed in and the feature vectors change significantly, errors may occur and accurate detection of speech sections may become difficult. Furthermore, the conventional methods only distinguish voice from non-voice, and thus it is difficult to apply them to media contents where music, sound effects, etc. coexist.
Furthermore, technology for distinguishing voice from music is being developed as a preprocessing technology for improving the performance of voice recognition systems. Among the existing voice/music classification methods, methods have been suggested that distinguish voice from music using the change in rhythm over time, which may be considered a main characteristic of music. However, such methods rely on the principle that music changes relatively slowly compared to voice and changes at relatively constant intervals, and thus their performance may vary significantly depending on the type of music, for example when the tempo becomes fast or the musical instruments change.
Furthermore, methods have been studied that statistically extract feature vectors having voice/music classification characteristics from a voice and music database (DB) and classify voice/music using a classifier trained on the extracted feature vectors. However, achieving high-performance voice/music classification with such methods requires a learning step: a large amount of data needs to be secured for learning, and statistical feature vectors need to be extracted from that data. Thus, a lot of effort and time are needed for securing data, extracting valid feature vectors, and learning.