1. Field of the Invention
The present invention relates to a technology for segmenting video and audio into clips. More particularly, the present invention relates to segmenting video and audio into clips by using speaker recognition to divide the audio and video.
2. Brief Description of the Related Art
Videos today carry more and more information and vary widely in content. It is difficult for viewers to quickly retrieve important content from such numerous and varied videos. Generally, videos on the internet are segmented manually, which makes it easier for a user to retrieve their contents. To deal with large numbers of videos, it is important to develop a technology for automatically segmenting video and audio.
Conventional technology for automatically segmenting audio and video relies on the video signal: a particular image is first detected, analyzed, and classified, and the audio and video are then segmented into clips. A conventional technology of “Anchor Person Detection For Television News Segmentation Based On Audiovisual Features” is disclosed in Taiwan Patent No. 1283375, as shown in FIG. 1. The conventional technology comprises steps of: scanning pixels of video frames with a first horizontal scan line to determine if colors of the pixels fall within a predetermined color range; creating a color map utilizing pixels located on the first horizontal scan line from a plurality of successive video frames; labeling the current video segment as a candidate video segment if the color map indicates the presence of a stable region of pixels falling within the predetermined color range for a predetermined number of successive video frames; and performing histogram color comparisons on the stable regions for detecting shot transitions. Audio signals of the video clips may also be analyzed to further verify the candidate video segments. However, this conventional method analyzes the color distribution along a scan line and therefore depends on pixel values for segmenting videos. If the video content changes frequently, the segmentation accuracy is low.
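For illustration only, the scan-line steps described above may be sketched as follows. This is a minimal sketch and not the implementation of the cited patent: the function names, the RGB pixel representation, and the parameters `min_frames` and `min_width` are assumptions introduced here, and the histogram comparison step is omitted.

```python
# Hypothetical sketch; names, data layout and thresholds are illustrative assumptions.

def in_range(pixel, lo, hi):
    """Return True if every channel of an (R, G, B) pixel lies within [lo, hi]."""
    return all(l <= c <= h for c, l, h in zip(pixel, lo, hi))


def color_map(frames, row, lo, hi):
    """Build one boolean row per frame marking scan-line pixels inside the color range.

    Each frame is a list of rows, and each row is a list of (R, G, B) tuples.
    """
    return [[in_range(p, lo, hi) for p in frame[row]] for frame in frames]


def has_stable_region(cmap, min_frames, min_width):
    """Return True if at least min_width adjacent columns stay in range
    for at least min_frames successive frames."""
    if not cmap:
        return False
    streak = [0] * len(cmap[0])  # successive in-range frames per column
    for line in cmap:
        streak = [s + 1 if hit else 0 for s, hit in zip(streak, line)]
        run = 0  # width of the current run of stable columns
        for s in streak:
            run = run + 1 if s >= min_frames else 0
            if run >= min_width:
                return True
    return False
```

A segment would be labeled a candidate when `has_stable_region(color_map(frames, row, lo, hi), min_frames, min_width)` is true. Because the decision rests on a single row of pixels, any frequent change of the picture resets the per-column streaks, which is consistent with the accuracy limitation noted above.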
Another conventional automatic segmenting method uses audio signals for segmenting the videos. A conventional technology of “Method of real-time speaker change point detection, speaker tracking and speaker model construction” is disclosed in U.S. Pat. No. 7,181,393 B2, as shown in FIG. 2. The method comprises two stages. In the pre-segmenting stage, the covariance of the feature vectors of each segment of speech is first built. A distance is determined based on the covariances of the current segment and the previous segment, and the distance is used to determine whether there is a potential speaker change between these two segments. If there is no speaker change, the model of the currently identified speaker is updated by incorporating data of the current segment. Otherwise, if there is a potential speaker change, a refinement process uses additional audio characteristics to calculate a hybrid probability. A particular probability determination mechanism is then applied to confirm whether there is a speaker change point. However, this method has to calculate distances for a plurality of audio characteristics between two adjacent segments and therefore requires a large amount of computation, which makes it difficult to apply in practice.
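For illustration only, the pre-segmenting stage described above may be sketched as follows. This is a simplified sketch under stated assumptions, not the method of the cited patent: it keeps only the diagonal of the covariance (per-dimension variances), uses a divergence-shaped distance chosen here for brevity, and omits the refinement stage and the speaker model update. The function names and the `threshold` parameter are hypothetical.

```python
# Hypothetical sketch; the diagonal covariance and the distance are simplifying assumptions.
from statistics import pvariance


def diag_cov(segment):
    """Per-dimension variance of a segment given as a list of feature vectors."""
    return [pvariance(dim) for dim in zip(*segment)]


def cov_distance(var_a, var_b, eps=1e-9):
    """Symmetric, non-negative distance between two diagonal covariances.

    It is zero when the variances are identical and grows as they diverge.
    """
    return 0.5 * sum(
        (a - b) ** 2 / ((a + eps) * (b + eps)) for a, b in zip(var_a, var_b)
    )


def potential_changes(segments, threshold):
    """Indices i at which segment i differs from segment i - 1 by more than threshold."""
    covs = [diag_cov(s) for s in segments]
    return [
        i
        for i in range(1, len(covs))
        if cov_distance(covs[i - 1], covs[i]) > threshold
    ]
```

Even in this reduced form, a distance is evaluated for every pair of adjacent segments. With full covariance matrices and several audio characteristics per pair, the computational cost grows accordingly, which is the drawback noted above.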