With the rapid development of video conferencing technologies, multipoint video conferences have the same need for conference minutes as ordinary conferences, in which minutes are produced manually. An existing product can automatically record the content of an entire conference, such as audio, video, and data, during a video conference. However, if only audio and video data is recorded, the requirement of organizing the conference minutes by speaker cannot be met when important or specific content of the conference needs to be reviewed.
In an ongoing video conference, if it can be determined that only one person speaks throughout a voice file, the voice data of the entire file can be sent directly to a voiceprint identification system for identification. If the voice file contains the voices of more than one person, the file needs to be segmented, and voiceprint identification needs to be performed on each segment of the voice data. An existing voiceprint identification system generally requires more than 10 seconds of voice data, and the longer the voice data is, the higher the accuracy is. Therefore, a segment cannot be too short when the voice data is segmented. However, because free-talk scenarios are common in video conferences, a relatively long segment may include the voices of more than one person. When a segment containing the voices of more than one person is sent to the voiceprint identification system, the identification result is unreliable.
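The segmentation constraint described above can be illustrated with a minimal sketch. The text does not specify how an existing system segments a voice file, so fixed-length segmentation is assumed here purely for illustration. The function name `plan_segments`, the 15-second default segment length, and the rule of merging a short trailing segment into the previous one are all hypothetical. Only the 10-second threshold comes from the text.

```python
from typing import List, Tuple

# Taken from the "more than 10 seconds" figure in the text.
MIN_SEGMENT_SECONDS = 10.0


def plan_segments(
    total_seconds: float,
    single_speaker: bool,
    segment_seconds: float = 15.0,  # hypothetical default, not from the text
) -> List[Tuple[float, float]]:
    """Return (start, end) time ranges to send for voiceprint identification."""
    if total_seconds <= 0:
        return []
    # One speaker in the whole file: send the entire file as a single range.
    if single_speaker:
        return [(0.0, total_seconds)]
    if segment_seconds <= MIN_SEGMENT_SECONDS:
        raise ValueError("segment_seconds must exceed the 10-second minimum")

    # Assumed strategy: cut the file into fixed-length segments.
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        segments.append((start, end))
        start = end

    # A trailing segment of 10 seconds or less is too short to identify
    # reliably, so it is merged into the preceding segment.
    if len(segments) > 1 and segments[-1][1] - segments[-1][0] <= MIN_SEGMENT_SECONDS:
        tail = segments.pop()
        prev = segments.pop()
        segments.append((prev[0], tail[1]))
    return segments
```

For example, `plan_segments(50.0, single_speaker=False)` yields three ranges, the last of which absorbs the 5-second remainder. The sketch also shows the limitation noted above: the fixed-length cuts ignore speaker changes, so in a free-talk scenario a single range may still contain the voices of more than one person.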