Video conferences, which include audio and visual components, are increasingly used to facilitate meetings and to share information during those meetings. Typically, audio-based speaker segmentation is performed on the audio components of a video conference to identify different speakers, e.g., to extract metadata associated with the speakers in a conference. Because extrinsic factors may affect the performance of an audio-based speaker segmentation algorithm, fluctuations in the voice of a single speaker are often attributed by the algorithm to more than one speaker. By way of example, extrinsic factors such as head movement of a speaker, movement of the speaker with respect to a microphone, and/or background noise may have an adverse effect on the accuracy of an audio-based speaker segmentation algorithm.