Speaker Change Detection (SCD) is the task of detecting, in an audio stream, the points at which the speaker changes during a conversation. An efficient and accurate speaker change detector can be used to partition conversations into homogeneous segments, where only one speaker is present in each segment. Speaker recognition or verification can then be performed on the clustered speaker segments, rather than on a frame-by-frame basis, to improve accuracy and reduce cost. However, SCD is challenging when the system has no prior information regarding the speakers, and it is usually required to detect a speaker change in real time, within a predetermined delay limit, e.g., within 1 or 2 seconds of speech.
SCD can be divided into retrospective and real-time detection. Retrospective detection is normally based on training models for the speakers, such as Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), combined with a detection algorithm. It includes approaches with different thresholding criteria, such as the Bayesian Information Criterion (BIC) and Kullback-Leibler (KL)-based metrics. In the case of real-time detection, the speaker change decision has to be made using limited preceding data and at low computational cost. Research has focused on improving features and developing efficient distance metrics. Lu et al. (Lie Lu and Hong-Jiang Zhang, “Speaker change detection and tracking in real-time news broadcasting analysis,” in Proceedings of the tenth ACM international conference on Multimedia. ACM, 2002, pp. 602-610) obtained reliable change detection in real-time news broadcasting with a Bayesian feature fusion method. In an evaluation using TIMIT synthesized data, Kotti et al. (Margarita Kotti, Luis Gustavo P M Martins, Emmanouil Benetos, Jaime S Cardoso, and Constantine Kotropoulos, “Automatic speaker segmentation using multiple features and distance measures: A comparison of three approaches,” in IEEE ICME '06) reported a mean F1 score of 0.72 and observed a significant drop in accuracy for speaker changes with durations of less than 2 seconds. Another work, by Ajmera et al. (Jitendra Ajmera, Iain McCowan, and Herve Bourlard, “Robust speaker change detection,” IEEE Signal Processing Letters, 2004), reported 81% recall and 22% precision using BIC and log-likelihood ratios on 3 hours of HUB-4-1997 news data.
There therefore remains a need for improved systems and methods that can detect speaker change in an audio stream using limited preceding data with low computational cost.
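By way of illustration only, the BIC thresholding criterion mentioned above for retrospective detection can be sketched as follows. This is a minimal sketch of the textbook ΔBIC test with one full-covariance Gaussian per segment, not the method of any of the cited works. The function name `delta_bic`, the `penalty_weight` parameter, the small regularization constant, and the synthetic feature frames are all assumptions introduced for the example. A positive value suggests that two separate models explain the window better than one, i.e., that a speaker change occurs at the candidate split.

```python
import numpy as np


def delta_bic(frames, split, penalty_weight=1.0):
    """Hypothetical helper: textbook delta-BIC for one candidate change point.

    frames: array of shape (n_frames, n_dims) holding feature vectors.
    split:  index of the candidate change point inside the window.
    Returns a score; a positive score favors the two-speaker hypothesis.
    """
    n, d = frames.shape
    left, right = frames[:split], frames[split:]

    def logdet_cov(x):
        # Maximum-likelihood covariance, lightly regularized (assumed constant)
        # so the log-determinant stays finite for short segments.
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from modeling the two halves separately.
    gain = 0.5 * (
        n * logdet_cov(frames)
        - len(left) * logdet_cov(left)
        - len(right) * logdet_cov(right)
    )
    # Penalty for the extra mean vector and covariance matrix.
    penalty = 0.5 * penalty_weight * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


if __name__ == "__main__":
    # Synthetic frames standing in for real acoustic features (assumption).
    rng = np.random.default_rng(0)
    speaker_a = rng.normal(loc=0.0, size=(200, 4))
    speaker_b = rng.normal(loc=3.0, size=(200, 4))
    window = np.vstack([speaker_a, speaker_b])
    print("delta BIC at true boundary:", delta_bic(window, 200))
```

Because the test needs enough frames on both sides of the candidate split to estimate two covariance matrices, a criterion of this kind fits retrospective analysis more naturally than detection under a short delay limit.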