Speaker diarization refers to the technical field of using a computing system to execute mathematical algorithms (e.g., machine learning algorithms) to identify, within a single digital audio file, segments of audio that correspond to different speakers and to link segments from the same speaker. Whereas speaker segmentation may refer simply to the task of identifying speaker turns (e.g., points at which the conversation changes from one speaker to the next), the end result of diarization may also be used in combination with automatic speaker recognition technology to identify (name) the speaker associated with each of the different-speaker segments. Conventional speaker diarization approaches typically seek to segment the entire file; i.e., to identify all changes in speaker turns and all speakers that occur within the audio file. For example, speaker diarization has been used to automatically annotate speaker turns in TV and radio transmissions and, with the aid of automatic speech recognition technology, to provide automatic transcription of “who said what” in conference meetings, a task traditionally performed by hand. Most research in the area of automated speaker diarization has focused on the use of unsupervised machine learning techniques.
Speech activity detection (SAD) can be used as a pre-processing step to pre-identify audio segments that contain speech (e.g., as opposed to pauses). After SAD is performed, the audio segments can be analyzed further. Metrics that can be used to quantify the objectives and/or effectiveness of speaker diarization systems include precision and recall. Precision refers to a measure of the accuracy with which audio is associated with a particular speaker, i.e., the proportion of the audio labeled as the speaker that is correctly labeled. Precision takes into account false alarms (segments incorrectly identified as associated with a particular speaker). Recall, on the other hand, refers to a measure of the speech correctly identified with a particular speaker in relation to the total amount of speech from that speaker (i.e., how much of the speaker's speech was found), and therefore takes into account the miss rate (how much speech from the speaker was missed). High precision can be desirable for some applications, such as speaker recognition, while high recall can be important for other applications (e.g., transcription of meetings).
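The precision and recall definitions above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the audio has been divided into equal-length frames, that each frame carries at most one speaker label (None for non-speech), and that the hypothesized labels have already been mapped onto the reference speaker names. The function name precision_recall and the sample labels are hypothetical and are not taken from any particular diarization system.

```python
def precision_recall(reference, hypothesis, speaker):
    """Frame-level precision and recall for one speaker.

    reference, hypothesis: equal-length sequences of per-frame labels
    (None marks a frame with no speech). Hypothetical helper for illustration.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    # Frames correctly attributed to the speaker.
    correct = sum(1 for r, h in zip(reference, hypothesis)
                  if r == speaker and h == speaker)
    # All frames the system attributed to the speaker (correct + false alarms).
    labeled = sum(1 for h in hypothesis if h == speaker)
    # All frames in which the speaker actually spoke (correct + misses).
    actual = sum(1 for r in reference if r == speaker)

    precision = correct / labeled if labeled else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# Made-up example: ten frames, two speakers, one non-speech frame.
reference = ["A", "A", "A", "A", "B", "B", None, "A", "A", "B"]
hypothesis = ["A", "A", "B", "A", "A", "B", None, "A", None, "B"]

precision, recall = precision_recall(reference, hypothesis, "A")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this made-up example, speaker A is hypothesized in five frames, four of them correctly (one false alarm), and actually speaks in six frames (two misses), so precision is 4/5 and recall is 4/6.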