The goal of a diarization system is to extract information about the speakers in an audio document, such as the number of speakers, their turns, and the timing of those turns. In more detail, a diarization system first finds the speaker turns and then extracts the corresponding information about the homogeneous speech segments attributed to a single speaker. The state-of-the-art approach for finding such turns is to slide a fixed-length window over the audio and test whether the speaker changes inside that window. This brute-force approach estimates two Gaussian models, one for the left and one for the right sub-segment of the window, and then compares the corresponding statistics. Once these candidate speaker turns are found, a second module assigns the resulting sub-segments to speaker clusters.

Most often, the brute-force system produces a large number of false positives (i.e., the system finds a speaker turn where there is no actual turn). To remove some of these false positives, current systems rely on the clustering post-processing scheme. However, less accurate turn detection hurts the clustering process as well.

There are several problems with this approach. First, the audio in the window may contain non-verbal acoustic cues, such as silence, noise, or background speech. These cues cause artifacts that may skew the estimated statistics, making the turn detection process noisier. The same artifacts can also degrade the clustering, lowering the overall diarization performance. Second, the detected turns are often found in the middle of a word, which lowers ASR performance when diarization is combined with an automatic speech recognition (ASR) system. Specifically, there is no constraint on where a speaker turn can be placed, so a turn may fall in the middle of a word or even at a transition from silence to speech (and vice versa).
These ad-hoc turns create discontinuities in the speech flow, and ASR performance degrades as a result. Finally, there is no constraint on the length of the two sub-segments. When a sub-segment is too short, its statistics are estimated sub-optimally, which affects all subsequent processes as well.
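
The sliding-window procedure described above can be illustrated with a minimal sketch. The text does not specify how the two Gaussian models are compared, so the sketch assumes a Bayesian information criterion (ΔBIC) test with full-covariance Gaussians, which is one common choice and not necessarily the one used by any particular system. The function names (delta_bic, detect_turns), the window length, the hop size, and the penalty weight are all hypothetical, and the input is assumed to be a matrix of frame-level acoustic features (one row per frame).

```python
import numpy as np


def delta_bic(window, penalty=1.0):
    """Score a speaker change at the midpoint of `window` (frames x dims).

    Compares one Gaussian fitted to the whole window against two Gaussians
    fitted to the left and right halves. A positive value favours a change.
    """
    n, d = window.shape
    half = n // 2
    left, right = window[:half], window[half:]

    def logdet_cov(x):
        # Maximum-likelihood covariance, lightly regularised for stability.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov + 1e-6 * np.eye(d))
        return logdet

    # Log-likelihood gain obtained by modelling the halves separately.
    gain = 0.5 * (
        n * logdet_cov(window)
        - len(left) * logdet_cov(left)
        - len(right) * logdet_cov(right)
    )
    # Penalty for the extra mean and covariance parameters of the second model.
    n_params = d + d * (d + 1) / 2
    return gain - 0.5 * penalty * n_params * np.log(n)


def detect_turns(features, win=200, hop=10, penalty=1.0):
    """Slide a fixed-length window and collect candidate speaker turns.

    Returns a list of (frame_index, score) pairs, one for every window whose
    midpoint scores above zero. No constraint is placed on where the
    candidates fall, which mirrors the brute-force behaviour described above.
    """
    candidates = []
    for start in range(0, len(features) - win + 1, hop):
        score = delta_bic(features[start:start + win], penalty)
        if score > 0:
            candidates.append((start + win // 2, score))
    return candidates
```

Calling detect_turns on a feature matrix returns every window midpoint that passes the test, so a single true turn typically yields several neighbouring candidates. Note also that nothing in the sketch excludes silence or noise frames from the statistics or aligns candidates with word boundaries, which is exactly where the problems listed above arise.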