The analysis of vocal signals requires, in particular, the ability to represent a speaker. Modeling a speaker by a mixture of Gaussians (“Gaussian Mixture Model” or GMM) is an effective way of representing the acoustic or vocal identity of that speaker. According to this technique, the speaker is represented, in an acoustic reference space of predetermined dimension, by a weighted sum of a predetermined number of Gaussians.
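As an illustration of the weighted sum of Gaussians described above, the following minimal sketch evaluates the log-density of one acoustic vector under a GMM and counts the values the model stores. It assumes diagonal covariance matrices, which the text does not specify, and the function names `gmm_log_density` and `gmm_parameter_count` are hypothetical.

```python
import numpy as np


def gmm_log_density(x, weights, means, variances):
    """Log-density of one acoustic vector under a weighted sum of Gaussians.

    Assumption: each Gaussian has a diagonal covariance matrix, so it is
    described by a mean vector and a variance vector of dimension D.
    Shapes: x (D,), weights (M,), means (M, D), variances (M, D).
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    dim = means.shape[1]

    # Log of each Gaussian's normalisation constant and exponent term.
    log_norm = -0.5 * (dim * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_expo

    # Log-sum-exp over the M components, shifted by the maximum for stability.
    peak = np.max(log_components)
    return float(peak + np.log(np.sum(np.exp(log_components - peak))))


def gmm_parameter_count(num_gaussians, dim):
    """Stored values under the diagonal-covariance assumption:
    one weight, D means and D variances per Gaussian."""
    return num_gaussians * (2 * dim + 1)
```

Under this assumption the number of stored values grows as M(2D + 1) for M Gaussians in a space of dimension D, which gives a sense of why storage and calculation costs become a concern when M is large.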
This type of representation is accurate when a large amount of data is available and when there are no physical constraints on the storage of the model parameters or on the execution of calculations on these numerous parameters.
In practice, however, when a speaker is to be represented within IT systems, the time during which the speaker talks turns out to be short, and both the memory required for these representations and the calculation times associated with these parameters are too large.
It is therefore important to seek to represent a speaker in such a way as to drastically reduce the number of parameters required for the representation while maintaining correct performance. Performance is understood here as the error rate, that is, the number of vocal sequences wrongly recognized as belonging or as not belonging to a speaker, relative to the total number of vocal sequences.
Solutions in this regard have been proposed, in particular in the document “SPEAKER INDEXING IN LARGE AUDIO DATABASES USING ANCHOR MODELS” by D. E. Sturim, D. A. Reynolds, E. Singer and J. P. Campbell. Specifically, the authors propose that a speaker be represented not in an absolute manner in an acoustic reference space, but in a relative manner with respect to a predetermined set of representations of reference speakers, also called anchor models, for which GMM-UBM models are available (UBM standing for “Universal Background Model”). The proximity between a speaker and the reference speakers is evaluated by means of a Euclidean distance. This considerably reduces the computational load, but the performance remains limited and inadequate.
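The relative representation described above can be sketched as follows. The sketch rests on two assumptions that the text does not spell out: each coordinate of the speaker's vector is taken to be the average per-frame log-likelihood ratio between one anchor model and the UBM, and two such vectors are compared with a Euclidean distance. The per-frame log-likelihoods are supplied as arrays, and the function names `anchor_space_vector` and `euclidean_distance` are hypothetical.

```python
import numpy as np


def anchor_space_vector(anchor_frame_loglik, ubm_frame_loglik):
    """Represent an utterance relative to E anchor models.

    Assumption: coordinate i is the mean over the T frames of
    log p(frame | anchor_i) - log p(frame | UBM).
    Shapes: anchor_frame_loglik (E, T), ubm_frame_loglik (T,).
    """
    anchor_frame_loglik = np.asarray(anchor_frame_loglik, dtype=float)
    ubm_frame_loglik = np.asarray(ubm_frame_loglik, dtype=float)
    # Broadcasting subtracts the UBM score of each frame from every anchor row.
    return np.mean(anchor_frame_loglik - ubm_frame_loglik, axis=1)


def euclidean_distance(u, v):
    """Euclidean distance between two vectors in the anchor space."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))
```

With this kind of representation a speaker is described by E coordinates, one per anchor model, rather than by the full set of GMM parameters, which is the source of the reduced computational load.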