Speech intelligibility is the psychoacoustic metric that quantifies the proportion of an uttered signal correctly understood by a given subject. Recognition tasks range from phones, syllables and words up to entire sentences. The ability of a listener to retrieve speech features is subject to external factors, such as competing acoustic sources, their respective spatial distribution or the presence of reverberant surfaces, as well as internal factors, such as prior knowledge of the message, hearing loss and attention. The study of this paradigm, referred to as the “cocktail party effect” by Cherry in 1953, has motivated numerous studies.
Formerly known as the Articulation Index of French and Steinberg (1947), which resulted from Fletcher's lifelong discoveries and intuition, the Speech Intelligibility Index (SII, ANSI 1997) aims at quantifying the amount of speech information left available after frequency filtering or masking of speech by stationary noise. It is correlated with intelligibility, and mapping functions to the latter have been established for different recognition tasks and speech materials. Similarly, Steeneken and Houtgast (1980) developed the Speech Transmission Index, which predicts the impact of reverberation on intelligibility from the speech envelope. Durlach proposed in 1963 the Equalization and Cancellation theory, which aims at modelling the advantage of binaural over monaural listening that arises when acoustic sources are spatially distributed. The variability of the experimental methods used inspired Boothroyd and Nittrouer, who initiated in 1988 an approach to quantify the predictability of a message. They established the relation between the recognition probability of an element and that of the whole it composes.
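As an illustration of the last point, the relations usually attributed to Boothroyd and Nittrouer (1988) can be sketched as follows: the probability of recognizing a whole is the probability of recognizing one of its parts raised to a power j, and the probability of recognizing an element in context is 1 - (1 - p)^k, where p is the probability without context. The function names and the numerical values below are illustrative assumptions and are not taken from the text above.

```python
import math


def whole_from_parts(p_part: float, j: float) -> float:
    """Probability of recognizing a whole from that of one part: p_whole = p_part ** j."""
    return p_part ** j


def j_factor(p_part: float, p_whole: float) -> float:
    """Effective number of independent parts, j = log(p_whole) / log(p_part)."""
    if not (0.0 < p_part < 1.0 and 0.0 < p_whole < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    return math.log(p_whole) / math.log(p_part)


def with_context(p_no_context: float, k: float) -> float:
    """Recognition probability with context: 1 - (1 - p_no_context) ** k."""
    return 1.0 - (1.0 - p_no_context) ** k


# Hypothetical values: three-phoneme nonsense syllables, phoneme score of 0.9.
p_syllable = whole_from_parts(0.9, 3)   # three independent parts
j_estimate = j_factor(0.9, p_syllable)  # recovers j from the two scores
p_context = with_context(0.5, 2)        # context acting like a doubling of channels
```

A value of j close to the number of parts suggests that the parts are recognized independently, whereas a smaller j suggests that context makes the whole more predictable than its parts taken separately.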
However accurate these methods have proven to be, they apply only to maskers with stationary properties. The very common case in which the competing acoustic source is another source of speech cannot be handled by these methods, as speech is non-stationary by definition. Meanwhile, communication with multiple speakers is bound to increase, while non-stationary sources severely impair listeners with hearing loss, the latter emphasizing the cocktail party effect.
If one aims at predicting situations that vary over time, it is necessary to include time as a variable in the models, and consequently these should progressively become signal-based. In 2005, Rhebergen and Versfeld proposed a conclusive method for the case of time-fluctuating noises. However, the question of speech in competition with speech remains. Voice similarity, utterance rate and cross semantics are some of the features that, in addition to the variability in attention, introduce artifacts in the listener's recognition performance.
Generative models such as Gaussian Mixture Models are known (see, e.g., McLachlan, G. J. and Basford, K. E., “Mixture Models: Inference and Applications to Clustering”, Marcel Dekker (1988)).
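For readers unfamiliar with such models, a one-dimensional Gaussian Mixture Model can be sketched as a weighted sum of Gaussian densities from which samples are drawn by first picking a component and then sampling from it. The sketch below uses only the Python standard library; the function names and parameter values are illustrative assumptions and do not describe any particular implementation.

```python
import math
import random


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x: float, weights, means, stds) -> float:
    """Mixture density: weighted sum of the component densities (weights sum to 1)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def gmm_sample(n: int, weights, means, stds, seed: int = 0):
    """Draw n samples: pick a component by its weight, then sample from it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


# Hypothetical two-component mixture.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
density_at_zero = gmm_pdf(0.0, weights, means, stds)
draws = gmm_sample(5, weights, means, stds)
```

The model is generative in the sense that, once the weights, means and standard deviations are known, new data can be drawn from it in addition to evaluating the likelihood of observed data.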