The ability to recognize a voiced sound pattern (e.g., a keyword or a phrase), as vocalized by a particular speaker, is a basic function of the human auditory system. However, this psychoacoustic hearing task is difficult to reproduce using previously known machine-listening technologies because spoken communication often occurs in adverse acoustic environments that include ambient noise, interfering sounds, and background chatter of other speakers. The problem is further complicated because there is often some variation in how a particular speaker vocalizes multiple instances of the same voiced sound pattern (VSP). Nevertheless, as a hearing task, the unimpaired human auditory system is able recognize VSPs vocalized by a particular speaker effectively and perceptually instantaneously.
As a previously known machine-listening process, recognition of a VSP as vocalized by a particular speaker includes detecting and then matching a VSP to the vocal characteristics of the particular speaker. Known processes that enable detection and matching are computationally complex, use large memory allocations, and yet remain functionally limited and highly inaccurate. One persistent problem includes an inability to sufficiently train a detection and matching system using previously known machine-listening technologies. For example, hierarchical agglomerative clustering (HAC) processes have been previously used to segment a single vocalization instance of a VSP from a particular speaker as a part of a training process. However, a single vocalization instance does not provide a sufficient amount of information to reliably train a VSP detection module.
Due to the computational complexity and memory demands, previously known VSP detection and speaker matching processes are characterized by long delays and high power consumption. As such, these processes are undesirable for low-power, real-time and/or low-latency devices, such as hearing aids and mobile devices (e.g., smartphones, wearables, etc.).