The ability to recognize a voiced sound pattern (e.g., a keyword or a phrase), as vocalized by a particular speaker, is a basic function of the human auditory system. However, this psychoacoustic hearing task is difficult to reproduce using previously known machine-listening technologies because spoken communication often occurs in adverse acoustic environments that include ambient noise, interfering sounds, and background chatter of other speakers. The problem is further complicated because there is often variation in how a particular speaker vocalizes the same voiced sound pattern (VSP) in different instances. Nevertheless, as a hearing task, the unimpaired human auditory system is able to recognize VSPs vocalized by a particular speaker effectively and perceptually instantaneously.
As a previously known machine-listening process, recognition of a VSP as vocalized by a particular speaker includes detecting a VSP and then matching it to the vocal characteristics of the particular speaker. Known processes that enable detection and matching are computationally complex and use large memory allocations, yet remain functionally limited and highly inaccurate. One persistent problem is the inability to sufficiently train a detection and matching system using previously known technologies. In particular, previously known technologies are limited to using a single vocalization instance at a time during the training process, because the processes employed cannot jointly utilize multiple vocalization instances without excessive multiplicative increases in computational complexity and memory demands. However, a single vocalization instance does not provide a sufficient amount of information to reliably train a VSP detection module.
Due to the computational complexity and memory demands, previously known VSP detection and speaker matching processes are characterized by long delays and high power consumption. As a result, these processes are undesirable for low-power, real-time, and/or low-latency devices, such as hearing aids and mobile devices (e.g., smartphones, wearables, etc.). Also, the performance of previously available systems degrades disproportionately as the signal-to-noise ratio (SNR) decreases.