The ability to recognize the speech of a particular speaker is a basic function of the human auditory system. However, this function is notoriously difficult to reproduce using previously known machine-listening technologies because spoken communication often occurs in adverse acoustic environments. The problem is further complicated because the way a person speaks the same words often varies from one utterance to the next. Nevertheless, the unimpaired human auditory system is able to recognize speech effectively and perceptually instantaneously.
As a previously known machine-listening process, speech recognition (and subsequent re-synthesis) often includes recognizing phonemes using statistical formalisms such as neural networks. Phonemes are a basic representation of information-bearing vocalizations. However, the previously known neural network approaches have a number of drawbacks. First, for example, previously known neural network approaches depend heavily on language-specific models in order to improve performance, which makes such approaches language-dependent. Second, many of the previously known neural network approaches recognize phonemes too slowly for real-time and/or low-latency applications because they rely on look-ahead information in order to provide context. Third, previously known neural network approaches are becoming increasingly computationally complex and use ever-larger memory allocations, yet remain functionally limited and highly inaccurate, especially for problematic phonemes that are difficult to detect and are frequently misidentified as other similar-sounding phonemes.
Due to increasing computational complexity and memory demands, previously known phoneme recognition neural network approaches are characterized by long delays and high power consumption. As such, these approaches are undesirable for low-power, real-time, and/or low-latency devices, such as hearing aids and mobile devices (e.g., smartphones, wearables, etc.).