Modern systems are increasingly voice-centric and in many cases rely on voice-based security approaches, including Automated Voice Recognition (AVR) and the like, to ensure that a current speaker is an authorized user of the system. Although various approaches realize moderate success in ensuring that a received audio sample matches a previously enrolled audio sample or corresponding voice model, conventional systems are vulnerable to “spoofing” attacks in which a fraudulent user may employ techniques such as voice conversion, speech synthesis, and replay attacks to substantially approximate the authentic enrollee. Fraudulent replay attacks, for example, are easy to generate and require no expertise in speech processing or machine learning. With the use of high-quality playback and recording devices, it is conceivable to make replay attacks indistinguishable from genuine access in conventional systems.
Constant Q Cepstral Coefficients (CQCCs) are perceptually-inspired time-frequency analysis acoustic features that have been found to be powerful at detecting voice spoofing attacks, namely audio playback, voice conversion and morphing, and speech synthesis attacks. (See, e.g., Todisco et al., “A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients,” Odyssey 2016.) However, drawbacks of the conventional technique for obtaining CQCCs include high costs in terms of memory usage and processing time. Moreover, conventional systems employing CQCC features discriminate only between spoofed and non-spoofed utterances.