As more applications and services are made available through voice interfaces, speech technology is increasingly being used to classify audio data as authentic or inauthentic.
Speech synthesis is advancing as well. For example, users of voice interfaces can converse with human-sounding artificial intelligence (AI) to schedule appointments, make reservations, add items to a shopping cart, and so on. Speech technology is also being adopted across the financial, government, health, and educational sectors for its implications for security and customer experience.
While advances in speech technology have numerous beneficial applications, the confluence of improved speech synthesis and the pervasiveness of voice as a mode of interaction creates an opportunity for speech-based fraud committed for financial gain, dissemination of propaganda, identity fraud, and more.
Various approaches have been used to classify speech data. In many cases, however, these approaches rely on a selection of top-down acoustic features that differ between authentic and inauthentic speech data. Selecting such features can be time-consuming, and the resulting approaches may not keep pace with the rapid developments in speech synthesis, or at least may provide less than optimal classification results. This is particularly troubling as voice interfaces are increasingly deployed in environments where sensitive information is exchanged. A need exists, therefore, for systems, methods, and devices that overcome these disadvantages.