In many automatic speaker verification applications, such as voice biometric authentication, protecting against malicious spoofing attacks requires reliably differentiating speech produced by a human talker from speech produced by an artificial talker. One example is voice biometric authentication for smart home applications. A smart home device is an electronic device configured to receive user speech input, process the speech input, and take an action based on the speech input. FIG. 1 shows a smart home device in a room. A living room 100 may include a smart home device 104. The smart home device 104 may include a microphone or an array of microphones, a speaker, and electronic components for receiving speech input. Individuals 102A and 102B may be in the room and communicating with each other or speaking to the smart home device 104. Individuals 102A and 102B may be moving around the room, moving their heads, putting their hands over their faces, or taking other actions that change how the smart home device 104 receives their voices. In this scenario, voice biometric authentication may be used to identify a speaker interacting with the smart home device and provide some security.
However, conventional voice biometric authentication is susceptible to spoofing attacks. In a spoofing attack, an unauthorized user attempts to impersonate an authorized user to gain access to the authorized user's account. Spoofing of a smart home device may include the unauthorized user playing back recorded speech of the authorized user from a playback device such as a mobile phone. Such a recording could be easily obtained when the unauthorized user is in the room 100 while the authorized user passes the voice authentication by speaking to the device 104, as shown in FIG. 1. After the authorized user leaves, the unauthorized user may play back the recorded voice commands. The unauthorized user may also send the voice recordings to other unauthorized users, who may then spoof the authorized user on any similar smart home devices.
Conventional automatic speaker verification systems rely on spectral features such as mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear prediction (PLP), and i-vectors, which are usually modeled using Gaussian mixture models (GMMs) and then classified. Conventional single-microphone anti-spoofing methods attempt to detect differences in MFCCs, pitch, relative phase shift, spectral modulation, and channel pattern noise between real and synthetic speech. Some methods make use of cut-and-paste detection and repeated sample detection to differentiate authentic speech from playback. However, automatic speaker verification systems are still susceptible to spoofing attacks. In particular, conventional single-microphone authentication methods rely on spectral information alone and are vulnerable to replay attacks, in which a high-quality loudspeaker is used to play a recording of the desired talker and gain unauthorized access.
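By way of illustration only, the conventional spectral-feature approach described above may be sketched as a log-likelihood-ratio test between a talker GMM and a background GMM. This is a minimal sketch under stated assumptions and is not taken from any particular system: the function names, the two-dimensional feature vectors (standing in for MFCC frames), the model parameters, and the threshold are all hypothetical. The replayed frames are assumed to be near-copies of the live frames, consistent with playback through a high-quality loudspeaker.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def verify(frames, talker_gmm, background_gmm, threshold=0.0):
    """Accept when the mean per-frame log-likelihood ratio exceeds threshold."""
    llr = sum(
        gmm_loglik(f, talker_gmm) - gmm_loglik(f, background_gmm)
        for f in frames
    ) / len(frames)
    return llr > threshold, llr

# Hypothetical models: (weight, mean, variance) per mixture component.
talker_gmm = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [2.0, 0.0], [0.5, 0.5])]
background_gmm = [(1.0, [0.0, 0.0], [4.0, 4.0])]

# Hypothetical feature frames (assumed already extracted from audio).
live_frames = [[1.1, 0.9], [1.9, 0.1], [1.0, 1.2]]
replayed_frames = [[1.15, 0.85], [1.85, 0.15], [1.05, 1.15]]  # near-copy of live
impostor_frames = [[-2.0, -1.5], [-1.0, -2.5], [-3.0, 0.0]]

live_ok, _ = verify(live_frames, talker_gmm, background_gmm)
replay_ok, _ = verify(replayed_frames, talker_gmm, background_gmm)
impostor_ok, _ = verify(impostor_frames, talker_gmm, background_gmm)
```

In this sketch the impostor frames are rejected, whereas the replayed frames are accepted together with the live frames, because a decision based on spectral likelihood alone has no means of distinguishing a faithful recording from the talker's live speech.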
Shortcomings mentioned here are only representative and are included simply to highlight that a need exists for improved electrical components, particularly for audio processing and user authentication employed in consumer-level devices. Embodiments described herein address certain shortcomings but not necessarily each and every one described here or known in the art. Furthermore, embodiments described herein may present other benefits than, and be used in other applications than, those of the shortcomings described above. For example, similar shortcomings may be encountered in other audio devices, such as mobile phones, and embodiments described herein may be used in mobile phones to solve such similar shortcomings as well as other shortcomings.