Speaker recognition systems usually perform well when their users have no malicious intent. However, several ways of attacking speaker recognition systems with malicious intent are known.
These attacks comprise, for example, synthesized attacks and non-synthesized attacks. Synthesized attacks comprise, for example, text-to-speech attacks, in which the audio data is generated by adapting a synthetic voice to a certain speaker. Hidden Markov Models may be used for such text-to-speech attacks. Synthesized attacks also comprise, for example, audio data generated by voice conversion algorithms, which may be used to create artificial signals mimicking a certain speaker. Non-synthesized attacks comprise, for example, recording attacks, e.g. far-field recording attacks, in which utterances of a certain speaker are recorded by a far-field microphone. The recording may then be used directly, or a pass phrase may be extracted from it, for example, by cut-and-paste approaches. Non-synthesized attacks also comprise the imitation of a certain speaker by an impostor who changes one or more parameters of his or her voice to match the characteristics of the speaker to be imitated. Further information on such attacks is provided, for example, by the document “Speaker Verification Performance Degradation against Spoofing and Tampering Attacks” by J. Villalba and E. Lleida, published in Proceedings of FALA 2010, and by the document “Spoofing and Countermeasures for Automatic Speaker Verification” by N. Evans, T. Kinnunen and J. Yamagishi, published in Proceedings of Interspeech 2013.
Any such attack is called a spoof. A spoof based on audio data replayed through a loudspeaker is also called a replay attack. A spoof may also be based on inserting the audio data directly into the speaker recognition system (direct injection).
The prior art includes approaches for detecting certain types of spoofs, such as replay attacks, for example, support vector machine based approaches as described in “Detecting Replay Attacks from Far-Field Recordings on Speaker Verification Systems” by J. Villalba and E. Lleida, published in Proceedings of Biometrics and ID Management-COST 2101 European Workshop, BioID 2011. However, these approaches often work only for certain types of spoofs and/or do not allow adaptation to new conditions.