Voice user interfaces are provided to allow a user to interact with a system using their voice. One advantage of this interface, for example in devices such as smartphones, tablet computers and the like, is that it allows the user to operate at least some aspects of the device in a hands-free manner. Speech recognition techniques, i.e. techniques to extract the words spoken from the voice audio signal, may, for example, be used to detect that a particular trigger phrase has been spoken, which sets the device to expect a spoken command, and then to recognize a command when spoken and perform operations in response. For example, if the spoken command asks for publicly available information, then the interface may cause a query to be submitted to an internet search engine in order to be able to supply that information to the user.
However, in other cases, some level of authentication may be desirable to verify the identity of the user before acting on any command, for example if the spoken command relates to personal information, or requests some financial transaction.
To maintain the generally hands-free mode of user interaction, the voice user interface may comprise some form of speaker recognition, i.e. some analysis of the voice audio input signal to extract characteristics of that signal distinctive to one of one or more users. The identity of the user may thus be verified with a high level of confidence with more security than passwords and more conveniently than other biometric verification methods such as fingerprint or iris patterns.
The accuracy of this user verification may be characterized in terms of a false acceptance rate (FAR) and a false rejection rate (FRR). The FAR quantifies the probability that a different user may be falsely authenticated as an authorized user, with obvious financial security and privacy risks to the proper user. The FRR quantifies the probability that a valid user may be rejected, which causes inconvenience to the user, who may then have to repeat the attempt or use some other form of authentication.
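By way of illustration only, the two rates may be estimated from a set of scored trials as in the following minimal sketch. The function name `far_frr`, the score lists and the convention that a trial is accepted when its score is at or above a threshold are assumptions made for this example and are not taken from the description above.

```python
def far_frr(impostor_scores, genuine_scores, threshold):
    """Return (FAR, FRR) for one acceptance threshold.

    Assumption of this sketch: a trial is accepted when score >= threshold.
    """
    # Impostor trials that were wrongly accepted.
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    # Genuine trials that were wrongly rejected.
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


# Hypothetical scores, for illustration only.
impostor = [0.1, 0.3, 0.6, 0.2]
genuine = [0.9, 0.7, 0.4, 0.8]
far, frr = far_frr(impostor, genuine, threshold=0.5)
print(far, frr)
```

Raising the threshold in such a scheme would generally lower the FAR at the cost of a higher FRR, and vice versa.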
The speaker recognition process may rely on comparing spectral characteristics of the current speech samples with those of previously enrolled speech samples. However, any background noise during authentication attempts may be superimposed on the speaker's voice and may hide or alter spectral features and thus give errors in the comparison. Background noise during enrollment may conversely add features that are absent when authenticating in a quiet environment. These effects may degrade the FAR or FRR, with the undesirable security or user inconvenience consequences described above.
Attempts to mitigate the problem using signal processing to try to remove the noise added to the signal may affect the spectral characteristics of the resultant compensated speech and thus again degrade the accuracy.
According to an embodiment there is provided an apparatus for use in biometric speaker recognition, comprising:
an analyzer for analyzing each frame of a sequence of frames of audio data which correspond to speech sounds uttered by a user to determine at least one characteristic of the speech sound of that frame; and
an assessment module for determining, for each frame of audio data, a contribution indicator of the extent to which that frame of audio data should be used for speaker recognition processing based on the determined at least one characteristic of the speech sound.
In some embodiments the apparatus may comprise a speaker recognition module configured to apply speaker recognition processing to the frames of audio data, wherein the speaker recognition module is configured to process the frames of audio data according to the contribution indicator for each frame.
The contribution indicator may comprise a weighting to be applied to each frame in the speaker recognition processing. In some instances the contribution indicator may comprise a selection of frames of audio data not to be used in the speaker recognition processing.
The speaker recognition processing may comprise processing the frames of audio data for speaker enrollment. The speaker recognition processing may comprise processing the frames of audio data for speaker verification. The speaker recognition processing may comprise processing the frames of audio data for generation of a generalized model of a population of speakers.
The at least one characteristic of the speech sound may comprise identification of the speech sound as one of a plurality of predefined classes of phonemes. The at least one characteristic of the speech sound may comprise identification of the speech sound as a specific phoneme. The contribution indicator for a phoneme or class of phonemes may vary based on the number of previous instances of the same phoneme or class of phoneme in previous frames of audio data.
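As a non-limiting sketch, a contribution indicator that depends on the phoneme class and on the number of previous instances of that class could be computed as follows. The class names, the base weights in `BASE_WEIGHT`, the `decay` factor and the choice to reduce the indicator with each repetition are all hypothetical values and assumptions chosen for this example; the description above states only that the indicator may vary with the number of previous instances.

```python
from collections import Counter

# Hypothetical base weights per phoneme class (illustrative values only).
BASE_WEIGHT = {"vowel": 1.0, "nasal": 0.9, "fricative": 0.5, "plosive": 0.3}


def contribution_indicators(phoneme_classes, decay=0.5):
    """Return one weight per frame.

    Assumption of this sketch: each earlier instance of the same class
    multiplies the weight by `decay`, so repeated classes contribute less.
    Unknown classes receive a weight of 0.0.
    """
    seen = Counter()
    weights = []
    for cls in phoneme_classes:
        weights.append(BASE_WEIGHT.get(cls, 0.0) * (decay ** seen[cls]))
        seen[cls] += 1
    return weights


weights = contribution_indicators(["vowel", "vowel", "plosive", "vowel"])
print(weights)
```

A different rule, for example one that increases the weight of rarely seen classes, would fit the same interface.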
The at least one characteristic of the speech sound may comprise at least one characteristic of one or more formants in the speech sound. The characteristic may comprise an indication of at least one formant peak and/or an indication of at least one formant null.
The assessment module may be configured to receive an indication of acoustic environment in which the speech sound was uttered by the user. The contribution indicator may also be based on the indication of acoustic environment. The indication of acoustic environment may comprise an indication of noise in the audio data. The indication of noise may comprise an indication of at least one of: noise amplitude level; noise frequency and/or spectrum; noise level relative to signal level for sounds vocalized by the user.
In some embodiments the at least one characteristic of the speech sound comprises identification of the speech sound as one of a plurality of predefined categories of phonemes and, for at least one of the predefined categories of phonemes, the assessment module applies a transfer function between a value of contribution indicator and noise level.
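One possible form of such a transfer function, given purely as an assumption for illustration, is a piecewise-linear mapping from noise level to a weight between 0 and 1, with a separate mapping per phoneme category. The category names, the knee points in decibels and the helper names `make_transfer_function` and `contribution` are hypothetical.

```python
def make_transfer_function(full_weight_below_db, zero_weight_above_db):
    """Return f(noise_db) -> weight in [0, 1], linear between two knee points."""
    span = zero_weight_above_db - full_weight_below_db

    def transfer(noise_db):
        if noise_db <= full_weight_below_db:
            return 1.0  # noise low enough: use the frame fully
        if noise_db >= zero_weight_above_db:
            return 0.0  # noise too high: do not use the frame
        return (zero_weight_above_db - noise_db) / span

    return transfer


# Hypothetical per-category knee points (illustrative values only).
TRANSFER = {
    "vowel": make_transfer_function(40.0, 80.0),
    "fricative": make_transfer_function(30.0, 50.0),
}


def contribution(phoneme_category, noise_db):
    """Look up the category's transfer function and evaluate it."""
    return TRANSFER[phoneme_category](noise_db)


print(contribution("vowel", 60.0), contribution("fricative", 60.0))
```

In this sketch a category assumed to be more robust to noise keeps a non-zero weight at noise levels where a more fragile category has already been excluded.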
The analyzer may be configured to analyze the audio data to determine said indication of noise. The analyzer may be configured to identify frames of the audio signal that do not correspond to sounds vocalized by the user and to determine the indication of noise from such frames.
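A minimal sketch of deriving a noise indication from the non-vocalized frames is given below. It assumes that a per-frame speech/non-speech flag is already available from some voice activity detector, which is not shown, and it uses the mean RMS amplitude of the non-speech frames as the noise indication; both the flag source and the choice of RMS are assumptions of this example.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_noise_rms(frames, is_speech):
    """Mean RMS over frames flagged as non-speech, or None if there are none.

    `is_speech` is a hypothetical per-frame flag from a separate detector.
    """
    noise = [frame_rms(f) for f, speech in zip(frames, is_speech) if not speech]
    if not noise:
        return None
    return sum(noise) / len(noise)


# Illustrative frames: the middle frame is flagged as speech and is ignored.
frames = [[3.0, -3.0], [10.0, 10.0], [1.0, -1.0]]
print(estimate_noise_rms(frames, [False, True, False]))
```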
In some embodiments the assessment module is configured such that if the indication of noise is above a first threshold level, then the assessment module indicates that no frames of audio data should be used for speaker recognition processing.
In some embodiments the indication of acoustic environment comprises an indication of reverberation in the audio data. The analyzer may be configured to analyze the audio data to determine the indication of reverberation.
In some embodiments the assessment module is configured to receive an indication of a parameter of an acoustic channel for generating the audio data and the contribution indicator is also based on said indication of the parameter of the acoustic channel. The indication of a parameter of the acoustic channel may comprise an indication of a parameter of a microphone used to receive the speech sound uttered by a user. The parameter of a microphone may comprise a microphone resonance. The indication of a parameter of the acoustic channel may comprise an indication of bandwidth of the audio channel.
In some embodiments the assessment module is configured to receive an indication of a speech characteristic derived from speech sounds previously uttered by the user and wherein the contribution indicator is also based on the indication of the speech characteristic. The indication of the speech characteristic may comprise an indication of the pitch of the user and/or an indication of the nasality of the user.
In some embodiments the assessment module is configured to receive an indication of at least one enrolled user profile and wherein the contribution indicator is also based on said indication of the enrolled user profile. The indication of at least one enrolled user profile may comprise an indication of a user profile most relevant for the speaker recognition processing. The indication of a user profile most relevant for the speaker recognition processing may be derived from the speaker recognition processing.
In some embodiments the assessment module is configured such that the contribution indicator for a frame of audio data is based on the determined at least one characteristic of the speech sound and on the number of previous frames of audio data where the determined at least one characteristic was similar.
The speaker recognition module may be operable in a verification mode to process said frames of audio data to determine one or more features of speech sounds of said frames of data and to compare said one or more features with at least one user model for an enrolled user to determine a confidence level indicative of whether or not the current speaker is that enrolled user. The speaker recognition module may be configured to determine, for a plurality of frames of the audio data, a frame confidence score indicative of a degree of matching between that frame of audio data and the at least one user model, and to combine a plurality of frame confidence scores to determine the confidence level, wherein the combination of frame confidence scores is based on the contribution indicators for the relevant frames. The speaker recognition module may be operable to not process some frames of data to generate a frame confidence score and/or to omit the frame confidence score for at least some frames of audio data from the combination to form the confidence level, based on the contribution indicator for said frames of audio data. Additionally or alternatively, the speaker recognition module may be operable to apply a weighting to at least some of the frame confidence scores based on the contribution indicator for said frames of audio data.
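The combination of frame confidence scores described above may be sketched, under stated assumptions, as a weighted mean in which the contribution indicator of each frame serves as its weight and frames at or below a minimum weight are omitted. The function name `combine_frame_scores`, the `min_weight` parameter and the use of a weighted mean rather than some other combination are assumptions of this example.

```python
def combine_frame_scores(frame_scores, contribution_indicators, min_weight=0.0):
    """Weighted mean of frame confidence scores.

    Frames whose contribution indicator is <= min_weight are omitted.
    Returns None when no frame contributes.
    """
    total = 0.0
    weight_sum = 0.0
    for score, weight in zip(frame_scores, contribution_indicators):
        if weight <= min_weight:
            continue  # frame excluded from the combination
        total += weight * score
        weight_sum += weight
    if weight_sum == 0.0:
        return None
    return total / weight_sum


# Hypothetical scores and indicators: the second frame is omitted entirely.
confidence = combine_frame_scores([0.5, 0.25, 1.0], [1.0, 0.0, 3.0])
print(confidence)
```

Setting every indicator to either 0 or 1 reduces this to pure frame selection, while intermediate values give the weighting variant.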
The speaker recognition module may be operable in an enrollment mode to process said audio signal to form a user model for an enrolling user. The speaker recognition module may be operable to not process some frames of data to form said user model based on the contribution indicator for said frames of audio data.
The apparatus may further comprise a speech recognition module configured to analyze said frames of audio data.
The apparatus may have a microphone for generating an audio signal corresponding to speech sounds uttered by the user.
The apparatus may be implemented as an integrated circuit.
Embodiments also relate to electronic devices comprising an apparatus as described by any of the variants outlined above. The electronic device may be at least one of: a portable device; a communication device; a mobile telephone; a computing device; a laptop, notebook or tablet computer; a gaming device; a wearable device; a voice controllable device; an identity verification device; or a domestic appliance.
Embodiments also relate to apparatus for use in biometric speaker recognition comprising:
an assessment module for determining for a sequence of frames of audio data which correspond to speech sounds uttered by a user a contribution indicator of the extent to which a frame of audio data should be used for speaker recognition processing based on at least one characteristic of the speech sound to which the frame relates.
Embodiments also relate to a method of speaker recognition, comprising: analyzing each frame of a sequence of frames of audio data which correspond to speech sounds uttered by a user to determine at least one characteristic of the speech sound of that frame; and
determining, for each frame of audio data, a contribution indicator of the extent to which that frame of audio data should be used for speaker recognition processing based on the determined at least one characteristic of the speech sound.
Embodiments also relate to a non-transitory computer-readable storage medium having machine readable instructions stored thereon that when executed by a processor, cause the processor to perform the method as described. Aspects also relate to an apparatus comprising a processor and such a non-transitory computer-readable storage medium.