1. Field of the Invention
The invention concerns a method for the voice recognition of a speaker using a predictive model.
It is more particularly concerned with a method for the voice recognition of a speaker using a vector autoregressive type predictive model.
The method applies equally well to identifying a speaker and to identifying changes of speakers.
It finds applications in many fields and more particularly in access control. Access control is effected by verifying one or more utterances of a speaker.
It finds one particular, although not exclusive, application in the following two fields: access authorization using a sound lock, and authorization of access to confidential information (validation of financial operations and/or transactions, secure access to remote information services, etc.).
2. Description of the Related Art
The prior art methods usually include a learning mode and a verification mode. They include some or all of the following phases and steps: identity declaration and service request steps (phase 1), steps authorizing learning of one or more utterances for a speaker (phase 2), steps authorizing verification of one or more utterances for a speaker (phase 3), steps of extracting statistical characteristics of one or more utterances (phase 4), steps of calculating the predictive model from statistical characteristics of one or more utterances (phase 5), steps of verification of the identity from the utterance (phase 6), steps of identifying the speaker from the utterance (phase 7), steps of authorizing access to all or some of the resources requested in the first phase (phase 8) and steps of updating the dictionary of statistical characteristics and the predictive model corresponding to some or all of the utterances of a speaker (phase 9).
The first phase enables the speaker to declare an identity and to request an operating mode (learning or verification) with the aim of accessing one or more resources.
The second phase enables a speaker to use the learning mode of the access device.
The third phase enables each speaker to use the verification mode of the access device.
The fourth phase includes a step of digital acquisition and filtering of one or more utterances, a step of extracting vectors of size p, and a step of calculating q+1 correlation matrices of size p×p from some or all of the calculated vectors. The q+1 matrices constitute the statistical characteristics of the utterance of the speaker.
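The last step of the fourth phase can be illustrated by a minimal sketch. It assumes that the correlation matrix of lag m is the average, over the sequence, of the outer product of the vector at window t+m with the vector at window t; the source does not state the exact estimator, and the function name correlation_matrices and the sample vectors are hypothetical.

```python
def correlation_matrices(vectors, q):
    """Return q+1 lagged correlation matrices R_0..R_q, each of size p x p.

    Assumption (not from the source): R_m[i][j] is the average over t of
    vectors[t + m][i] * vectors[t][j].
    """
    n = len(vectors)
    p = len(vectors[0])
    if n <= q:
        raise ValueError("need more than q vectors to estimate q+1 lags")
    matrices = []
    for m in range(q + 1):
        count = n - m  # number of vector pairs available at lag m
        r = [[0.0] * p for _ in range(p)]
        for t in range(count):
            later, earlier = vectors[t + m], vectors[t]
            for i in range(p):
                for j in range(p):
                    r[i][j] += later[i] * earlier[j]
        matrices.append([[value / count for value in row] for row in r])
    return matrices


# Hypothetical sequence of four vectors of size p = 2, with q = 1.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
R = correlation_matrices(vectors, q=1)
```

With p = 2 and q = 1 the sketch returns two 2×2 matrices, which together play the role of the statistical characteristics of the utterance.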
The fifth phase includes a step of calculating q prediction matrices of size p×p from the correlation matrices and a step of calculating the inverse of the associated error matrix. These q+1 matrices constitute the predictive model of the utterance of the speaker. The references of the utterance of a speaker comprise the statistical characteristics and the associated predictive model.
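One common way of writing the relation between the correlation matrices and the prediction matrices of a vector autoregressive model is given below as a sketch. The notation is an assumption and is not taken from the source: x_t denotes the vector of size p for window t, A_1 to A_q the prediction matrices, e_t the prediction error and \Sigma_e the error matrix; sign and transposition conventions vary between authors.

```latex
% Assumed notation (not from the source): x_t, A_k, e_t, R_m, \Sigma_e.
\begin{align}
  x_t &= \sum_{k=1}^{q} A_k \, x_{t-k} + e_t \\
  R_m &= \mathrm{E}\!\left[ x_t \, x_{t-m}^{\mathsf{T}} \right],
         \qquad R_{-m} = R_m^{\mathsf{T}}, \qquad m = 0, \dots, q \\
  R_m &= \sum_{k=1}^{q} A_k \, R_{m-k}, \qquad m = 1, \dots, q \\
  \Sigma_e &= R_0 - \sum_{k=1}^{q} A_k \, R_k^{\mathsf{T}}
\end{align}
```

Under these assumptions, the third line is a system of q matrix equations that yields the q prediction matrices from the q+1 correlation matrices, and the fourth line gives the error matrix whose inverse completes the predictive model.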
The sixth phase includes a step of calculating measured resemblances between the statistical characteristics of one or more utterances and some or all of the utterance references from the dictionary and a step of calculating the probability of identity verification.
The seventh phase includes a step of calculating measured resemblances between statistical characteristics of one or more utterances and some or all of the references from the dictionary, a step of searching for the references nearest the utterance and a step of calculating the probability of the identification of the speaker.
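The search for the nearest references can be sketched as follows. The source does not define the resemblance measure at this point, so the sketch substitutes a plain Frobenius distance between two p×p matrices as a stand-in; the names frobenius_distance and nearest_reference, the dictionary layout and the numerical values are all hypothetical, and the calculation of the identification probability is not shown.

```python
import math


def frobenius_distance(a, b):
    """Stand-in resemblance measure (assumption): Frobenius distance
    between two matrices of the same size."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def nearest_reference(utterance_stats, dictionary):
    """Return (speaker, distance) for the reference nearest the utterance."""
    scored = (
        (speaker, frobenius_distance(utterance_stats, reference))
        for speaker, reference in dictionary.items()
    )
    return min(scored, key=lambda pair: pair[1])


# Hypothetical dictionary holding one 2 x 2 reference matrix per speaker.
dictionary = {
    "speaker_a": [[1.0, 0.0], [0.0, 1.0]],
    "speaker_b": [[2.0, 0.5], [0.5, 2.0]],
}
utterance = [[1.1, 0.0], [0.0, 0.9]]
best_speaker, best_distance = nearest_reference(utterance, dictionary)
```

In the sixth phase the same measure would be compared only against the references of the declared identity rather than searched over the whole dictionary.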
The eighth phase authorizes access to some or all of the resources requested in the first phase.
The ninth phase is used to update the references of the utterance of a speaker in the dictionary or to add references of a new speaker to the dictionary.
Automatic verification of the speaker consists in verifying the identity of a person in accordance with a voice sample. Two decisions are possible, in accordance with a binary scheme: "authentication" or "non-authentication of identity".
Of the many prior art documents relating to speaker verification methods, the article by Claude Montacié and Jean-Luc Le Floch: "Discriminant AR-Vector Models for Free-Text Speaker Verification" published in "Congrès EuroSpeech 1993", pages 161-164, may be cited as one non-exhaustive example. The article discloses a method for automatically verifying the speaker but does not explain the conditions for extracting parameters for obtaining a system for automatic representation of the speaker that performs well, is fast and works in a noisy environment.
As already indicated, the above-mentioned methods apply equally to identifying a speaker and to detecting changes of speakers, so it is necessary to take account of the physiological characteristics of the human voice, among other factors. In particular, the fundamental periods of complex voice signals correspond to frequencies of around 100 Hz when a man is speaking and around 200 Hz when a woman is speaking. Time windows defined hereinafter are used during the fourth phase mentioned above. In the speech processing art it is accepted that the time windows must be larger than the aforementioned fundamental period. In other words, the analysis applies to a period greater than the fundamental period. As a result the windows usually employed are typically in the range from 15 ms to 40 ms. Trials have shown that performance begins to drop off if this time interval is reduced.
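The orders of magnitude quoted above can be checked with a short calculation, given here as an illustration only. It converts the two fundamental frequencies into periods and counts how many fundamental periods fit into the shortest and longest conventional windows; the function name fundamental_period_ms is hypothetical.

```python
def fundamental_period_ms(frequency_hz):
    """Period in milliseconds of a signal of the given frequency."""
    return 1000.0 / frequency_hz


male_period = fundamental_period_ms(100.0)    # around 10 ms
female_period = fundamental_period_ms(200.0)  # around 5 ms

# Number of fundamental periods spanned by the conventional 15 ms and 40 ms
# windows, as (male, female) pairs.
periods_spanned = {
    window_ms: (window_ms / male_period, window_ms / female_period)
    for window_ms in (15.0, 40.0)
}
```

Even the shortest conventional window therefore covers one and a half fundamental periods of a male voice, which is consistent with the accepted rule that the window must exceed the fundamental period.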
Also, a plurality of overlapping windows is usually employed. It is also accepted that the spacing between two consecutive windows, defined as the time period between the centers of the windows, must be of the order of 10 ms or more.
Surprisingly, it has been found that adopting values much lower than the aforementioned values improves performance.
To be more precise, in accordance with the invention, the duration of the window must be less than 10 ms.
The fundamental period being around 5 ms for women and 10 ms for men, a window equal to the average fundamental period (i.e. 7.5 ms) is preferably chosen.
Similarly, the window spacing chosen is less than 4.5 ms.
Values much lower than this value are preferably chosen, for example 2 ms.
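The effect of these values on the sequence of analysis windows can be illustrated by the following sketch. The sampling rate of 8000 Hz, the one-second sample, the conventional setting of 25 ms windows spaced 10 ms apart and the function name frame_signal are assumptions chosen for the example and are not specified in the source.

```python
def frame_signal(samples, sample_rate_hz, window_ms, spacing_ms):
    """Cut a sample sequence into overlapping windows of fixed duration.

    The spacing is the time between the starts (equivalently, the centers)
    of two consecutive windows.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    hop = int(round(sample_rate_hz * spacing_ms / 1000.0))
    if window_len < 1 or hop < 1:
        raise ValueError("window and spacing must each cover at least one sample")
    last_start = len(samples) - window_len
    return [samples[start:start + window_len] for start in range(0, last_start + 1, hop)]


SAMPLE_RATE_HZ = 8000            # assumed sampling rate
samples = [0.0] * SAMPLE_RATE_HZ  # one second of placeholder samples

# Average of the 5 ms and 10 ms fundamental periods.
average_period_ms = (5.0 + 10.0) / 2.0

short_windows = frame_signal(samples, SAMPLE_RATE_HZ, average_period_ms, 2.0)
conventional_windows = frame_signal(samples, SAMPLE_RATE_HZ, 25.0, 10.0)
```

Under these assumptions, one second of speech yields 497 windows of 60 samples each with the short setting, against 98 windows of 200 samples each with the conventional setting, so roughly five times as many vectors are available for estimating the correlation matrices.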
The present invention therefore concerns a method for the voice recognition of a speaker using a predictive model which offers improved performance while retaining the technical characteristics and the advantages of the prior art methods.
The method in accordance with the invention is particularly suited to predictive models of the vector autoregressive type.
The invention therefore consists in a method for the voice recognition of a speaker using a q-order predictive model comprising at least one phase of extracting statistical characteristics including at least one step of digital acquisition of a voice sample of particular duration D of the speaker corresponding to at least one utterance of the speaker, a step of converting said voice sample into a sequence of vectors of particular size p obtained from a sequence of analysis windows of average size T and with average spacing I, and a step of determining q+1 correlation matrices from this sequence of vectors, where p and q are non-zero integers, characterized in that said average size T has a duration less than 10 ms.
It also consists in the application of a method of the above kind to identifying a speaker or to detecting changes of speakers.
It further consists in the application of the above method to access control using a sound lock.
It finally consists in the application of the above method to controlling access to confidential information, in particular validation of financial operations and/or transactions and secure access to remote information services.