Large organizations, such as commercial organizations, financial institutions, government agencies or public safety organizations, conduct communication sessions, also known as interactions, with individuals such as customers, suppliers and the like on a daily basis.
Communication sessions between parties may involve exchanging sensitive information, for example financial data, transactions or personal medical data. Thus, in communication sessions with individuals it may be necessary to authenticate the individual, i.e., to ensure that the individual really is who he or she claims to be. Authentication may include checking that identification details provided by an individual match identification details held on record for that individual. Authentication may be required, for example, before offering an individual any information or services. When a communication session begins, a system or agent acting on behalf of one party may first identify the individual. Some organizations use voice prints to authenticate the identity of individuals.
The term “voice print” as used herein is intended to encompass voice biometric data. Voice prints are also known by various other names including but not limited to spectrograms, spectral waterfalls, sonograms, and voicegrams. Voice prints may take many forms and may indicate both physical and behavioral characteristics of an individual. One type of voice print is in the form of time-varying spectral representations of sounds or voices. Voice prints may be in digital form and may be created from any digital audio recordings of voices, for example but not limited to audio recordings of communication sessions between call center agents and customers. A voice print can be generated in many ways known to those skilled in the art, including but not limited to applying a short-time Fourier transform (STFT) to various (preferably overlapping) audio streams of a particular voice, such as an audio recording. For example, each stream may be a segment or fraction of a complete communication session or corresponding recording. A three-dimensional image of the voice print may present measurements of magnitude versus frequency for each specific moment in time.
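By way of illustration only, the STFT-based approach mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular voice print system: the function names, the Hann window, the frame size of 64 samples, the 50% overlap and the synthetic 1 kHz tone standing in for recorded speech are all hypothetical choices made for the example. A naive discrete Fourier transform is used so that the sketch needs only the Python standard library.

```python
import cmath
import math


def hann_window(size):
    """Return a periodic Hann window of the given length (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]


def dft_magnitudes(frame):
    """Return magnitudes of the non-negative frequency bins of a naive DFT."""
    size = len(frame)
    magnitudes = []
    for k in range(size // 2 + 1):
        acc = 0j
        for n, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * math.pi * k * n / size)
        magnitudes.append(abs(acc))
    return magnitudes


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Split samples into overlapping frames and return one spectrum per frame.

    A hop smaller than frame_size makes consecutive frames overlap
    (50% overlap with the illustrative defaults).
    """
    window = hann_window(frame_size)
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrogram.append(dft_magnitudes(frame))
    return spectrogram


# Synthetic stand-in for recorded speech: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
tone = [math.sin(2.0 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]

# spec[i][k] is the magnitude at frequency bin k during frame i.
spec = stft_magnitudes(tone, frame_size=FRAME_SIZE, hop=32)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
```

Each row of `spec` gives magnitude versus frequency for one moment in time, so stacking the rows yields the time-varying spectral representation described above. A practical system would typically use a fast Fourier transform and real speech recordings rather than the naive transform and synthetic tone assumed here.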
Some speakers' voices may be extremely difficult to forge for biometric comparison purposes, since a myriad of qualities may be measured, ranging from dialect and speaking style to pitch, spectral magnitudes, and formant frequencies. For some individuals, the vibration of the vocal cords and the patterns created by the physical components that produce human speech are as distinctive as fingerprints.
It should be noted that known methods for the generation of voice prints do not depend on what words are spoken by the individual for whom the voice print is being created. They simply require a sample of speech of an individual from which to generate the voice print. As such those methods may be said to be “text-independent”. The larger the sample, the more information may be included in the voice print and the more reliable the voice print will be in authenticating an individual.
Voice prints have been used to authenticate individuals in some kinds of communication session between individuals and service providers. Many known techniques for such authentication require some kind of activity on the part of the individual such as visiting a website or calling a call center to facilitate the creation of the voice print. This requirement has hindered the take-up of voice print technology for user authentication.
Some kinds of communication session use so-called “self-service” channels in which an individual interacts with a machine to conduct a transaction. Some examples of such self-service channels use an interactive voice response (“IVR”) system in which a user speaks and the system responds with speech. Others simply prompt a user to utter some speech, for example using an instruction in text form. Using a self-service channel, a complete transaction may be concluded between an individual and another party with no human intervention on the part of the other party. Accurate authentication can be particularly important in such situations. Hitherto it has been considered that the use of voice prints, and particularly text-independent voice prints, is typically not suitable for authenticating users of self-service channels because, for example, only short bursts of speech, e.g., in the range of 3-5 seconds, may be obtained from the user, and experience has shown that the use of such short bursts of speech typically does not lead to adequate performance.