1. Field
The field relates to speaker recognition by voice, particularly to automatic, automated, and expert methods for identification of a speaker by phonograms of spontaneous oral speech, the methods being intended for, among other uses, forensic proceedings.
Practical investigation, particularly forensic processing and comparison of phonograms of oral speech aimed at identification of a speaker, is known in some cases to face obstacles to expert evaluation, such as short duration and low quality of the phonograms examined, different psycho-physiological states of the speakers in the phonograms compared, different contents and languages of the speech, and different types and levels of audio channel noise and distortion, all of which make it difficult to reach a decision.
2. Description of the Related Technology
A method is known for identification of a speaker by phonograms, wherein the speaker's characteristics are extracted from certain uniform phrases spoken thereby (DE 2431458, IPC G 10 L 1/04, May 2, 1976).
This method comprises filtering speech signals through a comb of 24 bandpass filters, detecting the signals, smoothing the signals, and inputting the signals through an analog-digital converter and a switch into a digital processing device, wherein individualizing features associated with the integral speech spectrum are automatically recognized and stored.
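The filter-bank analysis described above can be sketched as follows. This is a minimal illustration, not an implementation of DE 2431458: the band edges, filter order, and smoothing window are assumptions chosen for the example, and the patent's analog comb of 24 bandpass filters is approximated digitally.

```python
# Hypothetical sketch: a comb of 24 bandpass filters, detection
# (rectification), and smoothing, yielding one energy contour per band.
# Band spacing, filter order, and smoothing window are assumptions.
import numpy as np
from scipy.signal import butter, lfilter

def filterbank_energies(signal, fs, n_bands=24, f_lo=100.0, f_hi=7000.0):
    """Return one smoothed energy contour per band: shape (n_bands, len(signal))."""
    # Logarithmically spaced band edges over the speech range (assumption).
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    contours = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = lfilter(b, a, signal)   # bandpass filtering
        detected = np.abs(band)        # detection (full-wave rectification)
        win = max(1, int(0.02 * fs))   # ~20 ms smoothing window (assumption)
        smoothed = np.convolve(detected, np.ones(win) / win, mode="same")
        contours.append(smoothed)
    return np.array(contours)

# Example on a synthetic 1 kHz tone: the energy concentrates in the
# bands whose passbands lie around 1000 Hz.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)
contours = filterbank_energies(tone, fs)
print(contours.shape)
```

The smoothed per-band contours are the raw material from which integral spectral features, such as those stored by the digital processing device, could be derived.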
Unfortunately, this method cannot be applied to phonograms of oral speech obtained in an environment of excessive distortion and noise, because under such conditions it fails to provide a sufficient number of individualizing features. Furthermore, the method has not proved sufficiently reliable for identification because it requires phonograms with identical verbal content for both the verified speaker and the unknown speaker.
A method is known for identification of an individual by phonograms, wherein a voice input from the phonograms to be compared is matched against a previously stored voice signature of that individual by singling out and comparing identical keywords from the recordings under analysis (U.S. Pat. No. 3,466,394, IPC H04M1/24).
This method involves subjecting a speech signal to short-term spectral analysis and then extracting the time-dependent contours of the spectral peculiarities and of the fundamental voice tone. The resulting contours are regarded as individualizing features. Identification of a speaker is based on a comparison of the contours obtained from the phonograms of the inspected and unknown speakers.
The weak point of this method is the dependence of the recognition result on the quality of phonograms made in an environment of excessive distortion and noise. Moreover, the method has a high percentage of identification failures because it requires phonograms of the inspected and unknown speakers containing the same words.
A method is known for identification of a speaker based on spectral-band-temporal analysis of spontaneous oral speech (G. S. Ramishvili, G. B. Chikoidze Forensic processing of phonograms and identification of a speaker. Tbilisi: “Mezniereba”, 1991, p. 265).
To eliminate the dependence of the identification results on the speech semantics, sonorant speech elements are singled out of the verbal message, and their energy values are averaged over their duration in each of 24 spectral filters covering the region of the higher formants. Recognition of the fundamental tone is based on singling out the fundamental component of the signal spectrum. The speech rate is determined as well.
The parameters aforesaid are used as individualizing features.
This method fails for phonograms made in an environment of excessive distortion and noise in the speech recording channel, or with the speakers in different states, because the individualizing feature set loses its validity.
A device and method are known for speaker recognition based on constructing and comparing purely statistical models of cepstral speech-signal features of known and unknown speakers (U.S. Pat. No. 6,411,930, IPC G10L15/08). Speaker recognition is performed by using discriminative Gaussian mixture models.
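The generic Gaussian-mixture approach underlying this class of methods can be sketched as below. Note the hedge: U.S. Pat. No. 6,411,930 uses *discriminatively* trained mixtures, which this sketch does not reproduce; here each speaker model is trained generatively on synthetic cepstral-like vectors purely for illustration.

```python
# Minimal sketch of GMM-based speaker recognition over cepstral-like
# feature vectors. All data are synthetic; the discriminative training
# of the cited patent is not reproduced here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "cepstral" training features for two known speakers.
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(500, 12))
speaker_b = rng.normal(loc=2.0, scale=1.0, size=(500, 12))

# One mixture model per known speaker.
gmm_a = GaussianMixture(n_components=4, random_state=0).fit(speaker_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(speaker_b)

# An unknown utterance is scored against every model; the highest
# average log-likelihood wins.
unknown = rng.normal(loc=2.0, scale=1.0, size=(200, 12))  # drawn like speaker B
scores = {"A": gmm_a.score(unknown), "B": gmm_b.score(unknown)}
best = max(scores, key=scores.get)
print(best)
```

Because the decision rests entirely on likelihoods of short-term feature vectors, very short utterances or mismatched channels shift the scores, which is the weakness noted in the following paragraph.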
This method, like most purely statistical approaches to speaker recognition, fails for very short (1 to 10 seconds) voice messages, as well as in situations where the speakers' states and/or the phonogram channels have strongly different properties, or the speakers are in different emotional states.
A method is known for speaker recognition by using only stochastic characteristics (U.S. Pat. No. 5,995,927, IPC G10L9/00).
Speaker recognition is performed by constructing and comparing feature description covariance matrices of an input speech signal and reference speech signals of known speakers.
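A hedged sketch of covariance-based comparison follows: each recording is reduced to the covariance matrix of its feature vectors, and two recordings are compared with a symmetric matrix distance. The specific distance used in U.S. Pat. No. 5,995,927 may differ; the one below, tr(AB⁻¹) + tr(BA⁻¹) − 2n, is a standard choice shown only for illustration, and the data are synthetic.

```python
# Covariance-matrix comparison of two feature sequences (illustrative
# distance, not necessarily the one in the cited patent).
import numpy as np

def cov_distance(feats_1, feats_2):
    """Symmetric distance between the covariance matrices of two feature sets."""
    a = np.cov(feats_1, rowvar=False)
    b = np.cov(feats_2, rowvar=False)
    n = a.shape[0]
    return float(np.trace(a @ np.linalg.inv(b))
                 + np.trace(b @ np.linalg.inv(a)) - 2 * n)

rng = np.random.default_rng(1)
# Two recordings drawn from the same feature distribution ("same speaker")...
same_1 = rng.normal(size=(400, 10))
same_2 = rng.normal(size=(400, 10))
# ...and one from a differently scaled distribution ("different speaker").
other = rng.normal(scale=2.0, size=(400, 10))

print(cov_distance(same_1, same_2) < cov_distance(same_1, other))
```

The distance is zero for identical covariance matrices and grows with their mismatch, which is why a power drop in part of the frequency range (noise, poor microphones) directly corrupts the comparison, as the next paragraph observes.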
This method likewise cannot be used for short (5 seconds or less) voice messages, and it is very sensitive to significant reductions of signal power in particular parts of the speech frequency range caused by ambient noise as well as by poor quality of the microphones and of the sound transmission and recording channels.
A method is known for recognition of isolated words, adaptable to a speaker (RU 2047912, IPC G10L7/06). The method is based on sampling the input speech signal, pre-emphasis, successive segmentation of the speech signal, coding the segments with discrete elements, calculating the energy spectrum, measuring formant frequencies and determining amplitudes and energies in different frequency bands of the speech signal, classifying articulatory events and states, defining and grading word standards, calculating the distances between the word standards and the actualization of the word being recognized, recognizing the word or indicating failure, and supplementing the standard dictionary in the course of adaptation to the speaker. The input speech signal is pre-emphasized in the time domain by differentiation with smoothing, and the quantization of the energy spectrum depends on the channel noise variance. A formant frequency is determined by finding the global maximum of the logarithmic spectrum and subtracting a given frequency-dependent function from that spectrum. The classification of articulatory events and states determines the proportion of periodic and noise excitation sources by comparing the autocorrelation coefficients of a square-wave pulse sequence in multiple frequency bands to a threshold value. The beginning and end of articulatory movements, and of the acoustic processes corresponding to them, are determined against a threshold value of a likelihood function computed from the autocorrelation coefficients, the formant frequencies, and the energies in the given frequency bands. The speech signal is divided into intervals between the beginnings and ends of the acoustic processes corresponding to specific articulatory movements and is processed sequentially, starting with vowels; a segment is recognized only when the transition types at its left and right boundaries match each other, and segmentation is finished when pauses between words have been recognized in the left and right time segments.
Word standards are shaped as matrices of binary feature likelihood values, and recognition fails when the difference between the normalized distances from the unknown actualization to the two nearest standards belonging to different words is smaller than a set threshold value.
The disadvantage of this known method of speaker-adapted isolated word recognition is its poor discriminating power when recognizing speakers by spontaneous speech, since in most cases it does not distinguish between speakers of the same sex delivering a verbal message with the same content.
A security system is known based on voice recognition (U.S. Pat. No. 5,265,191, IPC G10L005/00), which requires both the trainer and the unknown speaker to repeat at least one voice message. The system compares parametric representations of repeated voice messages made by the unknown and the known speaker and establishes the identity of the speakers compared only if every message pronounced by the unknown speaker is close enough to that made by the trainer, indicating failure if their representations strongly differ from each other.
The weak point of this system is its poor resistance to variable noise (vehicle, street, and industrial noise), as well as the mandatory requirement for both speakers to pronounce one and the same voice message.
A method is known for automatic identification of a speaker by the peculiarities of password phrase pronunciation (RU 2161826, IPC G10L17/00). It involves breaking the speech signal into voiced zones and defining time intervals within those zones at the maxima of the speech signal intensity, as well as at the beginning of the first and at the end of the last voiced zones. Speech signal parameters are computed over the defined time intervals and compared to standards, taking into account their mathematical expectations and the acceptable repeatability error. For this purpose, additional time intervals are defined at the end of the first and at the beginning of the last voiced zones, as well as at the beginning and at the end of the others; the duration of the time intervals is set as a multiple of the fundamental tone period of the speech signal; the correlation coefficients of the speech signal parameters are determined and included among the values compared with the standards; and the formation of additional standards takes these correlation coefficients into account. Identification of a speaker is based on the speech signal parameters and the corresponding statistical characteristics.
The disadvantage of this known method is its poor noise resistance, since it requires determining the exact positions of the boundaries of the fundamental voice tone periods in the input speech signal, which is often hardly possible under acoustic and electromagnetic interference (office and street noise, speech channel settings, etc.). Besides, the speakers have to pronounce the same voice passwords, which cannot always be achieved in practice.
A speaker verifier based on the “nearest neighbor” distance measurement (U.S. Pat. No. 5,339,385, IPC G10L9/00) is known, comprising a display, a random hint generator, a speech recognition unit, a speaker verifier unit, a keyboard, and a primary signal processor. The input of the primary signal processor is the input of the verifier, and its output is connected to the first inputs of the speech recognition unit and the speaker verifier unit; the first output of the hint generator is connected to the second input of the speech recognition unit, whose output is connected to the display. The keyboard is connected to the third inputs of the speech recognition unit and the speaker verifier unit, whose output is the output of the verifier. To establish the similarity or difference of the voice passwords pronounced, the speaker verifier breaks the input speech signal into individual analysis frames, calculates non-parametric speech vectors for each analysis frame, and then determines the proximity of the resulting descriptions of the pronunciations compared on the basis of the Euclidean distance to the nearest neighbor.
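The frame-wise nearest-neighbor comparison described above can be sketched as follows. The frame vectors here are synthetic stand-ins; U.S. Pat. No. 5,339,385 computes them from the speech signal itself, and the scoring shown is only the generic idea of the technique.

```python
# Minimal sketch of nearest-neighbor scoring between two pronunciations,
# each represented as a sequence of per-frame vectors (synthetic here).
import numpy as np

def nearest_neighbor_score(frames_x, frames_y):
    """Average Euclidean distance from each frame of x to its nearest frame in y."""
    # Pairwise distances between all frames of x and all frames of y.
    d = np.linalg.norm(frames_x[:, None, :] - frames_y[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

rng = np.random.default_rng(2)
enrolled = rng.normal(size=(50, 8))
same_speaker = enrolled + rng.normal(scale=0.1, size=(50, 8))  # small variation
impostor = rng.normal(loc=3.0, size=(50, 8))                   # shifted features

print(nearest_neighbor_score(enrolled, same_speaker)
      < nearest_neighbor_score(enrolled, impostor))
```

Because every frame, including noise-only frames, contributes to the score, additive office or street noise inflates the distances, which is the weakness criticized in the next paragraph.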
The disadvantage of this verifier is its poor noise resistance in office and street environments, owing to the use of non-parametric speech vectors and a Euclidean metric for determining the degree of similarity or difference of voice password pronunciations, as well as its low recognition reliability (a large share of false rejections) caused by the use of voice passwords with different word orders and by the inevitable individual variability of pronouncing the same words in different contexts, even by the same speaker. Besides, it is hardly possible to ensure pronunciation of the prompted verbal content by both speakers compared.
A method for speaker recognition (U.S. Pat. No. 6,389,392, IPC G10L17/00) is known, which involves comparing an input speech signal obtained from an unknown speaker with speech standards of previously known speakers, at least one of whom is represented by two or more standards. Successive segments of the input signal are compared to successive segments of the standards to obtain a measure of proximity between the input and standard speech signal segments. For each known speaker represented by at least two standards, a composite comparison result for the input speech signal and the standards is formed by selecting, for each segment of the input speech signal, the closest segment of the standards compared. The unknown speaker is then recognized by the composite results of comparing the input speech signal with the standards.
This known method of speaker recognition is limited in practical application, as the requirement of at least two standards for each verbal message is not always feasible in a real environment. Besides, the method does not guarantee high reliability of speaker recognition under real acoustic office, street, or vehicle noise, or with the speakers in different emotional states, because the segment-by-segment parametric description of the speech signal is strongly influenced by additive acoustic noise and the natural variability of speech. In addition, the low reliability of the method in an excessively noisy environment arises from the fact that the closest standard segment under the proximity measure employed must be found for each segment of the input speech signal, which involves a large number of pure noise segments corresponding to speech pauses in both the standard and the input speech signal.
A method for speaker recognition by phonograms of spontaneous oral speech (RU 2107950, IPC G10L5/06) is known. The method is based on spectral-band-temporal analysis of speech signals, determination of the peculiarities of an individual's speech, and comparison of those peculiarities with references. As acoustic integral features it uses parameter estimates of the statistical distribution of the current spectrum components and histograms of the fundamental tone period and frequency, measured on phonograms with both spontaneous and fixed contexts, taking those most informative for the given speaker and not influenced by the noise and distortion present in the phonograms. It also uses linguistic data (fixed or spontaneous), registered by an expert in the course of auditory analysis of the phonograms with the support of an automated bank of reference voice standards of oral speech dialects, accents, and defects.
This method loses reliability for short phonograms, or when the speakers speak different languages or are in substantially different psycho-physiological states, because it employs an integral approach that averages the speech signal characteristics and the results of linguistic analysis.
A method is known for speaker recognition (RU 2230375, IPC G10L15/00, G10L17/00), which includes segment-by-segment comparison of the input speech signal with samples of voice passwords pronounced by known speakers, and assessment of the similarity between a first phonogram of a speaker and a second, or sample, phonogram by matching formant frequencies in reference utterances of the speech signal, the utterances for comparison being selected from the first and the second recordings.
The known method identifies formant vectors of consecutive segments and statistical characteristics of the power spectrum for the input speech signal and for the speech signal standards, comparing them, respectively, to the formant vectors of successive segments of each standard and to the statistical characteristics of the standard's power spectrum, and forming a composite comparison metric for the input and standard signals. A weighted modulus of the formant vector frequency differences is used as the measure of proximity between formant segments. To calculate the composite metric comparing the input signal with a standard, each input speech signal segment is assigned the closest standard segment, under the corresponding proximity measure, having the same number of formants; the composite metric includes a weighted average of the proximity measures between the input speech signal segments and their closest standard segments over all input speech signal segments used, as well as a cross-correlation coefficient of the statistical characteristics of the input and standard power spectra. Speaker recognition is based on the outcome of comparing the composite metrics of the input speech signal and the standards.
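The segment proximity measure described above can be sketched as a weighted modulus of formant-frequency differences, together with the selection of the closest reference segment for a given input segment. The weights and formant values below are illustrative assumptions; the actual weighting in RU 2230375 may differ.

```python
# Hedged sketch: weighted absolute formant-frequency difference as a
# segment proximity measure, and closest-segment selection.
import numpy as np

def formant_proximity(f_input, f_ref, weights):
    """Weighted sum of absolute formant-frequency differences (Hz)."""
    return float(np.sum(weights * np.abs(np.asarray(f_input, dtype=float)
                                         - np.asarray(f_ref, dtype=float))))

def closest_segment(f_input, reference_segments, weights):
    """Index and distance of the reference segment nearest to the input segment."""
    dists = [formant_proximity(f_input, r, weights) for r in reference_segments]
    i = int(np.argmin(dists))
    return i, dists[i]

# Lower formants weighted more heavily (assumption, not from the patent).
w = np.array([1.0, 0.7, 0.4])

# Formant vectors (F1, F2, F3) in Hz: one input segment and three
# reference segments with the same number of formants.
inp = [700, 1200, 2500]
refs = [[710, 1180, 2550], [500, 1500, 2400], [320, 2200, 2900]]

idx, dist = closest_segment(inp, refs, w)
print(idx, dist)
```

The composite metric of the patent then averages such per-segment proximities over all segments used; as the next paragraph notes, that averaging can mask large deviations on individual segments.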
This method does not ensure reliable speaker recognition when the phonic structure of the input speech signal differs strongly from that of the speech signal samples (e.g., short messages, or different languages of the input signal and the standards), or when there are significant differences in the properties of the recording channels or in the psycho-physiological states of the speakers in the phonograms compared. These shortcomings arise, first of all, from the use of the statistical characteristics of the power spectrum, which depend on the recording channel properties, the state of the speaker, and the phonic structure of the message, as a component of the composite metric. They also arise from the use of a segmental proximity measure defined as a weighted average over all the segments of the processed speech signal that are used, which averages out segment comparison errors and underestimates the influence of large inter-segment deviations, even though such deviations reveal a difference between the speakers when only a small average segment difference is observed.