1. Field of the Invention
The present invention relates to an information detection apparatus and method, and to an information search apparatus and method. More particularly, the present invention relates to an information detection apparatus and method, and to an information search apparatus and method, for performing speaker identification and speaker searching in speech data or sound image data.
2. Description of the Related Art
In recent years, speech signals have often been digitized, coded, and then stored or used in coded form. Speech coding methods can be broadly classified into speech waveform coding, analysis/synthesis coding, and hybrid coding, which combines the two.
Here, in speech waveform coding, a speech waveform is coded so that it can be reproduced as faithfully as possible. In analysis/synthesis coding, a signal is coded by representing it with parameters on the basis of a speech production model. In particular, analysis/synthesis systems using linear predictive coding (LPC) analysis have been investigated. Examples include harmonic coding, multipulse-excited linear predictive coding (MPC) using an analysis-by-synthesis (A-b-S) method, and code excited linear prediction (CELP) coding, which uses a closed-loop search for an optimum vector.
In general, in a coding method using LPC analysis, spectral envelope information is extracted by linear predictive analysis (LPC analysis), and the LPC information is converted into PARCOR (PARtial auto-CORrelation) coefficients or LSP (Line Spectrum Pair) coefficients and then coded. Furthermore, a method has been investigated in which a determination is made for each block as to whether the speech is a voiced sound or an unvoiced sound, and in which harmonic coding is used for voiced sound and CELP coding is used for unvoiced sound. In addition, a hybrid method has also been investigated in which analysis/synthesis coding by LPC analysis is combined with speech waveform coding of the LPC residual signal.
FIG. 10 shows the overall configuration of a general speech coding apparatus using LPC analysis. In FIG. 10, a speech signal D100 input from an input section 100 is subjected to LPC analysis in an LPC analysis section 101, and an LPC coefficient D101 is determined. The LPC coefficient D101 is converted into an LSP parameter D102 in an LSP conversion section 102. The LSP parameter D102 is quantized in an LSP quantization section 103. Since the performance degradation of the LSP parameter is smaller than that of the LPC coefficient when the LSP parameter is quantized, usually, the LPC coefficient is converted into an LSP parameter and then quantized. As the quantization method of the LSP parameter, vector quantization is often used.
Meanwhile, in an inverse filter section 104, the input signal D100 is filtered using the determined LPC coefficient D101, and an LPC residual signal D104 is extracted from the input signal D100. For the coefficient used for the inverse filter, a coefficient which is inversely converted from a quantized LSP parameter into an LPC coefficient is also often used.
The LPC residual signal D104 which is determined in this manner is converted into a spectrum coefficient D105 in a spectrum conversion section 105, and quantization is performed thereon in a spectrum quantization section 106. For the quantization of the spectrum coefficient, a vector quantization method, and a method in which quantization based on an auditory psychological model, Huffman coding, etc., are combined, are often used.
A quantized LSP parameter D103, a quantized spectrum D106, and other additional information, which are determined in this manner, are sent to a bit combining section 107, where a coded bit stream D107 is generated in accordance with a specified data format and is output to an output section 108.
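The LPC analysis and inverse filtering steps of FIG. 10 can be sketched as follows. This is a minimal illustration only, assuming the autocorrelation method with the Levinson-Durbin recursion, which the text above does not specify. The function names and the sign convention (prediction of x[n] as the sum of a[k]·x[n−k]) are assumptions of the sketch; windowing, LSP conversion, and quantization are omitted.

```python
def autocorrelation(x, max_lag):
    """Return r[0..max_lag], the autocorrelation of the block x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, err), where a[k-1] is the predictor coefficient a_k and
    err is the final prediction error energy. Assumed convention:
    x[n] is predicted as sum_k a_k * x[n-k].
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break  # silent block: nothing left to predict
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection (PARCOR) coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def inverse_filter(x, a):
    """Return the LPC residual e[n] = x[n] - sum_k a_k * x[n-k]."""
    residual = []
    for n in range(len(x)):
        prediction = sum(
            a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0
        )
        residual.append(x[n] - prediction)
    return residual
```

For a block that follows a first-order recursion exactly, such as x[n] = 0.9^n, the sketch recovers a first coefficient close to 0.9, and the residual is close to zero after the first sample.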
In addition to the configuration of the speech coding apparatus shown in FIG. 10, a method has been investigated in which the pitch is extracted using an LPC residual signal, and pitch components are extracted from the LPC residual signal, thereby flattening the spectrum residual. Furthermore, a method has also been investigated in which a determination is made as to whether the speech is a voiced sound or an unvoiced sound, a harmonic is extracted from the spectrum residual signal of the voiced sound, and the harmonic is quantized.
An example of a recording format of coded speech data generated by a speech coding apparatus using LPC analysis, such as that shown in FIG. 10, is shown in FIG. 11. As shown in FIG. 11, quantized LSP information is held in coded data. This quantized LSP information can easily be converted into an LPC coefficient. Since the LPC coefficient shows spectral envelope information, it can also be considered that quantized spectral envelope information is held.
Technology for identifying a speaker in a speech signal has also been intensely investigated. This technology will be described below.
First, speaker recognition includes speaker identification and speaker verification. Speaker identification determines which speaker, from among speakers registered in advance, produced the input speech. Speaker verification confirms a claimed identity by comparing the input speech with the data of that speaker registered in advance. Furthermore, there are two types of speaker recognition: a text-dependent type, in which the words (keywords) to be spoken during recognition are determined in advance, and a text-independent type, in which arbitrary words may be spoken for recognition.
As a general speaker recognition technology, for example, the following technology is often used. First, features representing the individuality of a speech signal of a particular speaker are extracted and recorded in advance as learnt data. Identification/verification of the speaker is performed in such a way that the input speech of the speaker is analyzed to extract the features representing the individuality, and the similarity of the features with the learnt data is evaluated. Here, for the features representing the individuality of speech, a Cepstrum is often used. The Cepstrum is obtained by subjecting a logarithmic spectrum to an inverse Fourier transform, and the envelope of the speech spectrum can be represented by the coefficients of its low-order terms. Alternatively, LPC analysis is often performed on a speech signal in order to determine an LPC coefficient, and the LPC Cepstrum coefficient obtained by converting the LPC coefficient is used. The polynomial expansion coefficients of the time series of these Cepstrums or LPC Cepstrum coefficients are called “delta Cepstrums”, and these are often used as features representing the change of the speech spectrum over time. In addition, the pitch and the delta pitch (polynomial expansion coefficients of the pitch) are sometimes used.
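As an illustration of these features, the conversion from LPC coefficients to LPC Cepstrum coefficients and the calculation of a delta Cepstrum can be sketched as follows. This is a sketch under stated assumptions: the predictor convention is x[n] ≈ Σ a_k·x[n−k], the gain term c_0 is ignored, and the delta Cepstrum is taken as the first-order regression coefficient over a window of neighboring frames with edge frames repeated. The function names and the window length are hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients a_1..a_p into c_1..c_n_ceps.

    Uses the standard recursion for the all-pole model 1/A(z) with
    A(z) = 1 - sum_k a_k z^-k (assumed convention); a[k-1] holds a_k.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is not computed here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def delta_cepstrum(frames, half_window=2):
    """First-order regression coefficients of a cepstrum time series.

    frames is a list of cepstrum vectors, one per analysis block.
    Frames beyond either end are replaced by the nearest valid frame.
    """
    total = len(frames)
    denom = 2 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(total):
        d = [0.0] * len(frames[t])
        for k in range(1, half_window + 1):
            nxt = frames[min(t + k, total - 1)]
            prv = frames[max(t - k, 0)]
            for i in range(len(d)):
                d[i] += k * (nxt[i] - prv[i])
        deltas.append([v / denom for v in d])
    return deltas
```

For a single-pole model with a_1 = 0.9, the recursion gives c_n = 0.9^n / n, which matches the series expansion of the logarithm of the model's transfer function.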
Learning data is created by using the features extracted in this manner, such as LPC Cepstrums, as a standard pattern. Typical methods for this include a vector-quantization distortion method and a hidden Markov model (HMM) method.
In the vector-quantization distortion method, the features for each speaker are grouped, and the center of gravity thereof is stored in advance as an element (code vector) of a codebook. Then, the features of the input speech are subjected to vector quantization by using the codebook of each speaker, and the average quantized distortion of each codebook with respect to the entire input speech is determined.
In the case of speaker identification, the speaker of the codebook in which the average quantized distortion is smallest is selected. In the case of speaker verification, the average quantized distortion by the codebook of the corresponding speaker is compared with a threshold value in order to make personal identification.
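The vector-quantization distortion method described above can be sketched as follows. The sketch assumes that a codebook has already been trained for each speaker and that the squared Euclidean distance is used as the distortion measure; neither the distance measure nor the function names come from the text above.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def average_distortion(features, codebook):
    """Average quantization distortion of the input features for one codebook.

    Each feature vector is quantized to its nearest code vector, and the
    resulting distortions are averaged over the entire input.
    """
    total = sum(min(squared_distance(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify_speaker(features, codebooks):
    """Speaker identification: pick the codebook with the smallest distortion."""
    return min(codebooks, key=lambda s: average_distortion(features, codebooks[s]))


def verify_speaker(features, codebook, threshold):
    """Speaker verification: accept if the distortion does not exceed threshold."""
    return average_distortion(features, codebook) <= threshold


# Hypothetical two-speaker example with two-dimensional features.
codebooks = {"A": [[0.0, 0.0], [1.0, 1.0]], "B": [[5.0, 5.0], [6.0, 6.0]]}
features = [[0.1, 0.0], [0.9, 1.1]]
best = identify_speaker(features, codebooks)
```

In this toy example the input features lie near the code vectors of speaker "A", so "A" is selected, and verification against the codebook of "A" succeeds for any threshold of about 0.015 or more.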
On the other hand, in the HMM method, the features of the speaker determined in the same manner as that described above are represented by the transition probability between states of the hidden Markov model (HMM) and the appearance probability of the features in each state. The determination is made on the basis of the average likelihood with respect to the model over the entire input speech region.
Furthermore, in the case of speaker identification in which independent speakers which are not registered in advance are contained, a determination is made by a method combining the above-described speaker identification and speaker verification. That is, the closest speaker is selected as a candidate from a set of registered speakers, and the quantized distortion or the likelihood of the candidate is compared with a threshold value in order to make personal identification.
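The combined determination can be sketched as follows, assuming that an average quantized distortion has already been computed for each registered speaker. The function name and the use of None to denote an unregistered speaker are assumptions of the sketch. For the HMM method, the same structure applies with the largest likelihood in place of the smallest distortion.

```python
def identify_open_set(distortions, threshold):
    """Return the closest registered speaker, or None if no speaker is close enough.

    distortions maps each registered speaker to the average quantized
    distortion of the input speech under that speaker's codebook.
    """
    candidate = min(distortions, key=distortions.get)  # identification step
    if distortions[candidate] <= threshold:            # verification step
        return candidate
    return None
```

For example, with distortions of 0.2 for speaker "A" and 0.9 for speaker "B", a threshold of 0.5 yields "A", whereas a threshold of 0.1 rejects the input as an unregistered speaker.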
In speaker verification, or in speaker identification in which unregistered speakers are contained, the likelihood or the quantized distortion of the speaker is compared with a threshold value in order to make the determination. However, due to variations of the features over time, differences in spoken sentences, and the influence of noise, these values vary greatly between the input data and the learnt data (model) even for the same speaker. Generally, if the threshold value is set as a fixed absolute value, a sufficient recognition rate cannot be reliably obtained.
Therefore, in speaker recognition in HMMs, normalizing the likelihood is generally performed. For example, there is a method in which a log likelihood ratio LR, such as that shown in the following equation (1), is used for the determination:
LR = log L(X/Sc) − max {log L(X/Sr)}  (1)
In equation (1), L(X/Sc) is the likelihood of the verification target speaker Sc (identified person) with respect to the input speech X. L(X/Sr) is the likelihood of a speaker Sr other than the speaker Sc with respect to the input speech X. That is, a threshold value is set dynamically in accordance with the likelihood with respect to the input speech X, and the speaker recognition becomes robust with respect to differences in spoken content and variations over time.
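Equation (1) can be sketched as follows, assuming that the log likelihoods log L(X/S) of the input speech have already been computed for the target speaker and for the other speakers. The dictionary layout, the function name, and the numerical values are hypothetical.

```python
def log_likelihood_ratio(log_likelihoods, target):
    """Compute LR = log L(X/Sc) - max over Sr != Sc of log L(X/Sr).

    log_likelihoods maps each speaker to log L(X/S) for the input X;
    target is the verification target speaker Sc.
    """
    others = [v for s, v in log_likelihoods.items() if s != target]
    return log_likelihoods[target] - max(others)


# Hypothetical log likelihoods for one input utterance.
scores = {"Sc": -100.0, "S1": -120.0, "S2": -110.0}
lr = log_likelihood_ratio(scores, "Sc")
```

Here LR is 10.0. Because the best competing speaker is subtracted, a change that lowers all the likelihoods by a similar amount, such as a noisier recording, leaves LR largely unchanged, which is why a fixed threshold on LR is more robust than a fixed threshold on log L(X/Sc) itself.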
Alternatively, a method of making the determination by using a posterior probability, such as that shown in the following equation (2), has also been investigated:
L(Sc/X) = [L(X/Sc)·P(Sc)] / [Σ L(X/Sr)·P(Sr)]  (2)
Here, P(Sc) and P(Sr) are the appearance probabilities of the speakers Sc and Sr, respectively, and Σ represents the sum over all the speakers.
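Equation (2) can be sketched as follows. Because likelihoods of long utterances are typically too small to represent directly, this sketch works with log likelihoods and shifts them by their maximum before exponentiating; that numerical detail, the function name, and the example values are assumptions and are not taken from the text above.

```python
import math


def posterior_probability(log_likelihoods, priors, target):
    """Compute L(Sc/X) = L(X/Sc) P(Sc) / sum over all Sr of L(X/Sr) P(Sr).

    log_likelihoods maps each speaker to log L(X/S); priors maps each
    speaker to its appearance probability P(S).
    """
    terms = {s: log_likelihoods[s] + math.log(priors[s]) for s in log_likelihoods}
    shift = max(terms.values())  # avoids underflow in exp()
    log_denominator = shift + math.log(
        sum(math.exp(t - shift) for t in terms.values())
    )
    return math.exp(terms[target] - log_denominator)


# Hypothetical example: Sc is three times as likely as Sr, with equal priors.
p = posterior_probability(
    {"Sc": math.log(3.0) - 1000.0, "Sr": -1000.0},
    {"Sc": 0.5, "Sr": 0.5},
    "Sc",
)
```

In this example the posterior probability of Sc is 0.75, even though the individual likelihoods (on the order of e^−1000) could not be represented as ordinary floating-point values.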
These methods of likelihood normalization using HMMs are described in detail in reference [4], etc., which will be described later.
In addition to those described above, in the conventional speaker recognition technology, a method has been investigated in which, instead of using all the blocks of a speech signal for recognition, for example, the voiced sound (vowel) part and the unvoiced sound (consonant) part of the input speech signal are detected and in which recognition is performed by using only the voiced sound (vowel) part. Furthermore, a method has also been investigated in which recognition is performed by using individual learning models or codebooks with the voiced sound (vowel) and the unvoiced sound (consonant) being discriminated.
The conventional technologies regarding speaker recognition described above are described in detail, for example, in the following references: [1] Furui: “Speaker recognition by statistical features of the Cepstrum”, Proc. of The Institute of Electronics, Information and Communication Engineers (IEICE), Vol. J65-A, No. 2, pp. 183-193 (1982), [2] F. K. Soong and A. E. Rosenberg: “On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition”, IEEE Trans. ASSP, Vol. 36, No. 6, pp. 871-879 (1988), [3] Furui: “Topic of speech individuality”, Journal of Acoustical Society of Japan (ASJ), 51, 11, pp. 876-881, (1995), [4] Matsui: “Speaker recognition by HMM”, Technical Report of IEICE, Vol. 95, No. 467, (SP95 109-116) pp. 17-24 (1996), [5] THE DIGITAL SIGNAL PROCESSING HANDBOOK, IEEE PRESS (CRC Press), 1998, [6] F. K. Soong, A. E. Rosenberg, L. R. Rabiner and B. H. Juang: “A vector quantization approach to speaker recognition”, Proc. IEEE, Int. Conf. on Acoust. Speech & Signal Processing, pp. 387-390 (1985).
Identification processes in conventional speaker detection and searching are performed in such a way that a speech signal is digitized and the digitized speech waveform is analyzed. However, recently, with the proliferation of and advances in high-efficiency speech coding technology, most speech data is stored and used in a compressed and coded format. In order to identify and search for a speaker based on the features of the speech in such speech data, it is necessary to decode all the coded speech data to be searched into a speech waveform, to analyze the features, and to perform an identification process and a searching process. Since such decoding, analysis, and identification processes must be performed on all the target speech data, a large number of computations and a long processing time are required, and furthermore, a storage area large enough to hold the decoded speech data becomes necessary. In addition, the recognition performance may deteriorate due to the influence of decoding into a speech waveform and then performing reanalysis.