1. Field of the Invention
The present invention relates to an information retrieving method and an information retrieving apparatus. More particularly, it relates to an information retrieving method and an information retrieving apparatus in which a speaker of the speech information is recognized and discriminated to detect and retrieve the speaking position of a desired speaker. This application claims priority of Japanese Patent Application No. 2002-017621, filed on Jan. 25, 2002, the entirety of which is incorporated by reference herein.
2. Description of Related Art
Recently, it has become common to digitize speech signals, compress and encode the digital speech signals to reduce their information volume, and store the encoded information in a storage device or on a recording medium for later use. In particular, there has been developed a digital speech recorder, or a so-called IC recorder, in which speech, such as conversation in a conference or an interview, is encoded by a speech encoding technique and recorded on a semiconductor storage device (memory) or on a storage medium employing a semiconductor memory (memory card).
The configuration of a conventional IC recorder is shown in FIG. 6, in which an IC recorder 100 is made up of a microphone 101, an A/D (analog to digital) converter 102, a speech encoder 103, a speech decoder 104, a D/A (digital to analog) converter 105, a loudspeaker 106, a semiconductor storage device (memory) 107, an information transmission unit 108, and an output terminal 109. A semiconductor storage medium (memory card) may also be used in place of the semiconductor storage device 107.
The speech signals, input via the microphone 101, are converted into digital signals by the A/D converter 102 and compression-coded by the speech encoder 103 so as to be stored in the semiconductor storage device 107. The compression-coded speech data, thus stored in the semiconductor storage device 107, are either read out and decoded by the speech decoder 104 and converted by the D/A converter 105 into analog signals so as to be output at the loudspeaker 106, or read out by the information transmission unit 108 so as to be transmitted to outside equipment via the output terminal 109.
Meanwhile, there has also been developed a system in which, in recording speech data in the IC recorder, simple additional information or attribute information of speech data, such as data name, date or simple comments, can be recorded along with the speech data.
In many cases, the IC recorder has a random access function of pre-registering a position in the speech data as the index information for reproduction promptly from the registered position at the time of reproduction. In the IC recorder, simple comments pertinent to the registered position can be appended as the tag information.
The speech encoding system used in the IC recorder is hereinafter explained. The speech encoding system may roughly be classified into a waveform encoding, an analysis-by-synthesis encoding and a hybrid encoding which is a combination of the waveform encoding and the analysis-by-synthesis encoding.
The waveform encoding encodes the speech waveform so that the waveform may be reproduced as faithfully as possible. The analysis-by-synthesis encoding expresses the signals by parameters, based on the speech generating model, for encoding.
There exist a variety of techniques and apparatus for waveform encoding. Examples include sub-band coding, in which audio signals on the time axis are split into plural frequency bands and encoded without blocking, and transform coding, in which the signals on the time axis are blocked every unit time and transformed into spectral components which are then encoded. A high-efficiency encoding technique combining the aforementioned sub-band coding and transform coding has also been proposed. With this technique, the time domain signals are split into plural frequency bands by sub-band coding, the signals of the respective bands are orthogonally transformed into signals on the frequency axis, and the resulting frequency domain signals are encoded band by band.
As for analysis-by-synthesis encoding, research into systems employing linear predictive coding (LPC) is currently under way. Examples of such encoding include harmonic encoding, multi-pulse driven linear predictive coding (MPC) employing the analysis-by-synthesis method (A-b-S), and code excited linear prediction (CELP) coding.
In general, in an encoding system employing LPC analysis, the spectral envelope information is extracted by linear predictive coding (LPC) analysis, and the LPC information is transformed into PARCOR (PARtial auto-CORrelation) coefficients or LSP (line spectrum pair) coefficients for quantization and encoding. A hybrid system combining the analysis-by-synthesis encoding by LPC analysis with waveform encoding of the LPC residual signals has also been researched. This system is routinely used in IC recorders for recording conferences.
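As a minimal illustrative sketch, the LPC analysis mentioned above may be carried out by the autocorrelation method with the Levinson-Durbin recursion, which yields the PARCOR (reflection) coefficients as a by-product. The function names and the sign convention chosen for the PARCOR coefficients are assumptions made for illustration only; windowing of the frame and the conversion to LSP coefficients are omitted.

```python
def autocorrelation(frame, order):
    """Return the autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, parcor, err), where a[0] == 1.0 and the prediction
    error filter is A(z) = sum(a[k] * z**-k). The sign of the PARCOR
    coefficients follows one common convention (an assumption here).
    """
    a = [1.0] + [0.0] * order
    parcor = []
    err = r[0]                      # prediction error energy
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err            # m-th reflection coefficient
        parcor.append(k_m)
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a, parcor, err
```

For a first-order autoregressive signal with autocorrelation values proportional to 1, 0.9, 0.81, the recursion returns a[1] close to -0.9 and a[2] close to zero, that is, the second-order term adds nothing to the prediction.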
FIG. 7 shows a schematic structure of a routine speech encoding system employing the LPC analysis. In FIG. 7, an LPC analysis unit 201 performs LPC analysis on speech signals D200 input via an input device 200 to find LPC coefficients D201. The LPC analysis unit 201 sends the LPC coefficients D201 to an LSP conversion unit 202.
The LSP conversion unit 202 converts the LPC coefficients D201 into LSP parameters D202 to route the LSP parameters D202 to an LSP quantizer 203, which LSP quantizer 203 quantizes the LSP parameters D202. Since the LSP parameters undergo deterioration on quantization to a lesser extent than the LPC coefficients, the routine practice is to perform conversion to the LSP parameters followed by quantization. Meanwhile, the technique for quantizing the LSP parameters is usually vector quantization.
An LPC inverse filtering unit 204 inverse quantizes the quantized LSP parameters D203 and further inverse transforms the parameters into LPC coefficients D204, which are then used for filtering the input signals D200 to extract LPC residual signals D205. The LPC inverse filtering unit 204 routes the extracted LPC residual signals D205 to a pitch analysis unit 205 and a pitch inverse filtering unit 207.
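The LPC inverse filtering described above may be sketched as follows, assuming that the LPC coefficients are given as a list a with a[0] equal to 1, so that the prediction error filter is A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p. The function name and the zero initial filter state are assumptions for illustration; the inverse quantization of the parameters is omitted.

```python
def lpc_inverse_filter(signal, a):
    """Filter the signal with A(z) to obtain the LPC residual.

    a[0] is expected to be 1.0. Samples before the start of the
    signal are treated as zero (an assumption of this sketch).
    """
    order = len(a) - 1
    residual = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(order + 1):
            if n - k >= 0:
                acc += a[k] * signal[n - k]
        residual.append(acc)
    return residual


# A decaying exponential is perfectly predicted by a first-order filter,
# so only the very first sample survives in the residual.
signal = [0.9 ** n for n in range(8)]
residual = lpc_inverse_filter(signal, [1.0, -0.9])
```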
The pitch analysis unit 205 applies pitch analysis to the so found LPC residual signals D205 and sends the pitch information D206, such as pitch lag or pitch gain, resulting from the analysis, to a pitch quantizer 206, which then quantizes the pitch information D206.
The pitch inverse filtering unit 207 filters the LPC residual signals D205, using the pitch information D208 obtained on inverse quantizing the quantized pitch information D207, to remove pitch components from the LPC residual signals D205. The pitch inverse filtering unit 207 sends the resulting flattened residual signals D209 to an orthogonal transform unit 208.
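The pitch analysis and pitch inverse filtering described above may be sketched as follows. This is a simplified single-tap illustration: the lag search by normalized correlation, the search range and the function names are assumptions, and the quantization and inverse quantization of the pitch information are omitted.

```python
def pitch_analysis(residual, min_lag, max_lag):
    """Return (pitch_lag, pitch_gain) for a single-tap pitch predictor.

    The lag maximizing num**2 / den is chosen, where num is the
    correlation between the residual and its delayed copy and den is
    the energy of the delayed copy.
    """
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        num = sum(residual[n] * residual[n - lag]
                  for n in range(lag, len(residual)))
        den = sum(residual[n - lag] ** 2
                  for n in range(lag, len(residual)))
        if den <= 0.0 or num <= 0.0:
            continue
        score = num * num / den
        if score > best_score:
            best_lag, best_gain, best_score = lag, num / den, score
    return best_lag, best_gain


def pitch_inverse_filter(residual, lag, gain):
    """Subtract the pitch prediction to flatten the residual."""
    return [v - gain * residual[n - lag] if n >= lag else v
            for n, v in enumerate(residual)]


# Pulse train with a period of 40 samples as a stand-in LPC residual.
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
lag, gain = pitch_analysis(pulses, 20, 100)
flattened = pitch_inverse_filter(pulses, lag, gain)
```

In this example every pulse except the first is predicted from the pulse one period earlier, so the flattened residual contains a single non-zero sample.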
The orthogonal transform unit 208 transforms the residual signals D209 into spectral coefficients D210. A spectral quantizing unit 209 quantizes the spectral coefficients D210. In quantizing the spectral coefficients D210, either vector quantization or a combination of quantization based on a psychoacoustic model and Huffman coding is used.
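As a minimal sketch of the orthogonal transform and spectral quantization, the following uses an orthonormal DCT-II as the transform and a plain uniform scalar quantizer. Both choices are assumptions for illustration: the text does not name a particular transform, and the uniform quantizer is a deliberately simpler stand-in for the vector quantization or psychoacoustic quantization with Huffman coding mentioned above.

```python
import math


def dct(x):
    """Orthonormal DCT-II of one block of residual samples."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                  for i in range(n))
        coeffs.append(scale * acc)
    return coeffs


def uniform_quantize(coeffs, step):
    """Map each coefficient to the index of the nearest multiple of step."""
    return [round(c / step) for c in coeffs]


# A constant block has all its energy in the first (DC) coefficient.
indices = uniform_quantize(dct([1.0] * 8), step=0.5)
```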
The quantized LSP parameters D203, quantized pitch information D207, quantized spectral data D211 and other subsidiary information are sent to a bit synthesis unit 210, where an encoded bitstream D212 is generated in accordance with a prescribed data format and output at an output unit 211.
FIG. 8 shows an illustrative recording format for encoded speech data generated by a speech encoding device employing the LPC analysis such as is shown in FIG. 7. Referring to FIG. 8, the encoded speech data is made up of the subsidiary data, such as data identification numbers, data names or data attributes, and block data of the speech information. The block data, in turn, is made up of, for example, the header, block-based subsidiary information, pitch information, LSP information and the spectral information.
FIG. 9 shows the schematic structure of a speech decoding device which is a counterpart device of the speech encoding device shown in FIG. 7. In FIG. 9, a bit decomposing unit 221 decomposes encoded data D220, input from an input unit 220 every predetermined block, into several partial elements. For example, the bit decomposing unit 221 decomposes the encoded data D220 into the quantized LSP information D221, quantized pitch information D222 and the quantized residual spectral information D223, on the block basis. The bit decomposing unit 221 sends the quantized LSP information D221, quantized pitch information D222 and the quantized residual spectral information D223 to an LSP inverse quantizing unit 222, a pitch inverse quantizing unit 223 and a spectral inverse quantizing unit 224, respectively.
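The bit synthesis on the encoder side and the bit decomposition on the decoder side may be sketched as a pair of pack and unpack routines. The byte layout below (sync word, field widths, field order within the header) is entirely hypothetical and is not the format of FIG. 8; it only follows the general ordering of header, block-based subsidiary information, pitch information, LSP information and spectral information.

```python
import struct

SYNC = 0xA55A  # hypothetical sync word marking the start of a block


def pack_block(block_no, pitch_lag, pitch_gain_idx, lsp_idx, spec_idx):
    """Assemble one block: header, subsidiary info, pitch, LSP, spectrum."""
    header = struct.pack(">HH", SYNC, block_no)
    subsidiary = struct.pack(">BB", len(lsp_idx), len(spec_idx))
    pitch = struct.pack(">BB", pitch_lag, pitch_gain_idx)
    lsp = bytes(lsp_idx)                                  # indices 0..255
    spec = struct.pack(">%db" % len(spec_idx), *spec_idx)  # signed bytes
    return header + subsidiary + pitch + lsp + spec


def unpack_block(data):
    """Decompose one block into its partial elements."""
    sync, block_no, n_lsp, n_spec = struct.unpack_from(">HHBB", data, 0)
    if sync != SYNC:
        raise ValueError("sync word not found")
    offset = 6
    pitch_lag, pitch_gain_idx = struct.unpack_from(">BB", data, offset)
    offset += 2
    lsp_idx = list(data[offset:offset + n_lsp])
    offset += n_lsp
    spec_idx = list(struct.unpack_from(">%db" % n_spec, data, offset))
    return {"block_no": block_no, "pitch_lag": pitch_lag,
            "pitch_gain_idx": pitch_gain_idx,
            "lsp_idx": lsp_idx, "spec_idx": spec_idx}


blob = pack_block(7, 40, 3, [12, 200, 5], [6, 0, -1, 2])
fields = unpack_block(blob)
```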
The LSP inverse quantizing unit 222 inverse quantizes the quantized LSP information D221 to generate LSP parameters, which LSP parameters are then transformed into LPC coefficients D224. The LSP inverse quantizing unit 222 sends the LPC coefficients D224 to an LPC synthesis unit 227.
The pitch inverse quantizing unit 223 inverse quantizes the quantized pitch information D222 to generate the pitch information D225, such as pitch period or pitch gain. The pitch inverse quantizing unit 223 sends the pitch information D225 to a pitch synthesis unit 226.
The spectral inverse quantizing unit 224 inverse quantizes the quantized residual spectral information D223 to generate a residual spectral data D226 which is supplied to an inverse orthogonal transform unit 225.
The inverse orthogonal transform unit 225 applies inverse orthogonal transform to the residual spectral data D226 for conversion to a residual waveform D227. The inverse orthogonal transform unit 225 sends the residual waveform D227 to the pitch synthesis unit 226.
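The spectral inverse quantization and inverse orthogonal transform may be sketched as follows, again assuming, for illustration only, a uniform scalar quantizer with a known step size and an orthonormal DCT-II on the encoder side, whose inverse is the correspondingly scaled DCT-III. The function names are hypothetical.

```python
import math


def dequantize(indices, step):
    """Recover approximate spectral coefficients from quantizer indices."""
    return [i * step for i in indices]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II (a scaled DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = coeffs[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            acc += (coeffs[k] * math.sqrt(2.0 / n)
                    * math.cos(math.pi * (i + 0.5) * k / n))
        out.append(acc)
    return out


# Only the DC index is non-zero, so the residual waveform is constant.
waveform = idct(dequantize([6, 0, 0, 0, 0, 0, 0, 0], step=0.5))
```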
The pitch synthesis unit 226 filters the residual waveform D227, using the pitch information D225, supplied from the pitch inverse quantizing unit 223, to synthesize an LPC residual waveform D228. The pitch synthesis unit 226 sends this LPC residual waveform D228 to the LPC synthesis unit 227.
The LPC synthesis unit 227 filters the LPC residual waveform D228, using the LPC coefficients D224 supplied from the LSP inverse quantizing unit 222, to synthesize a speech waveform D229. The LPC synthesis unit 227 sends this speech waveform D229 to an output unit 228.
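The pitch synthesis and LPC synthesis filtering may be sketched as two recursive filters. The single-tap pitch synthesis filter and the convention that the LPC coefficients are given as a list a with a[0] equal to 1 are assumptions for illustration, as are the function names and the zero initial filter states.

```python
def pitch_synthesis(flat_residual, lag, gain):
    """Restore the pitch periodicity: out[n] = in[n] + gain * out[n - lag]."""
    out = []
    for n, v in enumerate(flat_residual):
        past = out[n - lag] if n >= lag else 0.0
        out.append(v + gain * past)
    return out


def lpc_synthesis(residual, a):
    """All-pole filter 1/A(z): out[n] = in[n] - sum(a[k] * out[n - k])."""
    out = []
    for n, e in enumerate(residual):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


# A single impulse is turned into a pulse train, then spectrally shaped.
excitation = pitch_synthesis([1.0] + [0.0] * 9, lag=4, gain=0.5)
speech = lpc_synthesis(excitation, [1.0, -0.9])
```

These filters are the inverses of the single-tap pitch inverse filter and the LPC inverse filter used on the encoder side, provided the same lag, gain and coefficients are applied.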
Techniques for discriminating the speaker of a speech waveform, explained hereinafter, are also being actively researched.
As a routine speaker recognition technique, the following technique, for example, is used. First, characteristic values representative of the individuality of a speaker's speech signals are extracted and pre-recorded as learning data. The input speech of a speaker is then analyzed, and characteristic values indicative of his or her individuality are extracted and evaluated for similarity with the learning data to discriminate and collate the speaker. As the characteristic values representative of the individuality of speech, the cepstrum, for example, is used. Alternatively, LPC analysis is applied to the speech signals to find LPC coefficients, which are then transformed to produce LPC cepstrum coefficients usable as the characteristic values. The coefficients obtained on expanding the time sequence of the cepstrum or LPC cepstrum coefficients into a polynomial, termed delta cepstrum, are used preferentially as characteristic values indicative of temporal changes of the speech spectrum. Additionally, the pitch or the delta pitch (coefficients obtained on polynomial expansion of the pitch) may also be used.
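The derivation of LPC cepstrum coefficients from LPC coefficients, and of delta cepstrum from a sequence of cepstrum frames, may be sketched as follows. The sketch assumes LPC coefficients a with a[0] equal to 1 for the filter A(z) = 1 + a[1]z^-1 + ..., takes the cepstrum of 1/A(z) by the standard recursion, and approximates the delta cepstrum by the first-order (slope) coefficient of a least-squares line fitted over a few neighbouring frames. The window width, the edge handling and the function names are assumptions for illustration.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]


def delta_features(frames, width=2):
    """Slope of each coefficient over +/- width frames (edges clamped)."""
    denom = sum(j * j for j in range(-width, width + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(j * frames[min(max(t + j, 0), last)][d]
                      for j in range(-width, width + 1))
            row.append(num / denom)
        deltas.append(row)
    return deltas


# For A(z) = 1 - 0.9 z^-1 the cepstrum is 0.9**n / n.
ceps = lpc_to_cepstrum([1.0, -0.9], 3)
```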
The learning data is prepared using the characteristic values, such as LPC (linear predictive coding) cepstrum, thus extracted, as standard patterns. As a method therefor, a method by vector quantization distortion or a method by HMM (Hidden Markov model) is preferentially used.
In the method by vector quantization distortion, the speaker-based characteristic values are grouped and the center of gravity values are stored as elements (code vectors) of a codebook. The characteristic values of the input speech are vector-quantized with the codebook of each speaker to find the average quantization distortion of each codebook with respect to the entire input speech. The speaker of the codebook with the smallest average quantization distortion is selected.
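The selection by average quantization distortion described above may be sketched as follows, assuming that a codebook of code vectors has already been prepared for each speaker and that squared Euclidean distance is used as the distortion measure. The distortion measure, the speaker labels and the function names are assumptions for illustration; the grouping step that produces the codebooks is omitted.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def average_distortion(features, codebook):
    """Mean distortion when each feature is quantized to its nearest code vector."""
    total = sum(min(squared_distance(f, c) for c in codebook)
                for f in features)
    return total / len(features)


def identify_speaker(features, codebooks):
    """Return (best_speaker, scores) over a dict of speaker codebooks."""
    scores = {name: average_distortion(features, cb)
              for name, cb in codebooks.items()}
    return min(scores, key=scores.get), scores


# Hypothetical two-dimensional codebooks for two speakers.
codebooks = {
    "speaker_A": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_B": [[5.0, 5.0], [6.0, 6.0]],
}
speaker, scores = identify_speaker([[0.1, 0.0], [0.9, 1.1]], codebooks)
```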
With the method by HMM, the speaker-based characteristic values, found as described above, are modeled by the transition probabilities among the HMM states and the probability of occurrence of the characteristic values in each state, and the speaker is determined over the entire input speech domain based on the average likelihood with respect to each model.
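The evaluation by average likelihood may be sketched with the scaled forward algorithm. As a simplification that is not stated in the text above, the sketch assumes a discrete-observation HMM, in which each frame has already been mapped to a symbol index (for example a codebook index), and assumes that every observed symbol has non-zero probability under each model. The model parameters and function names are hypothetical.

```python
import math


def average_log_likelihood(obs, start, trans, emit):
    """Per-frame log-likelihood of a symbol sequence under one HMM.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of observing symbol o in state i
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    scale = sum(alpha)
    log_like = math.log(scale)
    alpha = [v / scale for v in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states))
                 * emit[j][o] for j in range(n_states)]
        scale = sum(alpha)          # rescale to avoid numerical underflow
        log_like += math.log(scale)
        alpha = [v / scale for v in alpha]
    return log_like / len(obs)


def identify_speaker_hmm(obs, models):
    """Pick the speaker whose model gives the highest average likelihood."""
    scores = {name: average_log_likelihood(obs, *m)
              for name, m in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical single-state models that differ only in emission probabilities.
models = {
    "speaker_A": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "speaker_B": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
speaker, scores = identify_speaker_hmm([0, 0, 0, 1], models)
```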
Meanwhile, if, in a conventional IC recorder employing a semiconductor storage device, the conversation of a desired speaker in the recorded speech data is to be accessed and reproduced, the IC recorder must have the function of registering the index information, and the index information must be registered in the IC recorder in advance. To register the index information, a person must check the entire domain of the speech data by listening and viewing to locate the data portions containing the speaker's conversation, which is an extremely laborious operation.
Moreover, even if the index information is registered, it is not easy to grasp in which data portions and how frequently the desired speaker is speaking.
With an IC recorder lacking the function of registering the index information or the tag information, the data portions including the speaker's conversation cannot be detected or retrieved, so that it is not possible to reproduce the data from the point where the desired speaker begins speaking or to partially reproduce only the conversation domain of the desired speaker.