Speech recognition technology is used in a wide variety of applications such as information input, information retrieval, speech input support, video indexing, speaker recognition, personal identification by speech, tone measurement and environment measurement. In order to enhance the accuracy of speech recognition, an attempt has been made to reduce the effect of a variation factor caused by a transmission channel, noise or the like by learning an acoustic model.
FIG. 10 shows a schematic example of an acoustic model learning device that implements the acoustic model learning technique disclosed in Non-Patent Document 1 and Non-Patent Document 2. As shown therein, an acoustic model learning device 1 includes a speech data storing means 11, a channel label storing means 12, a speaker independent model learning means 13, a channel model learning means 14, a speaker independent model storing means 15, and a channel model storing means 16.
The speech data storing means 11 stores sample speech data acquired through various transmission channels. A transmission channel is the type of physical device that speech from a speech source, such as a speaker, passes through before it is recorded. Examples include a fixed phone (including a fixed phone terminal and a fixed communication line), a mobile phone (including a mobile phone terminal and a mobile communication line), and a vocal microphone. Hereinafter, a transmission channel is also referred to simply as a channel.
Further, even if the content of a speech is the same, the speech as data differs depending on whether the speaker is female or male. Likewise, even with the same speech content and the same speaker, the speech as data differs depending on whether it is recorded through a fixed phone or a mobile phone. A factor such as a speech source or a transmission channel that has a plurality of different types, where the difference in type causes the speech to vary, is called an acoustic environment.
The channel label storing means 12 of the acoustic model learning device 1 stores label data which corresponds to sample speech data stored in the speech data storing means 11 and indicates a channel which the sample speech data has passed through.
The speaker independent model learning means 13 receives the sample speech data and the label data from the speech data storing means 11 and the channel label storing means 12, respectively, removes a variation component that is dependent on the acoustic environment of a channel from the sample speech data and extracts only a variation component that is dependent on the acoustic environment of a speaker, thereby learning a speaker independent acoustic model. In the following description, the “speaker independent acoustic model” is also referred to as a “speaker independent model”.
The channel model learning means 14 receives the sample speech data and the label data from the speech data storing means 11 and the channel label storing means 12, respectively, and, with respect to each channel, learns an affine transformation parameter which corresponds to an acoustic model of the channel. Specifically, on the assumption that the channel acoustic model is obtained by applying an affine transformation to the speaker independent model, the channel acoustic model is determined by learning the parameter of that transformation. In the following description, the “channel acoustic model” is also referred to as a “channel model”.
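As a rough illustration of the affine assumption only, the following Python sketch derives a channel model mean from a speaker independent model mean as mu_c = A·mu + b. The function name `affine_transform`, the two-dimensional feature vectors, the channel names used as dictionary keys, and all numeric values are hypothetical; they are not taken from Non-Patent Document 1 or Non-Patent Document 2.

```python
def affine_transform(matrix, bias, vector):
    """Return matrix * vector + bias, using plain Python lists."""
    return [
        sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) + b_i
        for row, b_i in zip(matrix, bias)
    ]


# Hypothetical speaker independent model mean (2-dimensional feature).
si_mean = [1.0, 2.0]

# Hypothetical affine transformation parameters: one (A, b) pair per channel.
channel_params = {
    "fixed_phone": ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
    "mobile_phone": ([[2.0, 0.0], [0.0, 0.5]], [0.0, 1.0]),
}

# Each channel model mean is the affine image of the shared mean.
channel_means = {
    name: affine_transform(A, b, si_mean)
    for name, (A, b) in channel_params.items()
}
```

In this sketch only the pair (A, b) differs between channels, which mirrors the idea that one speaker independent model is stored together with one affine transformation parameter per channel.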
Note that the speaker independent model learning means 13 and the channel model learning means 14 cooperate to perform the iterative method described in Non-Patent Document 3. They update the speaker independent acoustic model and the affine transformation parameter (channel acoustic model) and, after the iterative method converges, output the final speaker independent acoustic model and affine transformation parameter.
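The cooperative update can be pictured with a deliberately simplified stand-in. The Python sketch below is not the method of Non-Patent Document 3; it is a toy alternating least-squares loop over scalar values, assuming that each channel observes a shared model value m_k as a_c·m_k + b_c. The names `fit_line`, `alternate_updates`, and `observed`, as well as the data, are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys ~ a * xs + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def alternate_updates(observed, iterations=20, tol=1e-12):
    """observed maps a channel label to a list of per-class scalar means."""
    channels = list(observed)
    n_classes = len(observed[channels[0]])
    # Initialise the shared model as the average over channels.
    shared = [
        sum(observed[c][k] for c in channels) / len(channels)
        for k in range(n_classes)
    ]
    params = {}
    for _ in range(iterations):
        # Step 1: with the shared model fixed, refit each channel's (a, b).
        params = {c: fit_line(shared, observed[c]) for c in channels}
        # Step 2: with every (a, b) fixed, re-estimate the shared model.
        denom = sum(a * a for a, _ in params.values())
        updated = [
            sum(params[c][0] * (observed[c][k] - params[c][1]) for c in channels)
            / denom
            for k in range(n_classes)
        ]
        change = max(abs(u - s) for u, s in zip(updated, shared))
        shared = updated
        if change < tol:  # stop once the shared model no longer moves
            break
    return shared, params


# Hypothetical per-class means observed through two channels.
observed = {
    "fixed_phone": [0.5, 1.5, 3.5],
    "mobile_phone": [-1.0, 1.0, 5.0],
}
shared, params = alternate_updates(observed)
```

In this toy setting the shared model is determined only up to an overall scale and offset, because any such change can be absorbed into the per-channel parameters; what the loop drives down is the mismatch between each observed channel value and its affine reconstruction.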
The speaker independent model storing means 15 receives and stores the speaker independent model from the speaker independent model learning means 13, and the channel model storing means 16 receives and stores the channel model from the channel model learning means 14.
According to the acoustic model learning device 1, with respect to each channel, the affine transformation parameter specific to each channel can be acquired. Therefore, it is considered that, by applying the affine transformed acoustic model to the speech data input from any known channel or executing inverse affine transformation on the speech data, it is possible to reduce a variation factor due to a channel and correctly recognize a recognition target.
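The inverse affine transformation mentioned above can be sketched as follows. For brevity this Python example assumes a diagonal transformation matrix with non-zero entries, so that the inverse reduces to a per-dimension division; the function name `inverse_diagonal_affine` and the numeric values are hypothetical.

```python
def inverse_diagonal_affine(scale, bias, feature):
    """Undo x = scale * s + bias in each dimension and return s.

    Assumes every entry of scale is non-zero (diagonal affine transform).
    """
    return [(x - b) / a for a, b, x in zip(scale, bias, feature)]


# Hypothetical affine transformation parameter learned for one known channel.
mobile_scale = [2.0, 0.5]
mobile_bias = [0.0, 1.0]

# Hypothetical feature vector observed through that channel.
observed_feature = [2.0, 2.0]

# Map the observed feature back toward the channel-independent space.
normalized_feature = inverse_diagonal_affine(
    mobile_scale, mobile_bias, observed_feature
)
```

A full (non-diagonal) transformation matrix would instead require solving a linear system for each feature vector.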
[Non-Patent Document 1]
    D. A. Reynolds, “Channel robust speaker verification via feature mapping,” Proc. ICASSP2003, Vol. II, pp. 53-56, 2003
[Non-Patent Document 2]
    D. Zhu et al., “A generalized feature transformation approach for channel robust speaker verification,” Proc. ICASSP2007, Vol. IV, pp. 61-64, 2007
[Non-Patent Document 3]
    T. Anastasakos et al., “A compact model for speaker-adaptive training,” Proc. ICSLP96, 1996