1. Field of the Invention
The present invention relates to a technique for generating speech data of a target domain from speech data of a source domain, and more particularly, to a technique for mapping the source domain speech data on the basis of the channel characteristics of the target domain speech data.
2. Description of the Related Art
The performance of speech recognition depends significantly on the acoustic environment of a target domain. That is, if there is an acoustic mismatch between the environment in which an acoustic model is trained and the environment in which speech is evaluated, the performance of a speech recognition system often deteriorates. The mismatch between the environments arises from various causes such as background noise, the acoustic characteristics of a recording apparatus, and channel distortion. Therefore, a great amount of time and labor has conventionally been spent recording speech data in a particular environment so that such an environment mismatch is avoided when an acoustic model of the target domain is constructed.
In contrast, it has recently become possible to acquire a large amount of live speech data at low cost through Internet services (for example, speech search and voice mail) that use handheld devices such as smartphones. Therefore, there is a demand for reusing the abundant speech data collected in such varied acoustic environments.
Traditionally, approaches to the cross-domain problem in speech recognition are roughly classified into the following four categories:
1. Reuse method
2. Model adaptation method
3. Feature value conversion method
4. Normalization method
The reuse method of 1. is a method of simulating target domain speech data using source domain speech data in order to construct an acoustic model of the target domain (for example, see Non-patent Literatures 1 (“Acoustic Synthesis of Training Data for Speech Recognition in Living Room Environments”) and 2 (“Evaluation of the SPLICE Algorithm on the Aurora2 Database”)).
The model adaptation method of 2. is a method of changing parameters of an acoustic model of a source domain to adapt it to test speech; Maximum A Posteriori (MAP) estimation and Maximum Likelihood Linear Regression (MLLR) correspond thereto (for example, see Patent Literature 1 (JP2012-42957A) and Non-patent Literatures 3 (“A vector Taylor series approach for environment-independent speech recognition”), 4 (“Factored Adaptation for Separable Compensation of Speaker and Environmental Variability”), and 5 (“Maximum likelihood linear transformations for HMM based speech recognition”)). As other techniques for adapting a model, Patent Literatures 2 (JP2002-529800A) and 3 (JPH10-149191A) and Non-patent Literature 6 (“Robust continuous speech recognition using parallel model combination”) also exist, though they differ from the above method.
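As a concrete illustration, MLLR adapts the Gaussian mean vectors of the source domain acoustic model with an affine transform estimated from adaptation data. The notation below is a commonly used textbook form and is not taken from the cited literatures:

```latex
% MLLR: each Gaussian mean \mu of the source-domain acoustic model
% is mapped by a shared affine transform (A, b):
\hat{\mu} = A\mu + b
% A and b are estimated so as to maximize the likelihood of the
% adaptation (test-domain) data under the transformed model.
```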
The feature value conversion method of 3. is a method of converting the feature values of test speech at decoding time to adapt them to an acoustic model of a source domain; Feature Space Maximum Likelihood Linear Regression (fMLLR) and Feature Space Maximum Mutual Information (fMMI) correspond thereto (for example, see Non-patent Literatures 3 to 5 and 7 (“fMPE: Discriminatively Trained Features for Speech Recognition”)).
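Whereas MLLR transforms the model, fMLLR applies an affine transform on the feature side. A commonly used form (notation assumed here for illustration, not taken from the cited literatures) is:

```latex
% fMLLR: each observed feature vector o_t is transformed at decoding
% time so that it better matches the source-domain acoustic model:
\hat{o}_t = A\,o_t + b
% In the constrained-MLLR view, this is equivalent to transforming
% both the means and the covariances of the acoustic model.
```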
The normalization method of 4. is a method of normalizing the distribution of feature values of test speech to adapt it to an acoustic model of a source domain; Cepstral Mean Normalization (CMN) and Mean and Variance Normalization (MVN) correspond thereto (for example, see Non-patent Literature 8 (“Experimental Analyses of Cepstral Coefficient Normalization Units”)).
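As a minimal sketch of the normalization method, assuming the cepstral features of one utterance are given as a frames-by-coefficients NumPy array (the function names here are illustrative and not taken from the cited literature):

```python
import numpy as np

def cmn(cepstra):
    """Cepstral Mean Normalization: subtract the per-utterance mean
    of each cepstral coefficient (cepstra: frames x coefficients)."""
    return cepstra - cepstra.mean(axis=0)

def mvn(cepstra, eps=1e-8):
    """Mean and Variance Normalization: additionally scale each
    cepstral coefficient to unit variance."""
    return (cepstra - cepstra.mean(axis=0)) / (cepstra.std(axis=0) + eps)
```

Normalizing per utterance in this way removes a constant channel offset (CMN) and first-order scale differences (MVN) between the test environment and the environment of the source domain acoustic model.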
These methods of 1. to 4. can be used in combination. The methods of 2. to 4. are already established techniques. On the other hand, though the method of 1. is an important technique as the starting point of all the processes, the existing techniques belonging to it cannot be applied to the speech data collected via the Internet described above.
Non-patent Literature 1 discloses a method of simulating speech of a target domain by first convolving clean speech of a source domain with an impulse response of the target domain and then adding noise (see FIG. 2A). Though this method is the most direct way of compensating for channel and noise characteristics, it is not appropriate when speech data on the Internet is the source data. This is because such source data cannot be said to be clean speech, and the channel characteristics of the input data are too varied to be represented by a single impulse response.
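The simulation described above can be sketched as follows; this is a minimal illustration assuming a measured target-domain impulse response and a separate noise recording are available (the function and parameter names are hypothetical):

```python
import numpy as np

def simulate_target_domain(clean, impulse_response, noise, snr_db):
    """Simulate target-domain speech: convolve clean source speech
    with a target-domain impulse response, then add noise scaled to
    the requested signal-to-noise ratio (in dB)."""
    # Convolve with the room/channel impulse response, keeping the
    # original signal length.
    reverberant = np.convolve(clean, impulse_response)[: len(clean)]
    # Scale the noise so that the reverberant signal and the added
    # noise have the requested power ratio.
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise[: len(reverberant)] ** 2)
    gain = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise[: len(reverberant)]
```

This sketch also makes the limitation visible: a single fixed `impulse_response` implies one channel characteristic, whereas Internet-collected source data mixes many channels.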
Non-patent Literature 2 discloses a mapping method using stereo data. That is, the technique of Non-patent Literature 2 requires simultaneous recording of source domain speech data and target domain speech data. When the source speech data is live data on the Internet, it is difficult to prepare such stereo data, and therefore the method cannot be used.
There is also disclosed a technique of constructing a speech corpus for a target task by selecting speech data corresponding to the target task from an existing speech corpus (see Non-patent Literature 9 (“Utterance-based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models”)).
Non-patent Literatures 10 (“Cepstral compensation using statistical linearization”) and 11 (“Speech recognition in noisy environments using first-order vector Taylor series”) are cited as background art disclosing a technique of calculating a Gaussian mixture model (GMM) of observed speech by the Vector Taylor Series (VTS) approximation, from a GMM of clean speech prepared in advance and a relational expression between the clean speech and the observed speech.
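A commonly used form of the relational expression between clean and observed speech that underlies the VTS approximation, written in the cepstral domain, is shown below. The notation is assumed here for illustration (C denotes the discrete cosine transform matrix) and is not taken verbatim from the cited literatures:

```latex
% Observed speech y in terms of clean speech x, channel h, and noise n:
y = x + h + g(x, n, h), \qquad
g(x, n, h) = C \log\bigl(1 + \exp\bigl(C^{-1}(n - x - h)\bigr)\bigr)

% A first-order Taylor expansion of g around the clean-speech GMM mean
% \mu_x (and the noise mean \mu_n) yields the mean of the corresponding
% observed-speech GMM component:
\mu_y \approx \mu_x + h + g(\mu_x, \mu_n, h)
```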