Conventionally, in a method for generating an acoustic model for speech recognition, a speech dataset and a correct answer text representing the uttered contents of the speech dataset are used as learning data to perform a learning process (model parameter estimation) based on a criterion such as a maximum likelihood (ML) criterion, a maximum mutual information (MMI) criterion, a minimum classification error (MCE) criterion, a minimum word error (MWE) criterion, or a minimum phoneme error (MPE) criterion, thereby generating an acoustic model. Alternatively, a speech dataset and its correct answer text are used as adaptive (training) data to perform an adaptation process on an existing acoustic model. In either the learning process or the adaptation process, the parameters of the acoustic model are optimized so that the speech data of the learning or adaptive speech dataset is recognized successfully (see Japanese Laid-open Patent Application Publication No. 2005-283646).
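For reference, two of the criteria named above may be written in their commonly used textbook form as follows. The notation is an assumption introduced here and does not come from the cited publication: X_r denotes the acoustic observations of the r-th utterance, W_r its correct answer text, θ the acoustic model parameters, R the number of utterances, and P(W) a language model prior over candidate texts W.

```latex
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{r=1}^{R} \log p_{\theta}(X_r \mid W_r)

\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{r=1}^{R} \log
  \frac{p_{\theta}(X_r \mid W_r)\, P(W_r)}{\sum_{W} p_{\theta}(X_r \mid W)\, P(W)}
```

The ML criterion raises only the likelihood of the correct answer text, whereas the MMI criterion raises it relative to the competing candidate texts in the denominator.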
Conventionally, an acoustic model creation apparatus includes the following components. A sound analysis part extracts acoustic features from each piece of speech data stored in a speech data storage part. A frequency spectrum expansion/contraction part expands/contracts the frequency spectrum of these acoustic features in a frequency axis direction. An acoustic model generating part generates an acoustic model using the acoustic features whose frequency spectrum has been expanded/contracted or the acoustic features whose frequency spectrum has not been expanded/contracted. Accordingly, if the frequency spectrum expansion/contraction is carried out by mapping with a map function by which child speech data is obtained in a pseudo manner from adult female speech data, for example, the amount of child acoustic features may be increased in a pseudo manner based on the adult female speech data or the adult female acoustic features. Thus, even if speech data of an actual child and/or speech data of an actual unspecified speaker is not further collected, the accuracy of an acoustic model associated with child speech data may be increased, and/or the accuracy of an acoustic model for an unspecified speaker may be increased (see Japanese Laid-open Patent Application Publication No. 2003-255980).
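The expansion/contraction of a frequency spectrum in the frequency axis direction may be illustrated by the following minimal sketch. It assumes a simple linear map f -> alpha * f with linear interpolation between bins. The function names warp_spectrum and augment, the warping factors, and the toy spectrum are hypothetical and are not the map function of the cited publication.

```python
def warp_spectrum(spectrum, alpha):
    """Warp one magnitude spectrum along the frequency axis.

    alpha > 1 expands the spectrum (features move to higher bins);
    alpha < 1 contracts it. The linear map f -> alpha * f is an
    assumption of this sketch, not the cited publication's map function.
    """
    n = len(spectrum)
    warped = []
    for k in range(n):
        src = k / alpha  # position in the original spectrum that lands on bin k
        if src >= n - 1:
            # Source lies at or past the last bin: reuse the last value (assumed padding).
            warped.append(spectrum[-1])
            continue
        lo = int(src)
        frac = src - lo
        # Linear interpolation between the two neighbouring original bins.
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped


def augment(frames, alphas):
    """Return the original frames plus one warped copy per warping factor."""
    out = [list(f) for f in frames]
    for alpha in alphas:
        out.extend(warp_spectrum(f, alpha) for f in frames)
    return out


adult_frame = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # toy spectrum, peak at bin 2
pseudo_child = warp_spectrum(adult_frame, 2.0)            # peak moves to bin 4
training_frames = augment([adult_frame], [1.1, 1.2])      # 1 original + 2 warped copies
```

In this sketch, the acoustic model generating part would be trained on training_frames, which contains both the unwarped and the warped features. The factor of 2.0 is exaggerated so that the shift is visible in an eight-bin toy spectrum.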
Conventionally, a speech recognition rate estimation apparatus includes the following components. A CPU generates virtual vocalization data by performing speech synthesis using speech pieces, and simulates the influence of a sonic environment by synthesizing the generated virtual vocalization data with sonic environment data. The sonic environment data is obtained by recording noise in various environments and is superimposed on the virtual vocalization data, thereby bringing the virtual vocalization data closer to an actual speech output environment. The CPU performs speech recognition using the virtual vocalization data in which the influence of the sonic environment is simulated, thus estimating a speech recognition rate. For a word whose recognition rate is low, the recognition rate may be estimated by recording, with a microphone, actual vocalization data uttered by a user; on the other hand, for a word whose recognition rate is high, the recognition rate may be estimated based on the virtual vocalization data obtained by performing speech synthesis using speech pieces (see Japanese Unexamined Patent Application Publication No. 2003-177779).
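The superimposition of sonic environment data and the subsequent estimation of a recognition rate may be sketched as follows. The sketch assumes that the noise is scaled to a target signal-to-noise ratio (SNR) before being added, which the cited publication does not necessarily do. The names mean_power, mix_at_snr, and recognition_rate, as well as the recognize callable, are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Superimpose noise on speech at the requested SNR in decibels.

    Scaling to a target SNR is an assumption of this sketch.
    """
    # Loop the noise recording so that it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(looped)
    if p_noise == 0.0:
        return list(speech)  # silent noise: nothing to superimpose
    # Choose the gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, looped)]


def recognition_rate(recognize, labelled_utterances):
    """Fraction of (audio, word) pairs for which recognize(audio) returns word."""
    correct = sum(1 for audio, word in labelled_utterances if recognize(audio) == word)
    return correct / len(labelled_utterances)


speech = [1.0, -1.0, 1.0, -1.0]   # stand-in for virtual vocalization data
noise = [0.5, 0.5, -0.5, -0.5]    # stand-in for recorded sonic environment data
noisy = mix_at_snr(speech, noise, 0.0)
```

Here recognize stands in for whatever speech recognizer is under evaluation. Passing it a list of noisy utterances paired with their expected words yields the estimated recognition rate for that simulated environment.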