The present invention relates to mapping between a speech signal and a transcript of the speech signal.
To train an acoustic model (AM) used for speech recognition, speech data aligned with a transcript of the speech data is required. The speech data may be aligned with the transcript by time indices, each indicating which time range of the speech data corresponds to which phone of the transcript. The accuracy of the alignment has a large impact on the quality of the acoustic model. The alignment is difficult when the speech data relates to a long speech, and for the alignment it is desirable that the speech data relates to a speech of at most several tens of seconds (e.g., 30 seconds). Thus, the speech data is usually segmented into utterances by referring to pauses, and the utterances are then transcribed.
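As an illustration only, segmentation by referring to pauses may be sketched as follows. This is a minimal sketch under stated assumptions, not a method described herein: a pause is assumed to be a run of low-energy frames, and the function name `segment_by_pauses` and the parameters `frame_ms`, `energy_threshold`, and `min_pause_ms` are hypothetical names with arbitrary default values.

```python
from typing import List, Sequence, Tuple


def segment_by_pauses(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,              # assumed frame length
    energy_threshold: float = 1e-3,  # assumed silence threshold (mean squared amplitude)
    min_pause_ms: int = 300,         # assumed minimum pause length that splits utterances
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) time ranges of utterances separated by pauses."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    # Mark a frame as voiced when its mean squared amplitude exceeds the threshold.
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        voiced.append(energy > energy_threshold)

    utterances = []    # (first_frame, end_frame_exclusive) pairs
    utt_start = None   # first frame of the utterance currently open, if any
    silent_run = 0     # consecutive silent frames since the last voiced frame
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if utt_start is None:
                utt_start = i
            silent_run = 0
        elif utt_start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the utterance just after
                # its last voiced frame.
                utterances.append((utt_start, i - silent_run + 1))
                utt_start = None
                silent_run = 0
    if utt_start is not None:
        # Close an utterance still open at the end, dropping trailing silence.
        utterances.append((utt_start, len(voiced) - silent_run))

    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in utterances]
```

Each returned pair is a time range in seconds, which is the same kind of time index by which speech data may be aligned with a transcript. In practice the threshold and the minimum pause length would have to be tuned to the recording conditions.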
Some web sites may store many pairs of speech data and a transcript. However, the speech data in most of these pairs is not necessarily segmented into utterances of lengths appropriate for the alignment. In addition, some portions of the transcript are sometimes modified or deleted for better readability, so a straightforward alignment method cannot be applied.