This invention relates generally to a feature space for a speech recognition system and, more particularly, to discriminatively training a feature-space transform.
A state-of-the-art automatic speech recognition (ASR) system is usually trained on data from more than a few hundred speakers in a target domain to provide robustness. Since ASR performance is highly dependent on the acoustic environment in the target domain, an acoustic model (AM) in the system should ideally be built with a large amount of target-domain data.
Modern automatic speech recognition (ASR) systems are trained with a large amount of training data. Two types of training data can be used for training ASR systems: manually transcribed data and automatically transcribed data.
Manually transcribed data are ideal for AM training. However, having humans transcribe a large amount of data involves enormous cost. Therefore, only a limited amount of field data is transcribed and used for AM training.
Automatically transcribed data can be generated at lower cost than manually transcribed data, and they are very effective because they complement speaker and environmental variations, including speaking styles, that are not covered by the manually transcribed data. There are several ways to generate high-quality automatically transcribed data. However, such data still include ASR (transcription) errors. These transcription errors are harmful to discriminative training (DT) techniques, because DT techniques are based on a distance measure between recognition results and the corresponding transcriptions.
If the reference text corresponding to an utterance contains errors, part of the statistics for DT is accumulated incorrectly, which has a negative impact on training.
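By way of illustration only, the effect of an erroneous reference on a distance-based criterion can be sketched as follows. The sketch assumes a word-level Levenshtein distance as a stand-in for the distance measure used in DT; the function name `edit_distance` and the example utterance are hypothetical and are not taken from any particular system.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists.

    Assumption: this stands in for the distance measure between a
    recognition result and its transcription that DT relies on.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical utterance and two competing recognition hypotheses.
spoken = "turn on the kitchen light".split()
hyp_correct = "turn on the kitchen light".split()
hyp_wrong = "turn on the chicken light".split()

# Hypothetical automatic transcription containing one ASR error.
noisy_ref = "turn on the chicken light".split()

# Against the true word sequence, the correct hypothesis is closer.
clean_scores = (edit_distance(spoken, hyp_correct),
                edit_distance(spoken, hyp_wrong))

# Against the erroneous reference, the ordering is reversed, so
# statistics would be accumulated in favor of the wrong hypothesis.
noisy_scores = (edit_distance(noisy_ref, hyp_correct),
                edit_distance(noisy_ref, hyp_wrong))
```

In this sketch the erroneous reference makes the incorrect hypothesis appear to be the better one, which is the mechanism by which statistics for DT come to be accumulated incorrectly.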