The scarcity of accurately transcribed, domain-specific training data is arguably the biggest obstacle to a more widespread and successful deployment of Automatic Speech Recognition (ASR) technology. By contrast, enormous amounts of highly varied, non-domain speech data are available on the Internet and in various speech databases.
As an example, suppose a Mobile Network Operator (MNO) in Italy intends to offer its customers a service that sends them automatically generated transcriptions of the voicemails they receive, in the form of text messages. In this example, the “domain” is the MNO voicemail environment, and the domain-specific training data include the actual voicemail messages received and stored by the MNO voicemail utility.
Data security legislation or internal rules, however, may prevent the MNO from retaining the voicemails it receives beyond a few days, and from making the messages available for ASR training. Furthermore, privacy concerns may require that nobody but the recipient be allowed to listen to the messages, so that manually transcribing them is not feasible.
Large amounts of Italian speech data are available, however, for example from radio and TV shows, parliamentary debates, and selected contact center data, to name just a few. ASR models trained on these data nevertheless perform very poorly in a voicemail environment because of a strong mismatch in speech characteristics between the domain-specific data and the non-domain data. There may also be a mismatch in terms of content (e.g., topics discussed and phrases used), but such mismatches are beyond the scope of this description.
It is well known that training the acoustic models of an ASR system with accurately labeled speech data that are well matched to the Application Target Domain (ATD) is essential for high-performance speech recognition. However, in many real-life applications, it is not possible to acquire labeled speech data for ASR training directly from the application, for example, when the application is new or when privacy or security concerns prohibit the use of the data and/or the manual labeling of the data.