This invention relates to speech recognition and, more particularly, to apparatus and methods for identifying mismatches between assumed pronunciations of words, e.g., from transcriptions, and actual pronunciations of words, e.g., from acoustic data.
Speech recognition systems are being used in several areas today to transcribe speech into text. The success of this technology in simplifying man-machine interaction is stimulating the use of the technology in several applications such as transcribing dictation, voicemail, home banking, directory assistance, etc. Though it is possible to design a generic speech recognition system and then use it in a variety of different applications, it is generally the case that if the system is tailored to the particular application being addressed, it is possible to obtain much better performance than the generic system.
Most speech recognition systems consist of two components: an acoustic model that models the characteristics of speech, and a language model that models the characteristics of the particular spoken language. The parameters of both these models are generally estimated from training data from the application domain of interest.
In order to train the acoustic models, it is necessary to have acoustic data along with the corresponding transcription. For training the language model, it is necessary to have the transcriptions that represent typical sentences in the selected application domain.
Hence, with the goal of optimizing performance in the selected application domain, a large amount of training data is often collected from that domain. However, it is also often the case that only the acoustic data can be collected in this manner, and the data has to be transcribed later, possibly by a human listener. Further, where spontaneous speech is concerned, it is relatively difficult to obtain verbatim transcriptions because of the numerous mispronunciations, inconsistencies and errors in the speech, and the human transcription error rate is fairly high. This in turn affects the estimation of the acoustic model parameters and, as is known, transcriptions with a significant number of errors often lead to poorly estimated or corrupted acoustic models.
Accordingly, it would be highly advantageous to provide apparatus and methods to identify regions of the transcriptions that have errors. Then, it would be possible to post-process these regions, either automatically or by a human or a combination thereof, in order to refine or correct the transcriptions in this region alone.
Further, in most speech recognition systems, it is generally the case that words in the vocabulary are represented as a sequence of fundamental acoustic units such as phones (referred to as the baseform of the word). Also, it is often the case that the baseform representation of a word does not correspond to the manner in which the word is actually uttered. Accordingly, it would also be highly advantageous to provide apparatus and methods to identify such mismatches in the baseform representation and actual acoustic pronunciation of words.
Further, it is often the case that in spontaneous speech, due to co-articulation effects, the concatenation of the baseform representation of a group of words may not be an appropriate model, and it may be necessary to construct a specific baseform for the co-articulated word. For example, the phrase “going to” may commonly be pronounced “gonna.” Accordingly, it would also be highly advantageous to provide apparatus and methods for such a co-articulated word to be detected and allow for a specific baseform to be made for it (e.g., a baseform for “gonna”) rather than merely concatenating the baseforms of the non-co-articulated phrase (e.g., concatenating baseforms of words “going” and “to”).
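By way of illustration only, the difference between a concatenated baseform and a phrase-specific baseform may be sketched as follows. The lexicon, the phone symbols, and the names `LEXICON`, `PHRASE_BASEFORMS`, `concatenated_baseform` and `baseform_for` are hypothetical and are not taken from the description above.

```python
# Hypothetical toy lexicon: word -> baseform (sequence of phones).
# The phone symbols are illustrative assumptions.
LEXICON = {
    "going": ["G", "OW", "IH", "NG"],
    "to": ["T", "UW"],
}

# Hypothetical table of baseforms built specifically for co-articulated phrases.
PHRASE_BASEFORMS = {
    ("going", "to"): ["G", "AH", "N", "AH"],  # "gonna"
}


def concatenated_baseform(words):
    """Join the per-word baseforms, ignoring co-articulation."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word])
    return phones


def baseform_for(words):
    """Prefer a phrase-specific baseform when one exists, else concatenate."""
    key = tuple(words)
    if key in PHRASE_BASEFORMS:
        return PHRASE_BASEFORMS[key]
    return concatenated_baseform(words)
```

In this sketch, `baseform_for(["going", "to"])` yields the phrase-specific phone sequence, whereas `concatenated_baseform(["going", "to"])` yields the six-phone sequence that may match the spoken "gonna" poorly.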
Lastly, there may also be inconsistencies between a transcription and input acoustic data due to modeling inaccuracies in the speech recognizer. Accordingly, it would be highly advantageous to provide apparatus and methods for erroneous segments in the transcription to be identified, so that they can be corrected by other means.
The present invention provides apparatus and methods to identify mismatches between some given acoustic data and its supposedly verbatim transcription. It is to be appreciated that the transcription may be, for example, at the word level or phone level and the mismatches may arise due to, for example, inaccuracies in the word level transcription, poor baseform representation of words, background noise at the time the acoustic data was provided, or co-articulation effects in common phrases. The present invention includes starting with a transcription having errors and computing a Viterbi alignment of the acoustic data against the transcription. The words in the transcription are assumed to be expressed in terms of certain basic units or classes such as phones, syllables, words or phrases and the acoustic model is essentially composed of models for each of these different units. The process of Viterbi aligning the data against the transcription and computing probability scores serves to assign a certain probability to each instance of a unit class in the training data. Subsequently, for each class, a histogram of the scores of that class is computed from all instances of that class in the training data. Accordingly, the present invention advantageously identifies those instances of the class that correspond to the lowest scores in the histogram as “problem regions” where there is a mismatch between the acoustic data and the corresponding transcription. Subsequently, the transcription or baseform can be refined for these regions, either automatically or manually by a human listener, as will be explained. It is to be appreciated that the invention is applicable to identification of mismatches between a transcription and acoustic data associated with a training session or a real-time decoding session.
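The alignment and scoring step may be illustrated by the following minimal sketch. It is a simplification under stated assumptions: each unit is modeled by a single state, transition probabilities are omitted, and the per-frame log-likelihoods are supplied by the caller. The function name `forced_align`, its input layout, and the use of an average per-frame log-likelihood as the instance score are assumptions made for illustration.

```python
import math


def forced_align(frame_loglik, units):
    """Viterbi-align frames to a fixed left-to-right unit sequence.

    frame_loglik: list with one dict per frame, mapping unit label to the
                  log-likelihood of that frame under the unit's model.
    units:        unit labels in transcription order.
    Returns one (unit, start, end_exclusive, avg_loglik) tuple per unit.
    Assumes one state per unit and no transition probabilities.
    """
    T, N = len(frame_loglik), len(units)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per unit")
    NEG = -math.inf
    # best[t][n]: best path score with frame t assigned to unit n.
    best = [[NEG] * N for _ in range(T)]
    # back[t][n]: 1 if frame t-1 was in unit n-1, 0 if it was in unit n.
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_loglik[0][units[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            ll = frame_loglik[t][units[n]]
            stay = best[t - 1][n]
            advance = best[t - 1][n - 1] if n > 0 else NEG
            if advance > stay:
                best[t][n], back[t][n] = advance + ll, 1
            else:
                best[t][n], back[t][n] = stay + ll, 0
    # Backtrace from the last frame, which must lie in the last unit.
    frame_unit = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        frame_unit[t] = n
        if back[t][n] == 1:
            n -= 1
    # Collapse the frame labels into segments and score each instance.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or frame_unit[t] != frame_unit[start]:
            idx = frame_unit[start]
            total = sum(frame_loglik[k][units[idx]] for k in range(start, t))
            segments.append((units[idx], start, t, total / (t - start)))
            start = t
    return segments
```

The per-instance scores returned by such an alignment are what would then be pooled, class by class, to form the histograms described above.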
In one aspect of the invention, a method for identifying mismatches between acoustic data and a corresponding transcription, the transcription being expressed in terms of basic units, comprises the steps of: aligning the acoustic data with the corresponding transcription; computing a probability score for each instance of a basic unit in the acoustic data with respect to the transcription; generating a distribution for each basic unit; tagging, as mismatches, instances of a basic unit corresponding to a particular range of scores in the distribution for each basic unit based on a threshold value; and correcting the mismatches.
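The distribution and tagging steps recited above may be sketched as follows. This is one possible reading only: the threshold is assumed here to be a fraction of the lowest-scoring instances within each unit's own distribution, and the name `tag_mismatches`, the `fraction` parameter, and the tuple layout are hypothetical.

```python
from collections import defaultdict


def tag_mismatches(instances, fraction=0.1):
    """Tag the lowest-scoring instances of each basic unit.

    instances: iterable of (unit, instance_id, score) tuples, where score
               is the alignment score of that instance of the unit.
    fraction:  assumed threshold, the share of each unit's instances
               (lowest scores first) to tag as mismatches.
    Returns the tagged (unit, instance_id, score) tuples.
    """
    # Build a separate score distribution for every basic unit.
    by_unit = defaultdict(list)
    for unit, inst_id, score in instances:
        by_unit[unit].append((score, inst_id))

    tagged = []
    for unit, scored in by_unit.items():
        scored.sort(key=lambda pair: pair[0])  # lowest scores first
        cutoff = int(len(scored) * fraction)   # number of instances to tag
        for score, inst_id in scored[:cutoff]:
            tagged.append((unit, inst_id, score))
    return tagged
```

Keeping a separate distribution per unit means that a unit whose scores are uniformly low is not tagged wholesale; only instances that are low relative to other instances of the same unit are selected for correction.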
In another aspect of the invention, computer-based apparatus for identifying mismatches between acoustic data and a corresponding transcription associated with a speech recognition engine, the transcription being expressed in terms of phonetic units, comprises: a processor, operatively coupled to the speech recognition engine, for: aligning the acoustic data with the corresponding transcription; computing a probability score for each instance of a basic unit in the acoustic data with respect to the transcription; generating a distribution for each basic unit; tagging, as mismatches, instances of a basic unit corresponding to a particular range of scores in the distribution for each basic unit based on a threshold value; and correcting the mismatches.