Depending on the application in which a speech recognition system is employed, it can be quite useful to leverage a speech recognition system originally trained on a first spoken language when building a speech recognition system for recognizing utterances in a second spoken language. However, several issues must be considered before such an operation can be implemented, owing to differences between the first language and the second language. From an acoustic or phonetic point of view, the differences between the two languages may be characterized as falling into one of three cases.
In a first case, some sounds in the two languages may be similar. This is a fortunate case, since the speech recognition system originally trained in the first language would likely be able to recognize certain words uttered in the second language, possibly with some additional training, due to their acoustic or phonetic similarity.
In a second case, some sounds in the first or “base” language may not occur in the second or “new” language. In this case, such sounds can simply be ignored when building the new language speech recognition system.
In a third case, some sounds may not be present in the base language but may exist in the new language. This is the most difficult case. A known technique for handling it involves building the new language speech recognition system via “bootstrapping” from a well-trained base language speech recognition system, i.e., each new language sound is initialized from the closest matching base language sound.
Bootstrapping is a very common technique used to generate the initial phone models for a new language recognition system, see J. Kohler, “Multi-lingual phoneme recognition exploiting acoustic-phonetic similarities of sounds,” ICSLP, 2195-2198, 1996; and O. Anderson, P. Dalsgaard and W. Barry, “On the use of data-driven clustering technique for identification of poly- and mono-phonemes for four European languages,” ICASSP, 1/121-1/124, 1994, the disclosures of which are incorporated by reference herein. Models for all the new sounds are built using the bootstrapping procedure, and the speech data in the new language is then aligned to these models. However, since the new sounds of the new language do not have models built from those same sounds, the alignment is not proper. Moreover, when building context dependent models, it is not practical to cover the sounds in all contexts by having speakers utter each particular sound in every context while collecting the data. Also, the new language data cannot be labeled until there exists a system capable of aligning that data. The use of a base language recognition system for new language speech is also described in MA Chi Yuen and Pascale Fung, “Adapting English Phoneme Models for Chinese Speech Recognition,” ISCSLP, 80-82, December 1998; and T. A. Faruquie, C. Neti, N. Rajput, L. V. Subramaniam, A. Verma, “Translingual Visual Speech Synthesis,” IEEE International Conference on Multimedia and Expo (ICME 2000), New York, USA, Jul. 30-Aug. 2, 2000, the disclosures of which are incorporated by reference herein.
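The bootstrapping step described above can be illustrated with a hypothetical, greatly simplified sketch. The phone names, two-dimensional feature vectors, and nearest-model rule below are illustrative assumptions only (a real system would use, e.g., Gaussian mixture models over high-dimensional cepstral features rather than single mean vectors):

```python
# Hypothetical sketch: seed each new-language phone with the closest
# base-language phone model, which then serves as its initial model.
import math

# Toy base-language phone models: each phone is represented here by a
# single mean feature vector (a stand-in for a trained acoustic model).
base_models = {
    "AA": [1.0, 0.2],
    "IY": [0.1, 1.5],
    "SH": [2.0, 2.0],
}

def closest_base_phone(new_phone_features, base_models):
    """Return the base phone whose model mean is nearest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(base_models, key=lambda p: dist(base_models[p], new_phone_features))

# Bootstrap: copy the closest base model as the initial model for each
# new-language sound; these initial models are later refined by
# iteratively re-aligning the new-language speech data against them.
new_phone_seed_features = {"ZH": [1.8, 2.1], "EU": [0.2, 1.4]}
initial_models = {
    phone: base_models[closest_base_phone(feats, base_models)]
    for phone, feats in new_phone_seed_features.items()
}
```

As the section notes, such seeded models are only a starting point: sounds with no close base language counterpart receive poor initial models, which is precisely the limitation discussed below.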
Consider this problem as one wherein there is an n-dimensional space containing clusters of points, each cluster corresponding to a particular context of the occurrence of a sound. Each point in this n-dimensional space represents a particular utterance of a sound. Since the n dimensions have been chosen such that the acoustic characteristics of the sounds are best represented, and such that sounds which represent the same phone/arc/context form a well-clustered, non-overlapping group, the space becomes the best representation of the sounds that occur in the language for which it has been trained. Bootstrapping for a new language then involves regrouping these points into clusters so that each cluster represents a phone/arc/context of the new language. Each cluster is then modeled by an appropriate function, and this model becomes the representation of the particular phone/arc/context for the new language.
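The regrouping and modeling steps just described can be sketched in a hypothetical, toy form. The nearest-centroid assignment rule, the two-dimensional points, and the use of a cluster mean as the "appropriate function" are illustrative assumptions, not the actual system:

```python
# Hypothetical sketch: regroup acoustic feature points (each point
# nominally representing ~10 ms of speech) around new-language
# phone/arc/context centroids, then model each cluster by its mean.

def regroup(points, centroids):
    """Assign each point to the nearest centroid; return clusters keyed by phone."""
    clusters = {phone: [] for phone in centroids}
    for p in points:
        phone = min(
            centroids,
            key=lambda c: sum((x - y) ** 2 for x, y in zip(centroids[c], p)),
        )
        clusters[phone].append(p)
    return clusters

def model(cluster):
    """Model a cluster by its mean vector (a crude stand-in for, e.g., a Gaussian)."""
    n = len(cluster)
    return [sum(xs) / n for xs in zip(*cluster)]

# Toy usage: two new-language phone centroids and a handful of points.
clusters = regroup(
    [[0.1, 0.2], [4.9, 5.1], [0.3, -0.1]],
    {"A": [0.0, 0.0], "B": [5.0, 5.0]},
)
models = {phone: model(pts) for phone, pts in clusters.items() if pts}
```

Note that this toy regrouping can only assign existing points; it cannot populate a region of the space that contains no points at all, which motivates the limitation discussed next.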
This technique, however, has its limitations. For example, there are regions in this space wherein there are no points; these represent the sounds that do not occur in the base language. If the new language has sounds that fall in such a region, bootstrapping cannot provide a model for the phone/arc/context of that region. But since these are only initial models, which improve over iterations, bootstrapping is still widely used to build the initial models. Moreover, in the speech context, the points may lie so close together in the space that the voids can be filled by forming a large cluster spanning these voids too, albeit through a very crude model.
Another possible solution would be to attempt to “speak up” or train those phones in all contexts, form features which represent points in the space, and thereby fill the voids. However, since each point in the space is typically representative of only 10 milliseconds of speech, speaking such isolated utterances is not possible. Thus, while the sounds can be uttered in the new language, it is not possible to label each of these 10 millisecond segments as one of the phones in a particular context.
Accordingly, there is a need for data labeling techniques which permit the generation of a speech recognition system for one spoken language (i.e., a new language) based on a speech recognition system originally generated for another spoken language (i.e., a base language), and which overcome these and other limitations associated with conventional techniques.