1.1 Technical Field
The present invention relates to speech recognition systems, and more particularly, to a computerized method and apparatus for automatically generating from a first speech recognizer a second speech recognizer which can be adapted to a specific domain.
1.2 Description of the Related Art
To achieve the necessary acoustic resolution for different speakers, domains, or other circumstances, today's general purpose large vocabulary continuous speech recognizers have to be adapted to these different situations. To do so, the speech recognizer must determine a huge number of different parameters, each of which can control the behavior of the speech recognizer. For instance, Hidden Markov Model (HMM) based speech recognizers usually employ several thousand HMM states and several tens of thousands of multidimensional elementary probability density functions (PDFs) to capture the many variations of naturally spoken human speech. Therefore, the training of a highly accurate speech recognizer requires the reliable estimation of several million parameters. This is not only a time-consuming process, but also requires a substantial amount of training data.
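By way of illustration only, the order of magnitude stated above can be checked with a short calculation. The sketch below assumes diagonal-covariance Gaussian PDFs, a 39-dimensional feature vector, and two transition probabilities per HMM state. None of these choices, nor the function name, is taken from the present description; they are hypothetical values selected to match "several thousand" states and "several tens of thousands" of PDFs.

```python
def gaussian_hmm_parameter_count(num_states, num_gaussians, feature_dim):
    """Count parameters of a diagonal-covariance Gaussian mixture HMM system.

    Assumptions (illustrative, not from the source text):
    - each Gaussian has a mean vector and a diagonal covariance
      (feature_dim values each) plus one mixture weight;
    - each HMM state has two transition probabilities
      (self-loop and forward transition).
    """
    per_gaussian = 2 * feature_dim + 1
    transitions = 2 * num_states
    return num_gaussians * per_gaussian + transitions


# Hypothetical sizes: 3,000 states, 40,000 Gaussians, 39-dimensional features.
total = gaussian_hmm_parameter_count(
    num_states=3000, num_gaussians=40000, feature_dim=39
)
print(total)  # 3166000
```

Under these assumed sizes the total is slightly above three million parameters, which is consistent with the estimate of several million given above.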
It is well known that the recognition accuracy of a speech recognizer decreases significantly if the phonetic contexts and, in consequence of the changing phonetic contexts, the pronunciations observed in the training data do not properly match those of the intended application. This is especially true when dealing with dialects or non-native speakers, but can also be observed when switching to other domains within the same language or to other dialects. Commercially available speech recognition products try to solve this problem by requiring each individual end user to enroll in the system, so that the speech recognizer can perform a speaker-dependent re-estimation of acoustic model parameters.
Large vocabulary continuous speech recognizers capture the many variations of speech sounds by modelling context-dependent sub-word units, such as phones or triphones, as elementary HMMs. Statistical parameters of such models are usually estimated from several hundred hours of labelled training data. While this allows high recognition accuracy if the training data sufficiently represents the task domain, recognition accuracy decreases significantly if phonetic contexts or acoustic model parameters are poorly estimated due to a mismatch between the training data and the intended application.
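The mismatch described above can be made concrete with a minimal sketch that expands phone sequences into triphones and measures how many triphone contexts of a target domain were observed in the training data. The "left-center+right" label format, the "sil" boundary symbol, the function names, and the toy phone sequences are all assumptions made for this example and are not part of the present description.

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent units.

    Each phone is labelled "left-center+right"; utterance boundaries are
    padded with an assumed silence symbol "sil".
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def context_coverage(training_utterances, domain_utterances):
    """Fraction of domain triphone tokens whose context occurs in training."""
    seen = {t for utt in training_utterances for t in to_triphones(utt)}
    domain = [t for utt in domain_utterances for t in to_triphones(utt)]
    if not domain:
        return 1.0
    return sum(t in seen for t in domain) / len(domain)


# Toy data (hypothetical): two training utterances, one domain utterance.
training = [["k", "ae", "t"], ["b", "ae", "t"]]
domain = [["k", "ae", "b"]]
print(to_triphones(domain[0]))             # ['sil-k+ae', 'k-ae+b', 'ae-b+sil']
print(context_coverage(training, domain))  # one of three contexts was seen
```

In this toy case only one of the three domain triphones occurs in the training data, so the models for the remaining contexts would be missing or poorly estimated.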
Since the collection of a large amount of training data and the subsequent training of a speech recognizer are both expensive and time-consuming, the adaptation of a (general purpose) speech recognizer to a specific domain is a promising method to reduce development costs and time to market. Conventional adaptation methods, however, either simply provide a modification of the acoustic model parameters or, to a lesser extent, select a domain specific subset from the phonetic context inventory of the general recognizer.
Facing both the industry's growing interest in speech recognizers for specific domains including specialized application tasks, language dialects, telephony services, or the like, and the important role of speech as an input medium in pervasive computing, there is a definite need for improved adaptation technologies for generating new speech recognizers. The industry is searching for technologies supporting the rapid development of new data files for speaker (in-)dependent, specialized speech recognizers that have improved initial recognition accuracy and that require reduced customization effort, whether for individual end users or industrial software vendors.