1. Field of the Invention
The invention is directed to an arrangement and a method for the recognition of a predetermined vocabulary in spoken language by a computer.
2. Description of the Prior Art
A method and an arrangement for the recognition of spoken language are known from “Sprachunterricht—Wie funktioniert die computerbasierte Spracherkennung?”, Haberland et al., c't—Magazin für Computertechnik—, Vol. 5, 1998, pp 120–125. In the recognition of spoken language, a signal analysis and a global search that accesses an acoustic model and a linguistic model of the language to be recognized are implemented until a recognized word sequence is obtained from the digitalized voice signal. The acoustic model is based on a phoneme inventory realized with the assistance of hidden Markov models (HMMs). With the assistance of the acoustic model, the most probable word sequence is determined during the global search for the feature vectors that proceeded from the signal analysis, and this is output as the recognized word sequence. The words to be recognized are stored in a pronunciation lexicon together with a phonetic transcription. The relationship is explained in depth in the aforementioned Haberland et al. article.
As an aid to understanding the subsequent comments, the terms that are employed shall be briefly discussed here.
As one phase of the computer-based speech recognition, the signal analysis includes a Fourier transformation of the digitalized voice signal and a feature extraction following thereupon. It proceeds from the aforementioned Haberland et al. article that the signal analysis is carried out every ten milliseconds. From overlapping time segments with a respective duration of, for example, 25 milliseconds, approximately 30 features are determined on the basis of the signal analysis and combined to form a feature vector. The components of the feature vector describe the spectral energy distribution of the appertaining signal excerpt. In order to arrive at this energy distribution, a Fourier transformation is implemented on every signal excerpt (25 ms time excerpt). The components of the feature vector result from the representation of the signal in the frequency domain. After the signal analysis, the digitalized voice signal is thus present in the form of feature vectors.
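The framing and spectral-energy computation described above can be sketched as follows. This is a minimal illustration, not the method of the cited article: the function name, the 8 kHz sample rate, and the coarse energy-band reduction to 30 components are assumptions for the example.

```python
import numpy as np

def extract_features(signal, sample_rate=8000, frame_ms=25, step_ms=10, n_features=30):
    """Slide a 25 ms window over the signal every 10 ms and compute, per frame,
    approximately 30 spectral-energy features combined into one feature vector.
    Illustrative sketch only; the band reduction is an assumed simplification."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 200 samples at 8 kHz
    step = int(sample_rate * step_ms / 1000)         # 10 ms -> 80 samples at 8 kHz
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # spectral energy distribution
        # collapse the spectrum into n_features coarse energy bands
        bands = np.array_split(spectrum, n_features)
        vectors.append(np.array([b.sum() for b in bands]))
    return np.array(vectors)
```

For one second of signal at 8 kHz this yields 98 overlapping frames, each represented by a 30-component feature vector.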
These feature vectors are supplied to the global search, a further phase of the speech recognition. As already mentioned, the global search makes use of the acoustic model and, potentially, of the linguistic model in order to map the sequence of feature vectors onto individual parts of the language (vocabulary) present as a model. A language is composed of a given plurality of sounds, referred to as phonemes, whose totality is referred to as the phoneme inventory. The vocabulary is modelled by phoneme sequences and stored in a pronunciation lexicon. Each phoneme is modelled by at least one HMM. A plurality of HMMs yield a stochastic automaton that comprises states and state transitions. The temporal course of the occurrence of specific feature vectors (even within a phoneme) can be modelled with HMMs. A corresponding phoneme model thereby comprises a given plurality of states that are arranged in linear succession. A state of an HMM represents a part of a phoneme (for example an excerpt of 10 ms length). Each state is linked to an emission probability for the feature vectors, which, in particular, is Gaussian-distributed, and to transition probabilities for the possible transitions. The emission distribution allocates to each feature vector a probability with which that feature vector is observed in the appertaining state. The possible transitions are a direct transition from one state into the next state, a repetition of the state and a skipping of the state.
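A single HMM state as described above, with a Gaussian emission density and the three transition types (repetition, next state, skip), can be sketched as follows. The class name, the diagonal covariance, and the particular transition probabilities are illustrative assumptions, not values from the prior art.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class HMMState:
    """One state of a left-to-right phoneme HMM (illustrative sketch)."""
    mean: np.ndarray                 # mean of the Gaussian emission density
    var: np.ndarray                  # diagonal covariance (assumed for simplicity)
    p_loop: float = 0.5              # repetition of the state
    p_next: float = 0.4              # direct transition into the next state
    p_skip: float = 0.1              # skipping of the next state

    def log_emission(self, x):
        """Log probability of observing feature vector x in this state."""
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (x - self.mean) ** 2 / self.var)

# a phoneme model is a linear succession of such states (here: 3 states,
# 30-dimensional feature vectors, as an assumed example)
phoneme = [HMMState(mean=np.zeros(30), var=np.ones(30)) for _ in range(3)]
```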
A joining of the HMM states with the appertaining transitions over time is referred to as a trellis. The principle of dynamic programming is employed in order to determine the acoustic probability of a word: the path through the trellis is sought that exhibits the fewest errors or, respectively, that is defined by the highest probability for the word to be recognized.
The result of the global search is the output of a recognized word sequence that derives from taking the acoustic model (phoneme inventory) into consideration for each individual word and the language model for the sequence of words.
The article “Speaker Adaptation Based on MAP Estimation of HMM Parameters,” Lee et al., Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, pp II-588 through II-561 discloses a method for speaker adaptation based on a MAP estimate (MAP=maximum a posteriori) of HMM parameters.
According to this Lee et al. article, it is recognized that a speaker-dependent system for speech recognition normally supplies better results than a speaker-independent system, insofar as adequate training data are available that enable a modelling of the speaker-dependent system. However, the speaker-independent system achieves the better results as soon as the set of speaker-specific training data is limited. One possibility for enhancing the performance of both systems, i.e. of the speaker-dependent as well as the speaker-independent system for speech recognition, is to employ previously stored datasets of a plurality of speakers such that a small set of training data also suffices for modelling a new speaker with adequate quality. Such a training method is called speaker adaptation. In the Lee et al. article, the speaker adaptation is implemented in particular by a MAP estimate of the hidden Markov model parameters.
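The character of such a MAP estimate can be illustrated for the simplest case, the re-estimation of a Gaussian emission mean with the variance held fixed. The function name, the prior-weight parameter tau, and the restriction to the mean are assumptions for this sketch and do not reproduce the full procedure of the Lee et al. article.

```python
import numpy as np

def map_update_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP re-estimate of a Gaussian emission mean (illustrative sketch):
    blend the speaker-independent prior mean with the new speaker's
    adaptation frames; tau weights the prior relative to the data."""
    x = np.asarray(adaptation_frames)
    n = len(x)
    return (tau * np.asarray(prior_mean) + x.sum(axis=0)) / (tau + n)
```

With only a few adaptation frames the estimate stays close to the speaker-independent prior; as more speaker-specific data arrive, it approaches the speaker's sample mean, which mirrors the behaviour described above.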
Results of a method for recognizing spoken language generally deteriorate as soon as characteristic features of the spoken language deviate from characteristic features of the training data. Examples of characteristic features are speaker qualities or acoustic features that influence the articulation of the phonemes in the form of slurring.
The approach disclosed in the Lee et al. article for speaker adaptation employs a “post-estimating” of the parameter values of the hidden Markov models, whereby this processing is implemented “offline”, i.e. not at the run time of the method for speech recognition.
J. Takami et al., “Successive State Splitting Algorithm for Efficient Allophone Modeling”, ICASSP 1992, March 1992, pages 573 through 576, San Francisco, USA, discloses a method for recognizing a predetermined vocabulary in spoken language wherein states of a hidden Markov model are split. The probability density function of the respective state is also split for this purpose.
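Splitting a state together with its probability density function can be sketched as follows for a Gaussian density. Perturbing the mean along the standard deviation is a common heuristic assumed here for illustration; it is not necessarily the criterion used by Takami et al.

```python
import numpy as np

def split_state(mean, var, eps=0.2):
    """Split one Gaussian state into two successor states by perturbing the
    mean along the standard deviation (illustrative heuristic, eps assumed)."""
    mean = np.asarray(mean)
    var = np.asarray(var)
    shift = eps * np.sqrt(var)
    # both successor states initially keep the parent's variance
    return (mean - shift, var.copy()), (mean + shift, var.copy())
```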