1. Field of the Invention
The present invention relates to a pattern adapting apparatus for conducting pattern adaptation processing in pattern recognition processing and, more particularly, to a pattern adapting apparatus which performs adaptation to a speaker in a voice recognition system using standard patterns and in a voice recognition system using a mixed continuous distribution model type HMM.
2. Description of the Related Art
In recent years, mechanical voice pattern recognition has been widely studied and various techniques have been proposed. Among them, the representative and widely employed techniques are DP (dynamic programming) matching and methods using the Hidden Markov Model (HMM). Among voice recognition systems using methods such as DP matching and HMM, speaker-independent voice recognition systems, which aim at recognizing anybody's voice, have been actively studied and developed.
With reference to FIG. 3, a voice recognition system will be described in the following with respect to a voice recognition method using HMM.
The voice of a speaker is first input to an input pattern forming unit 101, where it is subjected to processing such as AD conversion and voice analysis. The processed voice is then converted into a time series of feature vectors, one for each unit of a predetermined time length called a frame.
The time series of feature vectors is here referred to as an input pattern. The frame length ordinarily ranges from 10 ms to 100 ms. Feature vectors are obtained by extracting feature quantities of voice spectra and generally have 10 to 100 dimensions.
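The conversion of sampled voice into an input pattern can be sketched as follows. This is only an illustrative sketch: the function names `split_into_frames`, `log_energy` and `input_pattern` are hypothetical, and the one-dimensional log-energy feature is a stand-in assumed here for the spectral feature quantities, which the text does not specify.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Cut a sample sequence into fixed-length frames, advancing by hop samples."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

def log_energy(frame):
    """Toy feature (an assumption): log of the frame's mean squared amplitude."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return math.log(mean_sq + 1e-10)  # small floor avoids log(0) on silence

def input_pattern(samples, frame_len, hop):
    """Return the time series of (here one-dimensional) feature vectors."""
    return [[log_energy(f)] for f in split_into_frames(samples, frame_len, hop)]
```

A practical system would replace `log_energy` with a spectral analysis yielding the 10 to 100 dimensions mentioned above.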
A standard pattern storing unit 102 stores HMMs. An HMM is one of the models of voice information sources, and its parameters can be learned by using the voice of a speaker. HMM will be described in more detail in connection with a recognizing unit 103. An HMM is in general prepared for each recognition unit. Here, a phoneme is taken as an example of a recognition unit. Speaker-independent voice recognition systems employ, as the HMMs of the standard pattern storing unit 102, speaker-independent HMMs learned in advance using the voices of a large number of speakers.
It is assumed, for example, that 1,000 words are recognition targets. In other words, one correct word is to be selected from among 1,000 candidate recognition targets. When a word is to be recognized, the HMMs of its phonemes are linked together to form the HMM of a candidate recognition target word. With 1,000 words to be recognized, word HMMs for 1,000 words are formed.
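The linking of phoneme HMMs into a word HMM can be sketched as follows. The function name `build_word_hmm`, the representation of a state as a (phoneme, index) pair, and the phoneme symbols in the usage line are assumptions made for illustration; the text does not prescribe a data structure.

```python
def build_word_hmm(phonemes, states_per_phoneme):
    """Concatenate the state sequences of the phoneme HMMs that spell a word.

    phonemes: phoneme labels of the word, in order.
    states_per_phoneme: number of states in each phoneme HMM (hypothetical table).
    Returns the word HMM's states as (phoneme, state index) pairs.
    """
    word_states = []
    for p in phonemes:
        word_states.extend((p, k) for k in range(states_per_phoneme[p]))
    return word_states

# Example: a two-phoneme word built from a 3-state and a 2-state phoneme HMM.
states = build_word_hmm(["k", "a"], {"k": 3, "a": 2})
```

Repeating this for every entry of a 1,000-word vocabulary would yield the 1,000 word HMMs described above.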
A recognizing unit 103 conducts recognition of an input pattern by using the word HMMs. An HMM is a model of voice information sources which introduces statistical concepts into the description of standard patterns in order to cope with variations in voice patterns. A detailed description of HMM is found in the literature, "Voice Recognition by Stochastic Model" (Seiichi Nakagawa, Institute of Electronics, Information and Communication Engineers of Japan (IEICE), 1988), pp. 40-46, 55-60, 69-74 (hereinafter referred to as Literature 1).
The HMM of each phoneme is made up of 1 to 10 states and the state transitions between them. In general, a starting state and an ending state are defined. At every unit time, a symbol is output at each state and a state transition takes place. The voice of each phoneme is represented as a time series of symbols output from the HMM during the state transitions from the starting state to the ending state. For each state, a symbol occurrence probability is defined, and for each transition between states, a transition probability is defined. Transition probability parameters represent variations of voice patterns in time. Output probability parameters are those regarding the symbol occurrence probabilities at each state and represent variations of voice patterns in tone of voice. With the probability of the starting state fixed to a certain value, multiplying the occurrence probability and the transition probability at each state transition yields the probability that a given sound is generated from the model. Conversely, when a sound is observed, its occurrence probability can be calculated on the assumption that the sound was generated from a certain HMM. In voice recognition by HMM, an HMM is prepared for each candidate recognition target. When a sound is applied, its occurrence probability is obtained for each HMM, the HMM giving the highest probability is determined to be the generation source, and the candidate recognition target corresponding to that HMM is taken as the recognition result.
Output probability parameters may be expressed by a discrete probability distribution or by a continuous probability distribution; the continuous probability distribution expression is taken here as an example. In the continuous probability distribution expression, a mixed continuous distribution, that is, a distribution obtained by adding a plurality of Gaussian distributions with weights, is used. Parameters such as the output probability parameters, the transition probability parameters and the weights of the plurality of Gaussian distributions are learned in advance by the Baum-Welch algorithm, using learning voice corresponding to each model. The Baum-Welch algorithm is detailed in Literature 1. In the following example, the output probability is expressed by a mixed continuous probability distribution.
Processing to be conducted at the time of word recognition will be explained by the following formulas. An input pattern X expressed as a time series of feature vectors is represented as:
$$X = x_1, x_2, \ldots, x_t, \ldots, x_T \qquad (1)$$
Here, T represents the total number of frames of the input pattern. Candidate recognition target words are denoted as $W_1, W_2, \ldots, W_N$, where N represents the number of candidate recognition target words. Matching between the word HMM for each word $W_n$ and the input pattern X is carried out using the following procedure. In the following, the suffix n will be omitted unless it is necessary.
First, with respect to a word HMM, the transition probability from a state j to a state i is represented as $a_{ji}$, the mixture weight of an output probability distribution as $\lambda_{im}$, the mean vector of each element Gaussian distribution (referred to as a frame distribution) as $\mu_{im}$, and the covariance matrix as $\Sigma_{im}$. Here, t represents an input time, i and j represent states of the HMM, and m represents a mixture element number. Then, the following recurrence formulas for the forward probability $\alpha(i,t)$ are computed.
$$\alpha(i,0) = \pi_i, \quad i = 1, \ldots, I \qquad (2)$$
$$\alpha(i,t) = \sum_{j} \alpha(j,t-1)\, a_{ji}\, b_i(x_t), \quad i = 1, \ldots, I;\ t = 1, \ldots, T \qquad (3)$$
Here, $\pi_i$ represents the probability of the initial state being i, and $b_i(x)$ and $N(x;\mu_{im},\Sigma_{im})$ are defined by the following expressions, in which n denotes the number of dimensions of the feature vector.
$$b_i(x) = \sum_{m} \lambda_{im}\, N(x;\mu_{im},\Sigma_{im}) \qquad (4)$$
$$N(x;\mu_{im},\Sigma_{im}) = (2\pi)^{-n/2}\, |\Sigma_{im}|^{-1/2} \exp\left(-\frac{(\mu_{im}-x)^{\mathsf{T}}\, \Sigma_{im}^{-1}\, (\mu_{im}-x)}{2}\right) \qquad (5)$$
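The mixture output probability of expressions (4) and (5) can be sketched in code as follows. This sketch assumes diagonal covariance matrices, a simplification introduced here to keep the example short and not stated in the text; the function names `gaussian_diag` and `output_probability` are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """Gaussian density N(x; mean, diag(var)), i.e. expression (5) with a
    diagonal covariance matrix (an assumption of this sketch)."""
    n = len(x)  # number of dimensions of the feature vector
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((m - xi) ** 2 / v for xi, m, v in zip(x, mean, var))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis))

def output_probability(x, weights, means, variances):
    """b_i(x) of expression (4): weighted sum of element Gaussian densities."""
    return sum(
        w * gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )
```

With full covariance matrices, the per-dimension sums would be replaced by a determinant and a matrix inverse as written in expression (5).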
The likelihood of the input pattern for the word $W_n$ is obtained by the following expression.
$$P^n(X) = \alpha(I,T) \qquad (6)$$
Here, I represents the final state. This processing is executed for each word model. The recognition result word $W_{\hat{n}}$ for the input pattern X is given by the following expression.
$$\hat{n} = \arg\max_{n} P^n(X) \qquad (7)$$
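The matching procedure of expressions (2), (3), (6) and (7) can be sketched as follows. The function names `forward_likelihood` and `recognize`, the tuple layout of a word model, and the table-based output probabilities in the usage lines are assumptions made for illustration. The last state in the list is taken to be the final state I.

```python
def forward_likelihood(x_seq, pi, a, b):
    """Return alpha(I, T) for one word HMM.

    pi[i]   : probability of the initial state being i
    a[j][i] : transition probability from state j to state i
    b(i, x) : output probability of feature vector x at state i
    """
    num_states = len(pi)
    alpha = list(pi)  # expression (2): alpha(i, 0) = pi_i
    for x in x_seq:   # expression (3), applied for t = 1, ..., T
        alpha = [
            sum(alpha[j] * a[j][i] for j in range(num_states)) * b(i, x)
            for i in range(num_states)
        ]
    return alpha[-1]  # expression (6): likelihood at the final state

def recognize(x_seq, word_models):
    """Expression (7): pick the word whose HMM gives the highest likelihood."""
    scores = {w: forward_likelihood(x_seq, *m) for w, m in word_models.items()}
    return max(scores, key=scores.get)

# Toy usage with two 2-state models and symbol-indexed output tables.
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
table_yes = [[0.9, 0.1], [0.2, 0.8]]
table_no = [[0.1, 0.9], [0.8, 0.2]]
models = {
    "yes": (pi, a, lambda i, x: table_yes[i][x]),
    "no": (pi, a, lambda i, x: table_no[i][x]),
}
best = recognize([0, 1], models)
```

Because the products of probabilities shrink quickly as T grows, practical implementations usually carry out this recursion with logarithms or scaling; that refinement is omitted here.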
The obtained recognition result word is sent to a recognition result outputting unit 104.
The recognition result outputting unit 104 executes such processing as outputting a recognition result on a screen and sending a control instruction corresponding to a recognition result to other units.
Unlike a speaker-dependent system targeting a specific user, a speaker-independent recognition system has the advantage that the user need not register his/her speech in advance. However, it has the drawback that, for almost every speaker, recognition performance is lower than that of a speaker-dependent system. Another shortcoming is that there exist speakers (peculiar speakers) for whom recognition performance is especially low. In order to solve these problems, studies are under way on applying the speaker adaptation techniques conventionally used in speaker-dependent systems to speaker-independent systems as well.
Speaker adaptation is a method of adapting a recognition system to a new user (unknown speaker) by using a smaller amount of adaptation data than is used for learning. In speaker adaptation, the standard pattern of the standard pattern storing unit is modified to improve the recognition performance with respect to unknown speakers (indicated by the dotted line in FIG. 3). Speaker adaptation is explained in detail in the literature "Speaker Adaptation Technique in Voice Recognition" (Sadaki Furui, The Institute of Television Engineers of Japan, 1989, vol. 43, No. 9, pp. 929-934).
Speaker adaptation is roughly categorized into two methods. One is speaker adaptation with teacher and the other is speaker adaptation without teacher. Here, teacher denotes a phonemic notation sequence indicating the contents of the applied sounds. Adaptation with teacher is employed when the phonemic notation sequence for the input sounds is known, and it requires that the vocabulary to be uttered be specified to the unknown speaker in advance at the time of adaptation.
On the other hand, adaptation without teacher is employed when the phonemic notation sequence for the input sounds is unknown, and it places no constraints on the contents of the sounds applied by the unknown speaker. In other words, it is unnecessary to indicate the contents of the sounds to the unknown speaker. It is therefore possible to conduct adaptation by using the very voice being recognized, without the unknown speaker noticing it. In general, adaptation without teacher yields lower recognition performance after adaptation than adaptation with teacher. Adaptation with teacher is therefore more commonly used at present.
As described in the foregoing, speaker adaptation techniques have adopted the approach of converting an initial standard pattern into a standard pattern for each speaker. The number of parameters of the speaker adaptation model used in this standard pattern conversion has conventionally been fixed regardless of the amount of data.
Conventional speaker adaptation techniques therefore have two drawbacks: in a model having a large number of parameters, estimation is unstable when only a small amount of data is available, and in a model having a small number of parameters, recognition performance fails to improve even when a large amount of data is available.