The invention is directed to a method for recognizing a keyword in spoken language.
A modelling of the complete spoken expression has hitherto always been required in the recognition of a keyword in spoken language. The person skilled in the art is familiar with essentially two methods:
M. Weintraub, “Keyword-spotting using SRI's DECIPHER large-vocabulary speech-recognition system”, Proc. IEEE ICASSP, Vol. 2, 1993, pp. 463-466, discloses a method for the recognition of a keyword that employs a speech recognition unit with a large vocabulary. The attempt is thereby made to completely recognize the spoken language. Subsequently, the recognized words are searched for keywords that may be present. This method is complex and error-prone because of the large vocabulary and because of the problems in modelling spontaneous vocal expressions and noises, i.e. parts of the voice signal that cannot be unambiguously allocated to a word.
Another method employs specific filler models (also: garbage models) in order to model parts of expressions that do not belong to the vocabulary of the keywords (what are referred to as OOV parts; OOV=out of vocabulary). Such a speech recognition unit is described in H. Boulard, B. D'hoore and J.-M. Boite, “Optimizing recognition and rejection performance in wordspotting systems”, Proc. IEEE ICASSP, Vol. 1, 1994, pp. 373-376, and comprises the keywords as well as a filler model or a plurality of filler models. One difficulty is to design or train a suitable filler model that contrasts well with the modelled keywords, i.e. exhibits high discrimination with respect to the keyword models.
Further, hidden Markov models (HMMs) are known from L. R. Rabiner, B. H. Juang, “An Introduction to Hidden Markov Models”, IEEE ASSP Magazine, 1986, pp. 4-16, or A. Hauenstein, “Optimierung von Algorithmen und Entwurf eines Prozessors für die automatische Spracherkennung”, Doctoral Dissertation at the Chair for Integrated Circuits of the Technical University, Munich, Jul. 19, 1993, pp. 13-35. It is also known from Rabiner et al. or Hauenstein to determine a best path with the Viterbi algorithm.
Hidden Markov models (HMMs) serve the purpose of describing discrete stochastic processes (also called Markov processes). In the field of speech recognition, hidden Markov models serve, among other things, for building up a word lexicon in which the word models constructed of sub-units are listed. Formally, a hidden Markov model is described by:
λ=(A, B, π)  (0-1)
with a square status transition matrix A that contains status transition probabilities Aij:
A={Aij} with i,j=1, . . . ,N  (0-2)
and an emission matrix B that comprises emission probabilities Bik:
B={Bik} with i=1, . . . ,N; k=1, . . . ,M  (0-3)
An N-dimensional vector π serves for initialization; it defines the occurrence probability of each of the N statuses at the point in time t=1:
π={πi}=P(s(1)=si)  (0-4)
In general,
P(s(t)=qt)  (0-5)
thereby indicates the probability that the Markov chain
s={s(1),s(2),s(3), . . . ,s(t), . . . }  (0-6)
is in status qt at time t. The Markov chain s thereby comprises a value range
s(t)∈{s1,s2, . . . ,sN}  (0-7)
whereby this value range contains a finite set of N statuses. The status in which the Markov process is at time t is called qt.
The emission probability Bik derives from the occurrence of a specific symbol σk in the status si as
Bik=P(σk|qt=si)  (0-8)
whereby a character stock Σ having the size M comprises symbols σk (with k=1 . . . M) according to
Σ={σ1,σ2, . . . ,σM}  (0-9)
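As an illustration only, the triple (A, B, π) described above can be held in a small data structure. The following Python sketch is not part of the cited literature; the class name `HiddenMarkovModel`, the method `validate` and the two-status example values are assumptions chosen for demonstration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HiddenMarkovModel:
    """Discrete HMM (A, B, pi) with N statuses and M symbols (illustrative)."""
    A: List[List[float]]   # N x N transition probabilities A[i][j]
    B: List[List[float]]   # N x M emission probabilities B[i][k]
    pi: List[float]        # N initial probabilities pi[i] = P(s(1) = s_i)

    def validate(self, tol: float = 1e-9) -> None:
        n = len(self.pi)
        if len(self.A) != n or any(len(row) != n for row in self.A):
            raise ValueError("A must be a square N x N matrix")
        if len(self.B) != n:
            raise ValueError("B must have one row per status")
        # Each row of A and B, and the vector pi, is a probability distribution.
        for name, rows in (("A", self.A), ("B", self.B), ("pi", [self.pi])):
            for row in rows:
                if abs(sum(row) - 1.0) > tol:
                    raise ValueError(f"a row of {name} does not sum to 1")


# Hypothetical two-status, three-symbol left-to-right model.
model = HiddenMarkovModel(
    A=[[0.6, 0.4], [0.0, 1.0]],
    B=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    pi=[1.0, 0.0],
)
model.validate()
```

The check that every row sums to one reflects the fact that A, B and π contain probabilities; the tolerance absorbs floating-point rounding.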
The status space of a hidden Markov model arises from the fact that every status of the hidden Markov model can have a predetermined set of successor statuses: itself, the next status, the next-but-one status, etc. The status space with all possible transitions is referred to as a trellis. For hidden Markov models of order 1, any history lying more than one time step in the past is irrelevant.
The Viterbi algorithm is based on the idea that, when one is locally on an optimum path in the status space (trellis), this is always a component part of a global optimum path. Due to the order 1 of the hidden Markov models, only the best predecessor of a status is to be considered, since the poorer predecessors have received a poorer evaluation in advance. This means that the optimum path can be sought recursively time step by time step beginning from the first point in time, in that all possible continuations of the path are identified for each time step and only the best continuation is selected.
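The recursive search described above can be sketched as follows. This is a minimal Python illustration for a discrete hidden Markov model, working with negative logarithms so that probabilities become additive costs; the function names `neg_log` and `viterbi` and the argument layout are assumptions made for this sketch, not taken from Rabiner et al. or Hauenstein.

```python
import math
from typing import List, Sequence, Tuple


def neg_log(p: float) -> float:
    """Negative logarithm, with probability 0 mapped to +infinity."""
    return -math.log(p) if p > 0.0 else math.inf


def viterbi(
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
    pi: Sequence[float],
    observations: Sequence[int],
) -> Tuple[List[int], float]:
    """Return the best status path and its accumulated cost (-log probability)."""
    n = len(pi)
    # Cost of starting in each status and emitting the first symbol.
    cost = [neg_log(pi[i]) + neg_log(B[i][observations[0]]) for i in range(n)]
    back: List[List[int]] = []
    for symbol in observations[1:]:
        new_cost, pointers = [], []
        for j in range(n):
            # Order 1: only the best predecessor of status j needs to be kept.
            best_i = min(range(n), key=lambda i: cost[i] + neg_log(A[i][j]))
            pointers.append(best_i)
            new_cost.append(
                cost[best_i] + neg_log(A[best_i][j]) + neg_log(B[j][symbol])
            )
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final status.
    last = min(range(n), key=lambda i: cost[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, cost[last]
```

Because only the best predecessor is stored per status and time step, the effort grows linearly with the number of time steps rather than with the number of possible paths.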
The methods described in Weintraub and Boulard et al. each require a modelling of the OOV parts. In the former instance of Weintraub, the words of the expression must be explicitly present in the vocabulary of the recognition unit; in the latter instance of Boulard et al., all OOV words and OOV noises are represented by specific filler models.
The object of the invention is to specify a method that enables the recognition of a keyword in spoken language, whereby the above-described disadvantages are avoided.
According to the method of the invention for recognizing a keyword in spoken language, the keyword is represented by a sequence of statuses W of hidden Markov models. The spoken language is sampled with a predetermined rate, and a feature vector Ot is produced at every sampling time t for a voice signal from the spoken language belonging to the sampling time t. The sequence O of feature vectors Ot is mapped onto the sequence of statuses with a Viterbi algorithm, whereby a local confidence standard is calculated on the basis of an emission standard at a status. The Viterbi algorithm supplies a global confidence standard. The keyword in the spoken language is recognized when the following applies:
C(W, O) < T,
where
C( ) denotes the confidence standard,
W denotes the keyword, represented as a sequence of statuses,
O denotes the sequence of feature vectors Ot,
T denotes a predetermined threshold.
Otherwise, the keyword in the spoken language is not recognized.
One advantage of the invention is that a keyword is recognized within the spoken language without the expression having to be modelled as a whole. This clearly reduces the implementation outlay and accordingly yields a higher-performance (faster) method. By employing the (global) confidence standard C as the underlying decoding principle, the acoustic modelling within the decoding procedure is limited to the keywords.
One development is that a new path through the status space of the hidden Markov models begins in a first status of the sequence of statuses W at each sampling time t. It is thus assumed at every sampling time that a beginning of a keyword is contained in the spoken language. On the basis of the confidence standard, feature vectors resulting from the following sampling times are mapped onto the statuses of the keyword represented by hidden Markov models. A global confidence standard results at the end of the mapping, i.e. at the end of the path, on the basis of which a decision is made as to whether the presumed beginning of the keyword was in fact such a beginning. If so, the keyword is recognized; otherwise, it is not recognized.
Within the scope of a development of the invention, the global confidence standard C is defined by
C=−log P(W|O)  (2)
and the corresponding local confidence standard c is defined by
$$c = -\log \frac{P(O_t \mid s_j) \cdot P(s_j)}{P(O_t)}, \qquad (3)$$
whereby
sj denotes a status of the sequence of statusses,
P(W|O) denotes a probability for the keyword under the condition of a sequence of feature vectors Ot,
P(Ot|sj) denotes the emission probability,
P(sj) denotes the probability for the status sj,
P(Ot) denotes the probability for the feature vector Ot.
A suitable global confidence standard is characterized by the property that it provides information about the degree of dependability with which a keyword is detected. In the negative logarithmic range, a small value of the global confidence standard C expresses a high dependability.
Within the scope of an additional development, the confidence standard C is defined by
$$C = -\log \frac{P(O \mid W)}{P(O \mid \overline{W})} \qquad (4)$$
and the corresponding local confidence standard is defined by
$$c = -\log \frac{P(O_t \mid s_j)}{P(O_t \mid \overline{s_j})} \qquad (5)$$
whereby
P(O|W̄) denotes the probability for the sequence of feature vectors Ot under the condition that the keyword W does not occur,
s̄j denotes the counter-event for the status sj (i.e.: not the status sj).
One advantage of the confidence standards described above is, among other things, that they can be calculated directly, i.e. no prior training and/or estimation is required.
The definition of the local confidence standards can be respectively derived from the definitions of the global confidence standards. Local confidence standards enter into the calculation of the confidence standard for a keyword at those points in time that coincide in time with the expression of this keyword.
The local confidence standards can be calculated with the relationships
$$P(O_t) = \sum_k P(O_t \mid s_k) \cdot P(s_k) \qquad (6)$$
and
$$P(O_t \mid \overline{s_j}) = \sum_{k \neq j} P(O_t \mid s_k) \cdot P(s_k) \qquad (7)$$
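The two local confidence standards of equations (3) and (5) can be evaluated from the emission probabilities and the status probabilities using relationships (6) and (7). The following Python sketch is an illustration only; the function names and the list-based argument layout (`emissions[k]` for P(Ot|sk), `priors[k]` for P(sk)) are assumptions, and relationship (7) is implemented exactly as written in the text.

```python
import math
from typing import Sequence


def observation_probability(
    emissions: Sequence[float], priors: Sequence[float]
) -> float:
    """Relationship (6): P(O_t) = sum over k of P(O_t|s_k) * P(s_k)."""
    return sum(e * p for e, p in zip(emissions, priors))


def counter_event_probability(
    emissions: Sequence[float], priors: Sequence[float], j: int
) -> float:
    """Relationship (7): sum over k != j of P(O_t|s_k) * P(s_k)."""
    return sum(
        e * p for k, (e, p) in enumerate(zip(emissions, priors)) if k != j
    )


def local_confidence_posterior(
    emissions: Sequence[float], priors: Sequence[float], j: int
) -> float:
    """Equation (3): c = -log( P(O_t|s_j) * P(s_j) / P(O_t) )."""
    p_ot = observation_probability(emissions, priors)
    return -math.log(emissions[j] * priors[j] / p_ot)


def local_confidence_ratio(
    emissions: Sequence[float], priors: Sequence[float], j: int
) -> float:
    """Equation (5): c = -log( P(O_t|s_j) / P(O_t|not s_j) )."""
    p_not_j = counter_event_probability(emissions, priors, j)
    return -math.log(emissions[j] / p_not_j)
```

In both variants, a status that explains the feature vector well receives a smaller value than a status that explains it poorly, which matches the convention that small values express high dependability. The value according to equation (3) is never negative, whereas the value according to equation (5) can become negative.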
Further, it is possible to determine P(Ot) or, respectively, P(Ot|s̄j) with suitable approximation methods. An example of such an approximation method is the averaging of the n-best emissions −log P(Ot|sj) at every time t.
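One possible reading of the n-best averaging is sketched below in Python. The text does not specify how the averaged value enters the confidence calculation, so the sketch only computes the average itself; the function name `n_best_average` and the interpretation of "best" as the n smallest values of −log P(Ot|sj) are assumptions.

```python
from typing import Sequence


def n_best_average(neg_log_emissions: Sequence[float], n: int) -> float:
    """Average of the n smallest (i.e. best) values of -log P(O_t|s_j)."""
    if not 1 <= n <= len(neg_log_emissions):
        raise ValueError("n must be between 1 and the number of statuses")
    best = sorted(neg_log_emissions)[:n]
    return sum(best) / n
```

Since only the n best emissions are considered at each time t, the sum over all statuses in relationships (6) and (7) does not have to be evaluated completely.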
The decoding procedure is usually implemented with the assistance of the Viterbi algorithm:
$$C_{t,s_j} = \min_k \left( C_{t-1,s_k} + c_{t,s_j} + a_{kj} \right),$$
where
Ct,sj denotes the global, accumulated confidence standard at time t in the status sj,
Ct−1,sk denotes the global, accumulated confidence standard at the time t−1 in the status sk,
ct,sj denotes the local confidence standard at the time t in the status sj,
akj denotes a transition penalty from the status sk into the status sj.
Since no local confidence standards outside the time limits of the keyword are required for a representation of the global confidence standard for a keyword, an acoustic modelling of the OOV parts can be dispensed with in the search for the keyword.
By applying the Viterbi algorithm with the possibility of starting a new path in the first status of a keyword at every time t, whereby the keyword is preferably subdivided into individual statuses of a hidden Markov model (HMM), the global confidence standard is optimized for every keyword and, at the same time, the optimum starting time is determined (backtracking of the Viterbi algorithm).
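The search with a new path starting at every time t can be sketched in Python as follows. The sketch is an illustration under several assumptions that are not stated in the text: the keyword is a strict left-to-right chain in which a status may only be repeated or followed by the next status, the transition penalties are reduced to the two hypothetical parameters `stay_penalty` and `advance_penalty`, and the starting time is carried along with each path instead of being recovered by explicit backtracking.

```python
import math
from typing import List, Tuple


def spot_keyword(
    local_conf: List[List[float]],
    threshold: float,
    stay_penalty: float = 0.0,
    advance_penalty: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Return (start_time, end_time, C) for every end time with C < threshold.

    local_conf[t][j] is the local confidence standard of status j at time t.
    """
    n_status = len(local_conf[0])
    prev_cost = [math.inf] * n_status  # accumulated C at time t-1
    prev_start = [-1] * n_status       # start time of the path in each status
    hits = []
    for t, c_t in enumerate(local_conf):
        cost = [math.inf] * n_status
        start = [-1] * n_status
        for j in range(n_status):
            # Candidate predecessors: (cost before adding c_t[j], start time).
            candidates = [(prev_cost[j] + stay_penalty, prev_start[j])]
            if j > 0:
                candidates.append(
                    (prev_cost[j - 1] + advance_penalty, prev_start[j - 1])
                )
            else:
                # A new path may begin in the first status at every time t.
                candidates.append((0.0, t))
            best_cost, best_start = min(candidates, key=lambda item: item[0])
            cost[j] = best_cost + c_t[j]
            start[j] = best_start
        # The last status closes the keyword; compare C with the threshold T.
        if cost[-1] < threshold:
            hits.append((start[-1], t, cost[-1]))
        prev_cost, prev_start = cost, start
    return hits
```

Only the statuses of the keyword appear in the trellis; no filler model and no further vocabulary are needed, in line with the statement that the acoustic modelling is limited to the keywords.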
It is also expedient to seek, within a predetermined time span, a minimum of the global confidence standard below the threshold T. Multiple recognition of a keyword within this predetermined time span is thereby avoided.
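A simple way of keeping only the minimum within a time span is sketched below in Python. The grouping rule used here (a time span opens with the first hit and has a fixed length `time_span`, measured in sampling times) is an assumption; the text only states that a minimum below the threshold T is sought within a predetermined time span.

```python
from typing import List, Tuple


def suppress_repeated_hits(
    hits: List[Tuple[int, float]], time_span: int
) -> List[Tuple[int, float]]:
    """Keep only the hit with the smallest C within each time span.

    hits is a list of (end_time, C) pairs with C already below the
    threshold T, sorted by end_time.
    """
    kept: List[Tuple[int, float]] = []
    group_start = 0
    for end_time, conf in hits:
        if kept and end_time - group_start < time_span:
            # Same time span: keep whichever hit has the smaller value of C.
            if conf < kept[-1][1]:
                kept[-1] = (end_time, conf)
        else:
            # A new time span opens with this hit.
            kept.append((end_time, conf))
            group_start = end_time
    return kept
```

With this rule, several neighboring end times that all fall below the threshold for the same spoken keyword are reported as a single recognition.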
When keywords are similar to one another in terms of their descriptive form, as represented by the respective sequence of statuses, it is useful to employ a mechanism that, upon recognition of a keyword, rules out that another keyword was partially contained in the voice signal during the time span of the recognized keyword.
Exemplary embodiments of the invention are presented with reference to the following Figures.