The invention relates to a method for recognizing an input pattern stored in a user station using a recognition unit of a server station; the server station and the user station being connected via a network; the recognition unit being operative to recognize the input pattern using a model collection of at least one recognition model; the method comprising:
performing an initial recognition enrolment step, comprising transferring model improvement data associated with a user of the user station from the user station to the recognition unit; and associating the user of the user station with a user identifier; and
for a recognition session between the user station and the server station, transferring a user identifier associated with a user of the user station and an input pattern representative of time sequential input generated by the user from the user station to the server station; and using the recognition unit to recognize the input pattern by incorporating at least one recognition model in the model collection which reflects the model improvement data associated with the user.
The invention further relates to a pattern recognition system comprising at least one user station storing an input pattern and a server station comprising a recognition unit; the recognition unit being operative to recognize the input pattern using a model collection of at least one recognition model; the server station being connected to the user station via a network;
the user station comprising means for initially transferring model improvement data associated with a user of the user station and a user identifier associated with the user to the server station; and for each recognition session between the user station and the server station transferring a user identifier associated with a user of the user station and an input pattern representative of time sequential input generated by the user to the server station; and
the server station comprising means for, for each recognition session between the user station and the server station, incorporating at least one recognition model in the model collection which reflects the model improvement data associated with a user from which the input pattern originated; and using the recognition unit to recognize the input pattern received from the user station.
Pattern recognition systems, such as large vocabulary continuous speech recognition systems or handwriting recognition systems, typically use a collection of recognition models to recognize an input pattern. For instance, an acoustic model and a vocabulary may be used to recognize words and a language model may be used to improve the basic recognition result. FIG. 1 illustrates a typical structure of a large vocabulary continuous speech recognition system 100 [see L. Rabiner, B-H. Juang, "Fundamentals of speech recognition", Prentice Hall 1993, pages 434 to 454]. The system 100 comprises a spectral analysis subsystem 110 and a unit matching subsystem 120. In the spectral analysis subsystem 110 the speech input signal (SIS) is spectrally and/or temporally analysed to calculate a representative vector of features (observation vector, OV). Typically, the speech signal is digitised (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames, corresponding to, for instance, 32 msec. of speech signal. Successive frames partially overlap, for instance by 16 msec. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector). The feature vector may, for instance, have 24, 32 or 63 components. The standard approach to large vocabulary continuous speech recognition is to assume a probabilistic model of speech production, whereby a specified word sequence W=w1w2w3 . . . wq produces a sequence of acoustic observation vectors Y=y1y2y3 . . . yT. The recognition error can be statistically minimised by determining the sequence of words w1w2w3 . . . wq which most probably caused the observed sequence of observation vectors y1y2y3 . . . yT (over time t=1, . . . , T), where the observation vectors are the outcome of the spectral analysis subsystem 110.
This results in determining the maximum a posteriori probability:
max P(W|Y), for all possible word sequences W
By applying Bayes' theorem on conditional probabilities, P(W|Y) is given by:
P(W|Y) = P(Y|W).P(W) / P(Y)
Since P(Y) is independent of W, the most probable word sequence is given by:
arg max P(Y|W).P(W), for all possible word sequences W    (1)
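As an illustration of equation (1), the following minimal sketch picks the most probable word sequence from a small set of candidates. The candidate sequences and their probabilities are hypothetical values chosen only for the example; a real recognizer would obtain them from the acoustic model and the language model. Log probabilities are summed rather than multiplying probabilities, which is a common way to avoid numerical underflow.

```python
import math

def map_decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate W that maximises log P(Y|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical candidate word sequences for one fixed observation sequence Y.
candidates = [("recognize", "speech"), ("wreck", "a", "nice", "beach")]

# Hypothetical acoustic scores log P(Y|W) and language model scores log P(W).
acoustic_logprob = {candidates[0]: math.log(0.20), candidates[1]: math.log(0.25)}
language_logprob = {candidates[0]: math.log(0.010), candidates[1]: math.log(0.001)}

best = map_decode(candidates, acoustic_logprob, language_logprob)
print(" ".join(best))
```

In this example the second candidate has the better acoustic score, but the language model term P(W) outweighs it, so the first candidate is selected.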
In the unit matching subsystem 120, an acoustic model provides the first term of equation (1). The acoustic model is used to estimate the probability P(Y|W) of a sequence of observation vectors Y for a given word string W. For a large vocabulary system, this is usually performed by matching the observation vectors against an inventory of speech recognition units. A speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit. A word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references. For systems in which a whole word is represented by a speech recognition unit, a direct relationship exists between the word model and the speech recognition unit. Other systems, in particular large vocabulary systems, may use for the speech recognition unit linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenemes and fenones. For such systems, a word model is given by a lexicon 134, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 132, describing sequences of acoustic references of the involved speech recognition unit. A word model composer 136 composes the word model based on the sub-word model 132 and the lexicon 134. FIG. 2A illustrates a word model 200 for a system based on whole-word speech recognition units, where the speech recognition unit of the shown word is modelled using a sequence of ten acoustic references (201 to 210). FIG. 2B illustrates a word model 220 for a system based on sub-word units, where the shown word is modelled by a sequence of three sub-word models (250, 260 and 270), each with a sequence of four acoustic references (251 to 254; 261 to 264; 271 to 274).
The word models shown in FIG. 2 are based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech and handwriting signals. Using this model, each recognition unit (word model or sub-word model) is typically characterised by an HMM, whose parameters are estimated from a training set of data. For large vocabulary speech recognition systems involving, for instance, 10,000 to 60,000 words, usually a limited set of, for instance, 40 sub-word units is used, since it would require a large amount of training data to adequately train an HMM for larger units. An HMM state corresponds to an acoustic reference (for speech recognition) or an allographic reference (for handwriting recognition). Various techniques are known for modelling a reference, including discrete or continuous probability densities.
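The role of the word model composer 136 can be sketched as follows. This is only an illustrative assumption about the data layout: the lexicon is taken to be a mapping from a word to its sub-word units, and each sub-word model is taken to be a list of acoustic reference labels. The word, the units and the labels are hypothetical.

```python
def compose_word_model(word, lexicon, sub_word_models):
    """Concatenate the acoustic references of the sub-word units that make up `word`."""
    references = []
    for unit in lexicon[word]:
        references.extend(sub_word_models[unit])
    return references

# Hypothetical lexicon: word -> sequence of sub-word units.
lexicon = {"cat": ["k", "ae", "t"]}

# Hypothetical sub-word models with four acoustic references each, as in FIG. 2B.
sub_word_models = {
    "k": ["k1", "k2", "k3", "k4"],
    "ae": ["ae1", "ae2", "ae3", "ae4"],
    "t": ["t1", "t2", "t3", "t4"],
}

word_model = compose_word_model("cat", lexicon, sub_word_models)
print(word_model)
```

With three sub-word units of four references each, the composed word model is a sequence of twelve acoustic references, and the same sub-word models can be re-used for every word in the lexicon.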
A word level matching system 130 matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vectors and a sequence. If sub-word units are used, constraints are placed on the matching by using the lexicon 134 to limit the possible sequences of sub-word units to sequences in the lexicon 134. This reduces the outcome to possible sequences of words. A sentence level matching system 140 uses a language model (LM) to place further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model. As such the language model provides the second term P(W) of equation (1). Combining the results of the acoustic model with those of the language model yields the outcome of the unit matching subsystem 120, which is a recognized sentence (RS). The language model used in pattern recognition may include syntactical and/or semantical constraints 142 of the language and the recognition task. A language model based on syntactical constraints is usually referred to as a grammar 144. The grammar 144 used by the language model provides the probability of a word sequence W=w1w2w3 . . . wq, which in principle is given by:
P(W) = P(w1).P(w2|w1).P(w3|w1w2) . . . P(wq|w1w2w3 . . . wq−1).
Since in practice it is infeasible to reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language, N-gram word models are widely used. In an N-gram model, the term P(wj|w1w2w3 . . . wj−1) is approximated by P(wj|wj−N+1 . . . wj−1). In practice, bigrams or trigrams are used. In a trigram, the term P(wj|w1w2w3 . . . wj−1) is approximated by P(wj|wj−2wj−1).
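The trigram approximation can be illustrated with a minimal sketch that estimates the conditional probabilities from counts in a small corpus. The corpus, the sentence boundary markers and the function names are assumptions made for the example. The estimate is a plain maximum-likelihood ratio of counts; practical language models would additionally apply smoothing so that unseen trigrams do not receive probability zero.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts, padding each sentence."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            trigrams[tuple(words[i - 2:i + 1])] += 1
            contexts[tuple(words[i - 2:i])] += 1
    return trigrams, contexts

def trigram_prob(trigrams, contexts, w2, w1, w):
    """Maximum-likelihood estimate of P(w | w2 w1); 0.0 for an unseen context."""
    context_count = contexts[(w2, w1)]
    return trigrams[(w2, w1, w)] / context_count if context_count else 0.0

def sentence_prob(trigrams, contexts, sentence):
    """Approximate P(W) as the product of trigram probabilities."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        p *= trigram_prob(trigrams, contexts, words[i - 2], words[i - 1], words[i])
    return p

# Hypothetical training corpus.
corpus = ["the lens is sharp", "the lens is new", "the camera is new"]
trigrams, contexts = train_trigram(corpus)
print(trigram_prob(trigrams, contexts, "lens", "is", "sharp"))
print(sentence_prob(trigrams, contexts, "the lens is sharp"))
```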
Similar systems are known for recognising handwriting. The language model used for a handwriting recognition system may, in addition to or as an alternative to specifying word sequences, specify character sequences.
User independent pattern recognition systems are provided with user independent recognition models. In order to achieve an acceptable level of recognition, large vocabulary recognition systems in particular are made user dependent by training the system for a specific user. An example of such a system is the Philips SP 6000 dictation system. This system is a distributed system, wherein a user can dictate directly to a user station, such as a personal computer or workstation. The speech is recorded digitally and transferred to a server station via a network, where the speech is recognized by a speech recognition unit. The recognized text can be returned to the user station. In this system the acoustic references of the acoustic model are trained for a new user of the system by the new user dictating a predetermined text, with an approximate duration of 30 minutes. This provides sufficient data to the server station to enable building an entirely new set of acoustic references for the user. After this enrolment phase, the user may dictate text. For each dictation session, the recognition unit in the server station retrieves the acoustic references associated with the dictating user and uses these to recognize the dictation. Other recognition models, such as a lexicon, vocabulary or language model, are not trained to a specific user. For these aspects, the system is targeted towards only one specific category of users, such as legal practitioners, physicians, surgeons, etc.
The relatively long duration of training hinders acceptance of the system by users who would like to use the system occasionally or for a short time. Moreover, the relatively large number of acoustic references which need to be stored by the server station for each user makes the system less suitable for large numbers of users. Using the system for dictating a text in a different field than the one aimed at by the language model and vocabulary could result in a degraded recognition result.
It is an object of the invention to enable pattern recognition in a client-server configuration, without an undue training burden on a user. It is a further object of the invention to enable pattern recognition in a client-server configuration, where the server is capable of simultaneously supporting recognition for many clients (user stations). It is a further object to enable pattern recognition for a wide range of subjects.
To achieve the object, the method according to the invention is characterised in that the server comprises a plurality of different recognition models of a same type; in that the recognition enrolment step comprises selecting a recognition model from the plurality of different recognition models of a same type in dependence on the model improvement data associated with the user; and storing an indication of the selected recognition model in association with the user identifier; and in that the step of recognising the input pattern comprises retrieving a recognition model associated with the user identifier transferred to the server station and incorporating the retrieved recognition model in the model collection.
By storing a number of recognition models of a same type, e.g. a number of language models each targeted towards at least one different subject, such as photography, gardening, cars, etc., a suitable recognition model can be selected for a specific user of the system. This allows good quality recognition. In this way, a user is not bound to one specific type of recognition model, such as a specific language model or vocabulary, whereas at the same time the flexibility of the system is achieved by re-using models for many users. For instance, all users who have expressed an interest in photography can use the same language model which covers photography. As such, this flexibility and the associated good recognition result provided by using a user-oriented recognition model are achieved without storing a specific model for each user.
Advantageously, the amount of training data which needs to be supplied by the user can also be substantially smaller than in the known system. Instead of requiring a sufficient amount of data to fully train a model or to adapt an already existing model, according to the invention the amount of data needs only to be sufficient to select a suitable model from the available models.
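The enrolment and session steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name, the method names and the representation of a language model as a plain string are all hypothetical. The point shown is that the server stores only an indication of the selected model per user identifier, while the models themselves are shared.

```python
class RecognitionServer:
    """Hypothetical server holding shared models and per-user selections."""

    def __init__(self, language_models):
        # topic -> shared language model (represented here by a plain string)
        self.language_models = language_models
        # user identifier -> indication (topic) of the selected model
        self.selected = {}

    def enrol(self, user_id, topic):
        """Store an indication of the selected model, not a per-user model."""
        if topic not in self.language_models:
            raise KeyError(f"no language model for topic {topic!r}")
        self.selected[user_id] = topic

    def model_collection_for(self, user_id):
        """Retrieve the model associated with the user identifier for a session."""
        topic = self.selected[user_id]
        return {"language_model": self.language_models[topic]}

server = RecognitionServer(
    {"photography": "LM-photography", "gardening": "LM-gardening"}
)
server.enrol("alice", "photography")
server.enrol("bob", "photography")
print(server.model_collection_for("alice"))
```

Both hypothetical users end up with the same shared photography model in their model collection, so the storage per user is limited to one small entry.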
The plurality of recognition models of a same type is formed by a basic recognition model and a plurality of adaptation profiles. A recognition model is selected by choosing an appropriate adaptation profile and adapting the basic model using the chosen adaptation profile. For instance, a basic language model may cover all frequently used word sequences of a language, whereas the adaptation profile covers word sequences for a specific area of interest. The adapted language model may then cover both the commonly used and the specific sequences. In this way it is sufficient to store only one basic model (of a given type) and a number of, usually much smaller, adaptation profiles.
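One simple way to picture the combination of a basic language model with an adaptation profile is shown below. The text does not prescribe how the adaptation is computed, so this sketch assumes, purely for illustration, that both the basic model and the profile are stored as bigram counts and that adaptation adds the counts together. All words and counts are hypothetical.

```python
from collections import Counter

def adapt_language_model(basic_counts, profile_counts):
    """Return bigram counts covering both the basic model and the profile."""
    adapted = Counter(basic_counts)
    adapted.update(profile_counts)  # adds the profile counts to the copy
    return adapted

def bigram_prob(counts, previous, word):
    """Maximum-likelihood estimate of P(word | previous) from bigram counts."""
    total = sum(c for (p, _), c in counts.items() if p == previous)
    return counts[(previous, word)] / total if total else 0.0

# Hypothetical basic model covering frequently used word sequences.
basic = Counter({("the", "house"): 8, ("the", "car"): 2})
# Hypothetical, much smaller adaptation profile for photography.
photography_profile = Counter({("the", "aperture"): 5, ("the", "lens"): 5})

adapted = adapt_language_model(basic, photography_profile)
print(bigram_prob(basic, "the", "aperture"))
print(bigram_prob(adapted, "the", "aperture"))
```

The basic model assigns no probability to the subject-specific sequence, whereas the adapted model covers both the common and the specific sequences, and the basic model itself is left unchanged for other users.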
The model improvement data comprises acoustic training data, such as acoustic references. Based on the acoustic training data a suitable acoustic model is selected or a basic acoustic model is adapted using a suitable adaptation profile. A simple way of achieving this is to recognize a relatively short utterance of a user (e.g. limited to a few sentences) with a range of different acoustic models. Each of the models is, preferably, targeted towards a specific type of speech, such as female/male speech, slow speech/fast speech, or speech with different accents. The acoustic model which gave the best result is then selected.
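The selection of the best-scoring acoustic model can be sketched as follows. As a stand-in for a full recognition pass, this example assumes that each acoustic model is a single diagonal Gaussian over feature vectors and that the "result" is the log-likelihood of the enrolment utterance; the model names, parameters and feature values are hypothetical.

```python
import math

def log_likelihood(model, vectors):
    """Log-likelihood of feature vectors under a diagonal Gaussian model."""
    total = 0.0
    for vector in vectors:
        for x, mean, var in zip(vector, model["mean"], model["var"]):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total

def select_acoustic_model(models, vectors):
    """Return the name of the model that scores the short utterance best."""
    return max(models, key=lambda name: log_likelihood(models[name], vectors))

# Hypothetical acoustic models, each targeted towards a different type of speech.
models = {
    "model_a": {"mean": [2.0, 1.0], "var": [1.0, 1.0]},
    "model_b": {"mean": [-2.0, 0.0], "var": [1.0, 1.0]},
}

# Hypothetical observation vectors from a short enrolment utterance.
utterance = [[1.8, 0.9], [2.3, 1.2], [1.9, 0.7]]

print(select_acoustic_model(models, utterance))
```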
The acoustic model adaptation profile comprises a matrix for transforming an acoustic reference space, or a set of acoustic references to be combined with acoustic references used by the basic acoustic model. In this way the acoustic model can be adapted in an effective way.
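The matrix variant of the adaptation profile can be illustrated with a short sketch in which every acoustic reference of the basic model is treated as a vector and multiplied by the profile matrix. The two-dimensional references and the matrix values are hypothetical; actual reference vectors would have as many components as the feature vectors.

```python
def transform_references(matrix, references):
    """Apply the same linear transform to every acoustic reference vector."""
    return [
        [sum(a * x for a, x in zip(row, reference)) for row in matrix]
        for reference in references
    ]

# Hypothetical acoustic references of the basic acoustic model.
basic_references = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Hypothetical adaptation profile: one matrix shared by all references.
profile_matrix = [[2.0, 0.0], [0.5, 1.0]]

adapted_references = transform_references(profile_matrix, basic_references)
print(adapted_references)
```

Only the small matrix needs to be stored per profile, while the basic references are stored once.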
The model improvement data comprises language model training data. In a preferred embodiment, the language model training data comprises at least one context identifier. Preferably, the context identifier comprises or indicates a keyword. Based on the training data, a language model or language model adaptation profile is selected.
The model improvement data comprises vocabulary training data, such as a context identifier, allowing selection of a corresponding vocabulary or vocabulary adaptation profile used for adapting a basic vocabulary.
The context identifier comprises or indicates a sequence of words, such as a phrase or a text. At least one keyword is extracted from the sequence of words and the selection of the model or adaptation profile is based on the extracted keyword(s).
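Keyword extraction followed by model selection can be sketched as below. The topics, the keyword sets and the selection rule (choose the topic sharing the most keywords with the text) are assumptions made for the example; the text above only states that the selection is based on the extracted keyword(s).

```python
import re

# Hypothetical topics, each with the keywords its language model is targeted at.
TOPIC_KEYWORDS = {
    "photography": {"camera", "lens", "aperture", "exposure"},
    "gardening": {"soil", "seed", "prune", "compost"},
}

def extract_keywords(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the words of `text` that are known keywords of any topic."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    known = set().union(*topic_keywords.values())
    return words & known

def select_topic(text, topic_keywords=TOPIC_KEYWORDS):
    """Select the topic sharing the most keywords with `text`, or None."""
    keywords = extract_keywords(text, topic_keywords)
    scores = {topic: len(keywords & kw) for topic, kw in topic_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

phrase = "I need a faster lens and a wider aperture for my camera"
print(extract_keywords(phrase))
print(select_topic(phrase))
```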
To achieve the object, the pattern recognition system is characterised in that the server station comprises a plurality of different recognition models of a same type; means for selecting a recognition model from the plurality of different recognition models of a same type in dependence on the model improvement data associated with the user; and for storing an indication of the selected recognition model in association with the user identifier; and means for retrieving a recognition model associated with the user identifier transferred to the server station and for incorporating the retrieved recognition model in the model collection.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings.