The invention concerns a computer system for speech recognition with means for producing fenemic baseforms of words, where the fenemic baseform of a word comprises a number of acoustic labels. Means are provided for storing the words and their associated fenemic baseforms.
In speech recognition computer systems known in the art, a text spoken by a speaker is converted by the computer system into the corresponding character form. The spoken text is recognized by comparing acoustic information obtained by the computer system from a previously spoken training text with acoustic information derived by the computer system from the text to be recognized. The acoustic information from the spoken training text includes a word table, in which all words known to, and therefore recognizable by, the computer system are stored. For each word in the word table, the acoustic information includes the word's fenemic baseform, which has been produced in a known manner from the spoken training text on a word basis by the so-called "growing" algorithm or Viterbi alignment (see Lalit R. Bahl et al., "Constructing Markov Models of Words From Multiple Utterances," U.S. Pat. No. 4,759,068).
If a spoken text to be recognized contains a new word not contained in the word table, it is not possible for the computer system to recognize this new word correctly.
Previously, it was necessary to produce the fenemic baseform of the new word from a new spoken training text containing the word and to store the word, together with this baseform, in the word table. If the new word subsequently occurred in a text to be recognized, it was found in the expanded word table and therefore identified. The effort this requires, particularly for speaking the new training text, is excessive.
In an article entitled "Automatic Construction of Fenemic Markov Word Models For Speech Recognition" by M. Ferretti et al. (IBM Technical Disclosure Bulletin, Vol. 33, No. 6b, Nov. 1990, pp. 233-237), another method is disclosed which attempts to produce the fenemic baseform of a new word previously unknown to the computer system more easily and to store this word in the word table. For this purpose, the new word is first converted into its phonetic baseform. The phonetic baseform is then divided into so-called triphones, where a triphone is a sequence of three consecutive sounds of the phonetic baseform. The triphones contained in the new word are searched for in the words already in the word table, so that the leafemic baseform of the middle sound of a retrieved triphone can be used at the corresponding position of the fenemic baseform of the new word. The fenemic baseform of the new word is thus constructed from a number of leafemic baseforms. If a triphone of the new word is not found among the words in the word table, the search is continued for similar triphones with the aid of similarity matrices. Under unfavorable conditions, the processing of these similarity matrices in particular can require a large amount of computation time.
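The triphone-based construction described in the cited article can be illustrated by the following minimal Python sketch. It is not taken from the article; the function names, the boundary marker "#" used to pad the first and last sound, the layout of the triphone table, and the label strings are all assumptions made for illustration. The fallback search with similarity matrices is not modeled; triphones that are not found are merely reported.

```python
from typing import Dict, List, Sequence, Tuple

Triphone = Tuple[str, str, str]


def split_into_triphones(phones: Sequence[str], boundary: str = "#") -> List[Triphone]:
    """Return one triphone per sound: (left neighbour, sound, right neighbour).

    Padding with a boundary marker is an assumption of this sketch, so that
    the first and last sounds also receive a triphone context.
    """
    padded = [boundary, *phones, boundary]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]


def build_fenemic_baseform(
    phones: Sequence[str],
    triphone_table: Dict[Triphone, List[str]],
) -> Tuple[List[str], List[Triphone]]:
    """Concatenate the leafemic baseform of each triphone's middle sound.

    triphone_table is a hypothetical index, built from the words already in
    the word table, that maps a triphone to the acoustic labels (leafemic
    baseform) of its middle sound. Triphones missing from the index are
    returned separately; a similarity search would handle them.
    """
    baseform: List[str] = []
    missing: List[Triphone] = []
    for triphone in split_into_triphones(phones):
        leafemic = triphone_table.get(triphone)
        if leafemic is None:
            missing.append(triphone)
            continue
        baseform.extend(leafemic)
    return baseform, missing


# Hypothetical data: labels such as "F1" stand in for acoustic labels.
example_table: Dict[Triphone, List[str]] = {
    ("#", "k", "ae"): ["F1", "F2"],
    ("k", "ae", "t"): ["F3"],
}
example_baseform, example_missing = build_fenemic_baseform(["k", "ae", "t"], example_table)
```

In this example the first two triphones are found and contribute their labels, while the third, ("ae", "t", "#"), is not in the table and would be the starting point for the costly similarity search.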