This invention relates generally to speech token recognition systems. More particularly, the invention concerns such systems that are capable of recognizing spoken utterances, e.g. separately vocalized letters or other tokens, that are within a library developed by neural network-based learning techniques.
Speech recognition systems proliferate. Conventionally, speech recognition system development has been in the area of speaker-dependent systems and has focused upon individual user adaptation, i.e. such systems have been designed to recognize, with increasing accuracy, words and phrases spoken by a particular individual, in accommodation of that individual's vocalization idiosyncrasies.
More recent developments in neural modeling enable higher speed and increasingly fine adjustment to speech recognition algorithms, with modestly improved separated speech token recognition accuracy and greatly improved versatility that result in part from the learning capabilities of neural model- or network-based systems. Some such developments have been reported by us in "Spoken Letter Recognition", Proceedings of the Third DARPA Speech and Natural Language Workshop, Somerset, Pa., June, 1990, which report is incorporated herein by this reference and familiarity with which is assumed.
The English alphabet is a challenging vocabulary for computer speech recognition because of the acoustic similarity of many letter pairs (e.g., B/D, B/V, P/T, T/G and M/N). Research has led to systems that perform accurate speaker-independent recognition of spoken letters using high quality or telephone speech, as described more recently in Mark Fanty and Ron Cole, "Speaker-Independent English Alphabet Recognition: Experiments with the E-Set", Proceedings of the International Conference on Spoken Language Processing, Kobe, Japan, November 1990 and Ronald Cole, Krist Roginski and Mark Fanty, "English Alphabet Recognition with Telephone Speech", Proceedings of 2nd European Conference on Speech Communication and Technology, Genova, Italy, September 1991.
There is yet a felt need for further improvement in the ability of speech-recognition systems to become speaker independent, by which is meant systems that are capable of recognizing the speech of a large universe of speakers having idiosyncratic speech patterns, and that require no retraining or adaptive techniques but instead can accurately interpret words or phrases spoken by an individual whose speech has never before been heard by the system. There also is need to further improve techniques used in such systems for discerning subtle differences among utterances, especially separated, spoken letters, in order to increase the accuracy with which such utterances are classified. Finally, there is need to further develop neural-network-based systems that are more readily adaptable or trainable to a different set of recognized vocabulary entries characterized by a different set of sonorant or phonetic components, e.g. a foreign language having different vocalization patterns; and to a different set of spectral components, e.g. utterances received over a voice phone line.
Briefly, the invention is a neural-network-based system having five basic processing components: (1) data capture and signal representation utilizing spectral analysis; (2) phonetic classification of discrete time frames; (3) location of speech segments in hypothesized tokens or letters; (4) reclassification of hypothesized tokens or letters; and (5) recognition of spelled words from the classified letter scores. Importantly, phoneme classification involves a phoneme set that represents a substantial number of the tokens (or what will be referred to herein also as morphemes) in, for example, the English language alphabet, and the reclassification makes fine phonetic distinctions between difficult-to-discriminate tokens or letters in such vocabulary. The system is described in an application in which names spelled over phone lines are recognized with a high degree of accuracy as being within a defined library.
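By way of illustration only, the fifth component (recognition of spelled words from the classified letter scores) may be sketched as follows. The sketch is not the disclosed implementation. It assumes that each hypothesized letter arrives as a set of probability-like scores, that only library names having the same number of letters as the utterance are considered, and that a name is scored by summing the logarithms of its letters' scores. The names `best_name`, `letter_scores`, `library` and `floor` are hypothetical and do not appear in the specification.

```python
import math


def best_name(letter_scores, library, floor=1e-6):
    """Return (name, log_score) for the best-matching library name.

    letter_scores: one dict per hypothesized letter, mapping a letter to a
        probability-like score (assumed output of the reclassification step).
    library: iterable of candidate names (the "defined library").
    floor: assumed minimum score for letters absent from a dict, so that
        the logarithm is always defined.
    """
    best, best_score = None, -math.inf
    for name in library:
        # Assumption: no inserted or deleted letters, so lengths must match.
        if len(name) != len(letter_scores):
            continue
        score = sum(
            math.log(max(scores.get(ch, 0.0), floor))
            for ch, scores in zip(name.upper(), letter_scores)
        )
        if score > best_score:
            best, best_score = name, score
    return best, best_score


# Hypothetical scores for a three-letter utterance. The top-scoring letters
# spell "DOB", which is not in the library, because B and D are confusable.
letter_scores = [
    {"D": 0.5, "B": 0.4, "V": 0.1},
    {"O": 0.9, "A": 0.1},
    {"B": 0.6, "D": 0.3, "P": 0.1},
]
library = ["BOB", "DAN", "DOT", "ROB"]

name, score = best_name(letter_scores, library)
print(name)
```

In this example the library constrains the result to "BOB" even though the first letter was individually scored as more likely to be "D", which suggests how a defined library can compensate for residual confusion between acoustically similar letters.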