Speech processing for recognition, speaker identification, speaker verification and/or other such decoding has been greatly improved, and accurate techniques are now well known in the art. The following papers discuss the speech processing referenced above and represent the state of this art:
1) A paper entitled, "State of the Art in Continuous Speech Recognition," was published in the Proceedings of the National Academy of Sciences, USA, Vol. 92, pp. 9956-9963, Oct. 1995, authored by John Makhoul and Richard Schwartz. This paper is hereby incorporated by reference herein as if laid out in full.
2) A paper entitled, "Speaker Verification IR&D Final Report," was prepared by BBN Laboratories Inc., in Nov. 1987 and authored by Richard Schwartz, Alan Derr, Alexander Wilgus and John Makhoul. This paper is hereby incorporated by reference herein as if laid out in full.
3) A paper entitled, "Identification of Speakers engaged in Dialog," was published in the IEEE International Conference on Acoustics, Speech, and Signal Processing, held in Minneapolis, Minn., on Apr. 27-30 of 1993. This paper is hereby incorporated by reference herein as if laid out in full.
4) A paper entitled, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," was authored by Lawrence Rabiner and published by the IEEE in 1989. The IEEE log number is 8825949. This paper is hereby incorporated by reference herein as if laid out in full.
The Makhoul and Schwartz paper (item 1 above) provides a good basis for understanding the technical aspects that underlie the present invention, and a brief review of this paper follows. The authors wrote the paper under the auspices of BBN Systems and Technology, Cambridge, Mass., the assignee of the present patent. The paper discloses three major factors in speech recognition: linguistic variability, speaker variability and channel variability. Channel variability includes the effects of background noise and of the transmission apparatus, e.g. microphone, telephone, echoes, etc. The paper discusses the modeling of linguistic and speaker variations. One approach to speech recognition is to use a model, a logical finite-state machine in which transitions and outputs are probabilistic, to represent each of the groups of three (or two) phonemes found in speech. The models may have the same structure, but the parameters in the models are given different values. Each such model is a hidden Markov model (HMM). The HMM is a statistical construct that is well discussed in the above paper and the references listed therein, and is not described in depth herein. FIG. 5 of the paper describes an approach to speech recognition. The system is trained by actual speakers articulating words continuously. The audio signal is processed and features are extracted. The signal is often smoothed by filtering in hardware or in software (if digitized and stored), followed by mathematical operations on the resulting signal to form features which are computed periodically, say every 10 milliseconds or so. Continuous speech is marked by sounds or phonemes that are connected to each other. The two adjacent phonemes on either side of a given phoneme have a major effect, referred to as co-articulation, on the articulation of the center phoneme. Triphoneme is the name given to the different articulations of a given phoneme due to the effects of these side phonemes.
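By way of illustration only, the two ideas just described (phonemes modeled in the context of their left and right neighbors, and features computed periodically, about every 10 milliseconds) might be sketched as follows. The function names, the "sil" boundary symbol, the 8 kHz sample rate and the phonetic spelling of "cat" are all assumptions made for this sketch and are not taken from the referenced paper.

```python
def triphones(phonemes, boundary="sil"):
    """Return (left, center, right) context triples for a phoneme sequence.

    The boundary symbol (assumed here to be silence) pads both ends so that
    the first and last phonemes also receive a left and right neighbor.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def frame_starts(num_samples, sample_rate_hz=8000, step_ms=10):
    """Return the sample index at which each periodic feature frame begins."""
    step = sample_rate_hz * step_ms // 1000  # samples between successive frames
    return list(range(0, num_samples, step))


# Hypothetical phonetic spelling of the word "cat".
word = ["k", "ae", "t"]
for left, center, right in triphones(word):
    print(f"{center} in the context of {left} and {right}")

# 400 samples at the assumed 8 kHz rate cover 50 ms, i.e. five 10 ms steps.
print(frame_starts(400))
```

Each triple names one context-dependent unit, so the same center phoneme with different neighbors would be given its own model parameters.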
The continuous speech is divided into discrete transformed segments that facilitate the several mathematical operations. Many types of features have been used, including time and frequency masking and the taking of inverse Fourier transforms, resulting in a mathematical series whose coefficients are retained as a feature vector. The features are handled mathematically as vectors to simplify the training and recognition computations. Other features may include volume, frequency range, and amplitude dynamic range. Such use of vectors is well known in the art, and reference is found in the Makhoul and Schwartz paper on page 9959 et seq. The spoken words used in the training are listed in a lexicon, and a phonetic spelling of each word is formed and stored. Phonetic word models using HMMs are formed from the lexicon and the phonetic spellings. These HMM word models are iteratively compared to the training speech to maximize the likelihood that the training speech was produced by these HMM word models. The iterative comparing is performed by the Baum-Welch algorithm, which is guaranteed to converge to a local optimum. This algorithm is well known in the art, as referenced in the Makhoul and Schwartz paper on page 9960. A grammar is established and, with the lexicon, a single probabilistic grammar for the sequences of phonemes is formed. The result of the recognition training is that a particular sequence of words will correspond with a high probability to a recognized sequence of phonemes. Recognition of unknown speech begins with extracting the features as in the training stage. All word HMM sequences allowed by the grammar are searched to find the word (and therefore the triphoneme) sequence with the highest probability of generating that particular sequence of feature vectors. Prior art improvements have included development of large databases with large vocabularies of speaker-independent continuous speech for testing and development.
Contextual phonetic models have been developed, and improved recognition algorithms have been and are being developed. Probability estimation techniques have been developed and language models are being improved. In addition, computers with increased speed and power, combined with larger, faster memories, have improved real-time speech recognition. It has been found that increased training data reduces recognition errors, and tailored speaker-dependent training can produce very low error rates.
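The recognition step reviewed above, choosing the word model with the highest probability of having generated the observed features, might be sketched in a much simplified form as follows. This sketch assumes isolated words, discrete observation symbols (0 and 1) in place of real feature vectors, and two hypothetical two-state left-to-right models named "yes" and "no" with invented parameters; it scores each model with the standard forward algorithm and is not the search procedure of the referenced paper.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete-output HMM (forward algorithm)."""
    n = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Score obs under every word model; return the best word and all scores."""
    scores = {
        word: forward_likelihood(obs, *params) for word, params in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical models: both share the same structure (start and transition
# probabilities) and differ only in their output probabilities.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
WORD_MODELS = {
    "yes": (START, TRANS, [[0.9, 0.1], [0.1, 0.9]]),
    "no": (START, TRANS, [[0.1, 0.9], [0.9, 0.1]]),
}

best, scores = recognize([0, 0, 1, 1], WORD_MODELS)
print(best, scores)  # "yes" scores higher for this observation sequence
```

The two models share one structure and differ only in parameter values, as described above. A continuous-speech recognizer would instead search over sequences of such models as permitted by the grammar.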
There are examples of speech recognition applications applied to the Internet. One such example is IBM's Merlin speech recognition system. Merlin interfaces with the NETSCAPE NAVIGATOR® web browser so that text generated from speech by Merlin is available to others via the Internet. A limitation of this system is that the grammar, vocabulary and recognizer are all resident on the client's computer. In the case of a client running on a laptop or other such small computer, the speech recognition may be quite limited. Another issue with the setup of Merlin and other recognition systems resident at the client is that all updates and other changes must be made to each client, and there may be literally millions of such clients. Even though updates and the like could be made available over the Internet, this remains an inconvenience.
Texas Instruments offers an Internet-based speech recognition system called SAM, which requires the speech recognizer software to reside at the client. SAM is a Mosaic browser which has been modified to include a speech recognizer. However, the grammar is, in effect, distributed and downloaded when a Web page for a specific topic, called a "smart page," is entered. For example, a weather report page could have a grammar specific to words and phrases associated with the weather. An example would be a grammar that recognizes such utterances as "Show me the weather for Boston," yielding a weather report for Boston. An artifact of such a system is that the vocabularies and grammars are small, and the system cannot accommodate the large vocabularies and grammars associated with speech recognition in general. Vocabularies and grammars for such general systems are large, on the order of 150 MBytes, which is too large to download at run time. SAM also requires that the user's browser be replaced by one modified to contain a speech recognizer.
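To make the notion of a small, page-specific grammar concrete, a toy sketch follows. The data layout, the "<city>" slot notation, the list of cities and the parse function are all hypothetical and are not drawn from SAM; the sketch only shows how a grammar tied to one page accepts a narrow set of utterances and rejects everything else.

```python
# Hypothetical grammar that a weather page might download with its content.
WEATHER_GRAMMAR = {
    "template": ["show", "me", "the", "weather", "for", "<city>"],
    "slots": {"<city>": {"boston", "chicago", "denver"}},
}


def parse(words, grammar):
    """Return the filled slots if words fit the grammar, otherwise None."""
    template = grammar["template"]
    if len(words) != len(template):
        return None
    filled = {}
    for word, token in zip(words, template):
        if token in grammar["slots"]:
            # A slot accepts only the vocabulary listed for it.
            if word not in grammar["slots"][token]:
                return None
            filled[token] = word
        elif word != token:
            # A literal token must match the spoken word exactly.
            return None
    return filled


print(parse("show me the weather for boston".split(), WEATHER_GRAMMAR))
print(parse("show me the weather for paris".split(), WEATHER_GRAMMAR))
```

The second utterance is rejected because "paris" is outside the page's vocabulary, which illustrates the restriction to small vocabularies noted above.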
Another limitation of systems where the speech recognition resides at the client is that the arrangement is not conducive to speaker identification and/or verification. Since the speech recognition is at the client, the identification and verification are not likely to be secure. If the speech recognition is at a single server location, that one location, serving many clients, can be made secure.
Another Internet-based speech recognition system (SLAM) was developed by the Oregon Graduate Institute (OGI). This system sends compressed digitized speech over the Internet to a remote server where a recognizer resides. In this system the speech is entirely received before the recognition process begins. SLAM retains the speech at the client, which allows better speech compression and relieves the recognizer from having to determine overhead-like functions such as the end of speech. As a result, however, the speech recognition cannot function in real time. The speech compression of SLAM is data compression in which the communication bandwidth is reduced relative to sending raw digitized speech, but a bandwidth limitation remains with SLAM, and excessive time delays may occur between the speech and the delivered text.
Also, SLAM does not refer to speaker identification or to speaker verification.
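The time delay attributed above to holding the entire utterance before sending it can be illustrated with a simplified transmission model. The figures used (5 seconds of speech, a compressed rate of 16,000 bits per second, a 28,800 bits per second link and 0.1 second chunks) are assumptions chosen for illustration only; the model ignores network latency and recognition time and assumes the utterance length is a whole number of chunks.

```python
def batch_delay_s(speech_s, bitrate_bps, bandwidth_bps):
    """Delay after the end of speech when sending starts only at the end."""
    return speech_s * bitrate_bps / bandwidth_bps


def streaming_delay_s(speech_s, bitrate_bps, bandwidth_bps, chunk_s):
    """Delay after the end of speech when each chunk is sent once captured."""
    total_send_s = speech_s * bitrate_bps / bandwidth_bps
    chunk_send_s = chunk_s * bitrate_bps / bandwidth_bps
    # Sending ends no earlier than (a) the last chunk being captured and then
    # sent, and (b) the link, busy from the first chunk on, draining all data.
    finish_s = max(speech_s + chunk_send_s, chunk_s + total_send_s)
    return finish_s - speech_s


batch = batch_delay_s(5.0, 16_000, 28_800)
stream = streaming_delay_s(5.0, 16_000, 28_800, 0.1)
print(f"batch: {batch:.3f} s, streaming: {stream:.3f} s")
```

Under these assumed figures the batch arrangement adds a delay of several seconds after the speaker stops, while the chunked arrangement adds only the time needed to send the final chunk.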
It is an object of the present invention to provide a speech processing system distributed between a client and a server operating in a "streaming" or real-time continuous mode. A related object is to provide a real-time speech processor distributed over a network, e.g. the Internet.
It is another object of the present invention to provide a speech processing system useful with low bandwidth communications channels and where the client computer is a laptop or other such computer with limited memory and/or speed.
Another object of the present invention is to provide control information between the client and the server over a communications link that directs and augments the speech processing.