Speech recognition systems are well known in the art. Examples include the IBM Tangora ("A Maximum Likelihood Approach to Continuous Speech Recognition;" L. R. Bahl, F. Jelinek, R. Mercer; Readings in Speech Recognition; Ed.: A. Waibel, K. Lee; Morgan Kaufmann, 1990; pp. 308-319) and the Dragon Systems Dragon 30k dictation systems. Typically, they are single-user, speaker-dependent systems. Each speaker must therefore train the speech recognizer with his or her voice patterns during a process called "enrollment". The systems then maintain a profile for each speaker, who must identify himself or herself to the system in future recognition sessions. Typically, speakers enroll via a local microphone in a low-noise environment, speaking to the single machine on which the recognizer is resident. During enrollment, the speaker is required to read a lengthy set of transcripts so that the system can adjust itself to the peculiarities of that particular speaker.
Discrete dictation systems, such as the two mentioned above, require speakers to form each word in a halting and unnatural manner, pausing between words. This allows the speech recognizer to identify the voice pattern associated with each individual word by using the preceding and following silences to bound the word. The speech recognizer typically operates on a single machine and is trained for a single application, such as Office Correspondence in the case of the IBM Tangora System.
Multi-user environments with speaker-dependent speech recognizers require each speaker to undertake tedious training of the recognizer before it can understand his or her voice patterns. While it has been suggested that the templates which store the voice patterns may be located in a common database, with the system selecting the template for a given recognition session from the speaker's telephone extension, each speaker must nonetheless train the system before using it. A user new to the system calling from an outside telephone line will find this procedure unacceptable. Also, a successful telephonic speech recognizer must be capable of rapid context switches so that speech related to various subject areas is accurately recognized. For example, a system trained for general Office Correspondence will perform poorly when presented with strings of digits.
The Sphinx system, first described in the Ph.D. dissertation of Kai-Fu Lee ("Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The Sphinx System;" Kai-Fu Lee; Carnegie Mellon University, Department of Electrical and Computer Engineering; April 1988; CMU-CS-88-148), represented a major advance over previous speaker-dependent recognition systems in that it was both speaker independent and capable of recognizing words from a continuous stream of conversational speech. This system required no individualized speaker enrollment prior to effective use. By contrast, some speaker-dependent systems require speakers to be re-enrolled every four to six weeks, and require users to carry a personalized plug-in cartridge to be understood by the system. Also, because continuous speech recognition requires no pauses between words, the Sphinx system represents a much more user-friendly approach for the casual user of a speech recognition system. This will be an essential feature of telephonic speech recognition systems, since the users will have no training in how to adjust their speech for the benefit of the recognizer.
A speech recognition system must also offer real-time operation with a given modest vocabulary. However, the Sphinx system still had some of the disadvantages of the prior speaker-dependent recognizers, in that it was programmed to operate on a single machine in a low-noise environment using a microphone and a relatively constrained vocabulary. It was not designed for multi-user support, at least with respect to users at different locations and multiple recognition vocabularies.
Conventional speech processing systems commonly employ a speech recognition module which transforms input signals representing speech utterances into discrete representations that are compared to stored digital representations (templates) of expected words or speech sound units. The input speech signals are "recognized" usually by using a statistical algorithm to measure and detect a match to a corresponding word or sound template. Speech processing systems and algorithms are usually designed for one or more particular modes of operation, e.g., speaker-dependent or independent speech recognition, text- or application-dependent or independent speech recognition, speaker verification (authentication of identity), speaker recognition (selection from a number of candidates), or speaker monitoring (identity, direction, etc.). The design of such systems can vary widely with the application, speaker vocabulary, syntax, or environment of use.
Over the past several years, speech processing technology has achieved a level of performance sufficient to admit the introduction of successful commercial products. Development work continues to further improve the accuracy, reduce the vulnerability, and expand the capabilities of such systems. However, progress toward improvement has been limited by the available tools for system and algorithm development.
One factor limiting progress is that error rates have become low enough, for example in text-dependent speaker verification, that a large test must be performed to ascertain whether an improvement has been made. To illustrate, if the probability of false acceptance is on the order of 1/1000, and the test is designed to observe 30 errors, then about 30,000 trials are needed. Performing such a test using a simulation running on a time-sharing computer could take weeks or months. To mitigate this problem, tests may be run using a fast special-purpose hardware implementation of the recognition algorithm. However, this leads to a second problem: making changes to the algorithm may be very difficult because of the constraints imposed by the hardware or software.
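The trial count in the illustration above follows from dividing the number of errors the test is designed to observe by the per-trial error probability. The following is a minimal sketch of that arithmetic only; the function name `trials_needed` and its parameters are hypothetical and are not taken from any system described here.

```python
import math
from fractions import Fraction


def trials_needed(error_probability: Fraction, target_errors: int) -> int:
    """Expected number of trials required to observe `target_errors` errors.

    Assumes independent trials with a fixed per-trial error probability, so
    the expected error count after n trials is n * error_probability.
    """
    if not 0 < error_probability <= 1:
        raise ValueError("error_probability must be in (0, 1]")
    if target_errors < 1:
        raise ValueError("target_errors must be at least 1")
    # Exact rational arithmetic avoids floating-point rounding in the division.
    return math.ceil(target_errors / error_probability)


# False-acceptance probability of 1/1000, test designed to observe 30 errors.
print(trials_needed(Fraction(1, 1000), 30))
```

Halving the error probability doubles the required number of trials, which is why lower error rates make each evaluation more expensive.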
A third important factor is that the recognition system itself influences the user's speaking behavior. This influence is absent if the user's speech input is prerecorded and the user does not have a real-time interaction with the system. The environment in which the system is installed, the details of the user interface, and the feedback of past acceptance or rejection decisions can all affect the user's interaction with the system. Thus, valid testing in the intended environment of use requires a real-time implementation of the recognition algorithm and an accurate simulation of the user interface.
In many institutions the phone calls placed by a patient/client or prison inmate are primarily, if not exclusively, collect calls. Collect calls initiated by a patient/client must be indicated as such to the called party. In addition, calls placed by an inmate to an outside party often begin with a prerecorded message stating that the call or collect call is from "a prison" and is being placed by "prisoner's name." In the above cases the called party is usually asked to dial a digit, commonly a "0" or a "1", to accept the call or the attendant charges. The phone system providing such service must be able to detect such acceptance both as a dual-tone multi-frequency ("DTMF") tone response from a "Touch-Tone" phone and as the equivalent response from a pulse-dial telephone. ("Touch-Tone" is a trademark of the AT&T Corporation.)
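By way of illustration only, detection of a DTMF acceptance digit can be sketched with the Goertzel algorithm, a common technique for measuring signal energy at a few known frequencies. Nothing in the foregoing specifies a detection method; the choice of Goertzel, the 8000 Hz sample rate, the 50 ms tone length, and all function names below are assumptions. The sketch assumes a clean tone is present and omits the silence, noise, and twist checks a practical detector would need; it does not address pulse-dial detection.

```python
import math

# Standard DTMF row and column frequencies in Hz, with the keypad layout.
ROW_FREQS = (697.0, 770.0, 852.0, 941.0)
COL_FREQS = (1209.0, 1336.0, 1477.0, 1633.0)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Return the signal power near `freq` using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_dtmf_key(samples, sample_rate=8000):
    """Pick the key whose row and column frequencies carry the most power.

    Assumes the samples contain exactly one DTMF tone (no silence check).
    """
    row_powers = [goertzel_power(samples, f, sample_rate) for f in ROW_FREQS]
    col_powers = [goertzel_power(samples, f, sample_rate) for f in COL_FREQS]
    row = max(range(4), key=row_powers.__getitem__)
    col = max(range(4), key=col_powers.__getitem__)
    return KEYPAD[row][col]


def synth_dtmf_key(key, sample_rate=8000, duration=0.05):
    """Synthesize a test tone for `key` (hypothetical helper for the demo)."""
    row = next(r for r, keys in enumerate(KEYPAD) if key in keys)
    col = KEYPAD[row].index(key)
    n = int(sample_rate * duration)
    return [
        math.sin(2.0 * math.pi * ROW_FREQS[row] * i / sample_rate)
        + math.sin(2.0 * math.pi * COL_FREQS[col] * i / sample_rate)
        for i in range(n)
    ]


# The called party presses "1" to accept the call.
print(detect_dtmf_key(synth_dtmf_key("1")))
```

Goertzel is chosen here because only eight frequencies need to be examined, which is cheaper than computing a full spectrum for every block of samples.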
The clients/inmates in some institutions may be allowed to call only numbers on a pre-authorized list in order to deter fraudulent activity. A prison phone system, for example, must be able to detect the called party's flashing the hook switch in order to prevent the called party from activating three-way (i.e., conference) calling, dialing another number and then connecting the prisoner to an unauthorized phone number.
Accordingly, a need has arisen for a telecommunications system which can automate and simplify the processes currently handled by a traditional automated operator service (AOS). Specifically, a need has arisen for telephone call handling equipment which can automatically route local and long distance calls without the intervention of an outside service or live operator, and which enables the telephone owner/service provider to charge for the completion of a call or collect call while preventing three-way calling.