A. Voice Controlled Messaging Systems
In voice messaging systems (VMS) coupled to the public switched telephone network, a user (or "subscriber") usually controls the VMS by pressing push buttons by hand on a dual tone multi-frequency (DTMF) keypad of a telephone set. This method of data input is inconvenient, and even dangerous, when the user's hands, eyes, or both, are busy.
For example, when a user is operating a car phone or cordless telephone, the user may be occupied by other tasks (such as driving a car) that make manual data input difficult or dangerous. Other circumstances include use of the telephone while typing, using a computer terminal, or working at a workbench. Owners of rotary dial telephones cannot generate DTMF signals needed by most VMSs. Moreover, persons having impaired sight and persons who lack hands or have other physical handicaps may find using telephone sets difficult or impossible. In all these cases, voice commands are the only convenient means of interacting with and controlling a VMS.
General functions of VMSs are well known, as shown in U.S. Pat. Nos. 4,352,807 and 4,371,752 (Matthews et al.) which disclose voice-store-and-forward systems. In most prior art systems, the user controls all or most of the functions of a VMS by manual input of DTMF digits. For example, in the Matthews et al. '807 patent, DTMF keypresses are required for some system functions such as enrollment of message recipients. In both the '807 and '752 patents, the VMS requires DTMF input and responds with "beep" sounds rather than digitized voice prompts. FIG. 16 of the '752 patent indicates that the '752 system requires DTMF digits for user identification.
Prior attempts to automate VMSs have focused on individual elements of a system but have failed to automate the entire system. For example, U.S. Pat. No. 5,048,074 (Dugdale) simply replaces DTMF pushbuttons with foot switches.
Text to speech (TTS) conversion is a known means for supplying a text or e-mail message to a caller, as exemplified by U.S. Pat. Nos. 4,716,583 and 4,659,877. However, prior TTS systems have required use of DTMF digits to configure and operate the system, as shown in FIGS. 3a and 3b of the '583 patent. Similar systems, exemplified by U.S. Pat. No. 4,996,707, enable conversion of a facsimile (fax) document into ASCII text for routing to a TTS system. This enables audible playback of a fax. However, the '707 and similar systems have all required entry of DTMF digits for control.
Voice command systems with limited capabilities are also known, as exemplified by U.S. Pat. No. 5,051,924. This system and others require DTMF dial-up of a VMS rather than voice command access to messages in the VMS.
Prior voice messaging systems also tend to require excessive computational resources, since in typical systems a single digital signal processor (DSP) IC must perform many voice processing functions besides hands free control. U.S. Pat. No. 4,974,191 is typical of computation-intensive voice response systems. In a typical VMS, very few DSP machine cycles are available just for voice control. Thus, those of skill in the art would appreciate an efficient implementation that allows other voice-band activity of significant computational cost to run concurrently.
Another desirable feature is to have hands free processing available on all voice ports of a VMS so that any user can use hands free processing. Yet another desired feature is real time response to voice commands. The prior art fails to provide these features. For example, typical performance of the AT&T VMS, which is well known in the art, is 8.8 seconds to verify a spoken password. In contrast, one embodiment of the present invention has operated with response times of less than one-half second. Yes/no recognition has been measured at under 700 ms.
Another disadvantage of the prior art is that performance parameters are not completely configurable, i.e., the parameters cannot be changed to other values while the messaging system is operational. This is a disadvantage since configurability can be used to optimize the parameters to the desired level of performance for the available processing power and to match characteristics of the location or site of the system.
Those skilled in the art would also appreciate a totally voice controlled messaging system implemented on a general-purpose digital signal processor (DSP) which serves multiple channels of voice-band activity while using a minimum number of processor cycles for voice control processing.
B. Speaker Verification
Speaker verification methods are also known in the art, as exemplified by U.S. Pat. No. 5,056,150. The general object of speaker verification is to establish a digitally stored template for a particular speaker uttering a selected, uninterrupted word ("feature extraction"), and then upon subsequent trials to estimate the confidence level associated with the same speaker uttering the same word ("pattern matching"). Feature extraction performs transformations on the speech signal to yield a template that represents the signals being compared. Pattern matching makes a comparison between a stored template and a template generated for an input signal, and yields numeric results about the proximity of the two templates. In both processes a primary goal is eliminating undue statistical variation among separate trials. Speech recognition also involves other discrimination tasks, but the present invention relates most directly to the closeness of match between the template and the new utterance.
The prior art of speaker verification generally treats feature extraction and pattern matching separately. In general, prior art methods do not relate to a combination of feature extraction and pattern matching, which combination is disclosed in the present invention. Moreover, in the present invention feature extraction is accomplished using the smoothed group delay spectrum (SGDS), and pattern matching for speaker variation is accomplished using the hidden Markov model (HMM), a combination not known in the art.
The central function of feature extraction is to transform a brief time frame of the speech signal into a feature vector. A straightforward method is to measure the average energy of the signal over a given time frame. The same process is repeated for all the time frames of interest (such as the time needed to utter a phrase). A two-dimensional pattern is produced, which may be compared to a similarly-generated one. This time-energy method can discriminate between short and long phrases, or between speech and non-speech, but cannot recognize words or identify speakers.
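The time-energy method described above can be sketched as follows. This is only an illustrative sketch, not the method of any cited reference: the function name `frame_energies`, the frame length, the test signal, and the 0.1 threshold are all assumptions chosen for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

# Hypothetical signal: 80 samples of silence followed by 80 samples of a tone.
signal = [0.0] * 80 + [math.sin(2 * math.pi * 5 * n / 80) for n in range(80)]

# One energy value per time frame gives the time-energy pattern.
contour = frame_energies(signal, 40)

# A fixed threshold (an assumed value) separates speech-like frames from silence.
active = [e > 0.1 for e in contour]
```

The resulting contour carries one number per frame, which is why it can separate sound from silence or short phrases from long ones but holds no information about which word was spoken or by whom.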
A better method of feature extraction is to separate the signal into frequency components ("spectral analysis"). This can be done with bandpass analog filters, or in a digital signal processor by the Fourier transform. Instead of a single value for each time slice as in time-energy analysis, spectral analysis yields a set of amplitude envelopes, one for each frequency analyzed. The resulting template is like a topographic map, in which the goal is to match the location and height of peaks.
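As a rough illustration of spectral analysis by the Fourier transform, the sketch below computes one row of magnitudes per time frame using a direct discrete Fourier transform. The helper names `dft_magnitudes` and `spectrogram`, the 32-sample frame, and the two test tones are assumptions made for the example; a practical DSP implementation would use a fast Fourier transform and windowed, overlapping frames.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (direct O(N^2) evaluation)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(samples, frame_len):
    """One row of bin magnitudes per full, non-overlapping frame."""
    return [dft_magnitudes(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, frame_len)]

def tone(k, n):
    """Hypothetical test tone completing exactly k cycles in n samples."""
    return [math.sin(2 * math.pi * k * t / n) for t in range(n)]

# Two frames: the first holds a tone at bin 2, the second a tone at bin 6.
signal = tone(2, 32) + tone(6, 32)
spec = spectrogram(signal, 32)

# Location of the highest peak in each frame of the time-frequency "map".
peaks = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time slice of the map, and matching templates amounts to comparing where the peaks fall and how high they are.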
Spectral analysis can discriminate words of a language, but is poor for discriminating between speakers. Further, it is strongly affected by passing the signal through a channel that does not have a "flat" frequency response, and is affected by noise, both of which are problems in telephony.
"Cepstrum" analysis has been applied to signals containing echoes. Like Fourier analysis, it yields a spectral representation, but the independent variable is time difference (lag) instead of frequency. Its computation is approximately the same as two Fourier transforms and a nonlinear expansion. Its benefit is that the resulting lag spectrum, or cepstrum, may separate the effects of three or more sources of a difference in speech timbre, thereby enhancing discrimination among speakers.
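The application to echoes can be illustrated with the commonly used real cepstrum, in which the nonlinear step between the two transforms is a logarithm of the spectral magnitude. The sketch below is an assumption-laden example, not the computation of any cited reference: the names `dft` and `real_cepstrum`, the frame length of 64, and the click-plus-echo test signal are all chosen for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame):
    """Inverse transform of the log magnitude spectrum; the index is lag."""
    # The small floor guards against taking the logarithm of zero.
    log_mag = [math.log(max(abs(v), 1e-12)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

# Hypothetical signal: a unit click followed by a half-amplitude echo 8 samples later.
n, delay, gain = 64, 8, 0.5
frame = [0.0] * n
frame[0] = 1.0
frame[delay] = gain

ceps = real_cepstrum(frame)

# The strongest non-zero lag reveals the echo delay.
peak_lag = max(range(1, n // 2), key=lambda q: ceps[q])
```

In this example the echo appears as a peak at lag 8 with a height of roughly half the echo gain, separated from whatever the source itself contributes at other lags.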
The group delay spectrum is a differently weighted but similarly derived form of spectral analysis, and is described in Itakura & Umezaki, "Distance measure for speech recognition based on the smoothed group delay spectrum", IEEE Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 1987, pp. 1257-1260. It can yield a more prominent set of features for matching.
Performance of cepstrum and group delay can be varied by changing parameters. Manipulation of two scalar values "s" and "tau" of Equation (7) of Itakura et al. can reconfigure one into the other, or either into another spectrum. The effect is like tuning a piano. In the prior art, conventional windowing techniques are known to reduce effects of sampling or finite interval selection. The terms "windowing" and "smoothing" are often used interchangeably.
In the prior art, hidden Markov modeling (HMM) is used to establish an assumption about the underlying behavior of a physical process. In HMMs, the Baum-Welch, or "forward-backward", method is the central part of a solution to the model, but to be complete, the remainder of a solution must be specified. Poritz, "Hidden Markov models: a guided tour", ICASSP, IEEE, 1988, pp. 7-13, Section 7, describes use of the HMM and the Baum-Welch method in general speech processing. As noted in Poritz FIG. 6 and its accompanying text, use of the method must be preceded by selecting (either randomly or deterministically) initial seed values for the auxiliary function "Q", followed by application of Baum-Welch, then assessment of whether a critical point has satisfactorily been reached, then reiteration as needed.
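The seed, re-estimate, assess, and reiterate cycle can be sketched for a small discrete-observation HMM as shown below. This is a minimal sketch under stated assumptions and is not taken from Poritz: the function names, the deterministic seed values, the two-symbol observation sequence, and the use of the change in log-likelihood as the convergence test are all choices made for illustration.

```python
import math

def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes over one observation sequence."""
    T, N = len(obs), len(pi)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                   for j in range(N)]
        c = sum(row)
        scale.append(c)
        alpha.append([v / c for v in row])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def reestimate(obs, pi, A, B):
    """One Baum-Welch update; also returns the log-likelihood of the old model."""
    T, N, M = len(obs), len(pi), len(B[0])
    alpha, beta, scale, loglik = forward_backward(obs, pi, A, B)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    new_A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T - 1))
        for j in range(N):
            num = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                      / scale[t + 1] for t in range(T - 1))
            new_A[i][j] = num / denom
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T)) for k in range(M)]
             for i in range(N)]
    return gamma[0][:], new_A, new_B, loglik

def train(obs, pi, A, B, tol=1e-6, max_iter=100):
    """Iterate from the seed until the log-likelihood gain falls below tol."""
    history = []
    for _ in range(max_iter):
        pi, A, B, loglik = reestimate(obs, pi, A, B)
        converged = bool(history) and loglik - history[-1] < tol
        history.append(loglik)
        if converged:
            break
    return pi, A, B, history

# Hypothetical observations (symbols 0 and 1) and a deterministic, asymmetric seed.
obs = [0, 0, 0, 1, 1, 1] * 4
seed_pi = [0.6, 0.4]
seed_A = [[0.7, 0.3], [0.4, 0.6]]
seed_B = [[0.6, 0.4], [0.3, 0.7]]
pi, A, B, history = train(obs, seed_pi, seed_A, seed_B)
```

The `history` list records the log-likelihood at each pass, so the assessment step reduces to checking whether the latest gain is small enough. Because the procedure only finds a local critical point, a different seed may lead to a different final model.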
As is known in the art, hidden Markov models can be characterized by order and number of states. Usually an increased order coupled with an increased number of states increases the computational burden more than the product of the two. Those of skill in the art would find useful a processing system which provides a lesser processing load when the order and number of states are both increased.
Prior speaker verification methods have been used with a single microphone, e.g., for entry to a secure room. Those of skill in the art would appreciate a speaker verification method optimized for use with the telephone network and which can cope with variation in performance of telephone lines and microphones.