There are numerous technologies wherein computers analyze or make use of human speech. For example, speech recognition software can convert an oral message into a text format, either through speech-to-text conversion or by correlating the speech parameters of the oral message with “reference” speech parameters in order to interpret a participant's oral response.
Several systems also exist for computerizing the entire voice process. Both mechanical and electronic systems have been successfully employed in the design of computer voices. Computer hardware known as “voice platforms” or “text-to-speech engines” is well known in the art. For example, automated voice messaging systems, such as those described in U.S. Pat. Nos. 6,487,533 and 6,483,899, retrieve a text message and automatically generate a language identifier corresponding to the text message. The text message is converted into computer-generated speech in a language corresponding to the language identifier. The systems then store or receive oral messages, convert each oral message to a text message using speech recognition software, and transmit the text message to an intended recipient. The oral message may also be sent as an attachment to the message.
Further, Interactive Voice Response (also known as IVR) is a mature technology that has been used for decades. Its primary functions have been to collect information from telephone callers, process that information and supply information to the caller. Over the years, Interactive Voice Response has developed into a highly productive tool that automates many processes that would otherwise require the time and expense of human beings to complete the task.
For example, Interactive Voice Response has been used extensively by businesses to answer consumers' questions, such as a bank providing account information using a computer over the telephone. Interactive Voice Response also includes telemarketing systems, which are becoming increasingly prevalent. These systems include a database of stored messages which are transmitted to homes across the country. The messages include advertisements and notices which can automatically provide information to tens of thousands of listeners at a time. Similarly, automated response engines are used by companies to answer questions and provide information to callers. More complicated automated response engines are capable of asking questions and automatically recording responses. The caller responses may be provided by voice articulation, or formed by callers pressing buttons on the telephone to select various prerecorded responses, combined with information that may be retrieved from some other database system based on the user's request and account information.
There has not been a system to determine the truthfulness, credibility, or intent of multiple callers simultaneously that can also leverage the capabilities of an Interactive Voice Response system or other multiple simultaneous audio input devices, such as Voice over Internet Protocol (VoIP), Session Initiation Protocol (SIP), wireless networks such as 802.11a, 802.11b, 802.11g, and 802.11x, open air, to PDAs or other wireless devices, satellite, 3G, GSM, CDMA, TDMA, cellular, etc. Nor has there been a system that can automate the information gathering process for truthfulness, credibility, or intent of multiple callers. Callers have simply provided information without any analysis as to whether the information is credible or whether the information is being submitted for dishonest purposes. As a first example, it would be advantageous if a bank could automatically detect the credibility of persons accessing bank records as a criterion for determining whether the person is authorized to access the records. Speaker intent systems could also be used by the bank as part of its criteria to determine whether the person is performing authorized bank transactions, or to determine whether the caller should be subjected to additional checks and balances before completing the transaction. Another example might be in the initial opening of an insurance claim, to identify the overall honesty or fraud risk of that particular applicant or claim.
As an additional example, Interactive Voice Response systems are sometimes used to transact sales of products over the phone. Currently, such systems do not provide any analysis as to whether the buyer is authorized to use billing information, such as a credit card. More simply, though these automated systems are capable of providing information to and soliciting information from a great number of callers at one time, they do not assess the credibility of callers or screen callers as to the truthfulness or intent of their responses. Nor do they combine this information, in an automated fashion, with other known pieces of information available from either an IVR or other data processing device in order to achieve even higher accuracy in their assessment.
Human speech is generated by the vocal cords and by turbulence as expelled air moves through the vocal tract, creating a resonance of the cavities in the head, the throat, the lungs, the mouth, the nose, and the sinus cavities. Previous experiments show three types of voice change as a result of stress. The first of these usually manifests itself in audibly perceptible changes in speaking rate, volume, voice tremor, spacing between syllables, and fundamental pitch or frequency of the voice. The second type of voice change is not discernible to the human ear, but is an apparently unconscious manifestation of the slight tensing of the vocal cords under even minor stress, resulting in a dampening of selected frequency variations. When graphically portrayed, the difference is readily discernible between unstressed or normal vocalization and vocalization under mild stress, attempts to deceive, or adverse attitudes. These patterns have held true over a wide range of human voices of both sexes, at various ages, and under various situational conditions. The third is an infrasonic, or subsonic, frequency modulation which is present, in some degree, in both the vocal cord sounds and the formant sounds. This signal is typically between 8 and 12 Hz. Accordingly, it is not audible to the human ear. Because this characteristic constitutes frequency modulation, as distinguished from amplitude modulation, it is not directly discernible on time-base/amplitude chart recordings. However, this infrasonic signal is one of the more significant voice indicators of psychological stress. In addition, some current voice-based lie detection applications also employ artificial intelligence and neural networks to obtain an emotional reading of the person's intent.
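The infrasonic frequency modulation described above can be illustrated with a minimal sketch. It assumes a pitch contour (one fundamental-frequency value per analysis frame) has already been extracted by some other means, and it measures the fraction of that contour's modulation energy falling in the 8 to 12 Hz band using a plain discrete Fourier transform. The function name `band_energy_fraction`, its parameters, and the choice of an energy fraction as the measure are illustrative assumptions; this is not the method of any particular voice stress analyzer.

```python
import cmath
import math
from typing import Sequence


def band_energy_fraction(
    contour: Sequence[float],
    frame_rate_hz: float,
    low_hz: float = 8.0,
    high_hz: float = 12.0,
) -> float:
    """Fraction of a pitch contour's modulation energy inside [low_hz, high_hz].

    contour holds one pitch value per frame (hypothetical input);
    frame_rate_hz is the number of frames per second.
    """
    n = len(contour)
    if n < 2:
        return 0.0

    # Remove the mean so that only the modulation around the average pitch remains.
    mean = sum(contour) / n
    centered = [x - mean for x in contour]

    total = 0.0
    in_band = 0.0
    # Direct DFT over the positive-frequency bins (O(n^2), fine for short contours).
    for k in range(1, n // 2 + 1):
        freq = k * frame_rate_hz / n
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power

    return in_band / total if total > 0 else 0.0
```

A contour whose pitch wobbles at 10 Hz yields a fraction close to 1, while a contour that varies only slowly yields a fraction close to 0. How such a number would relate to stress is outside the scope of this sketch.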
Systems for recognizing emotions in speech also exist. These systems use numerous methods, such as neural networks and ensembles of classifiers. For example, a voice authentication algorithm utilizing a neural network is described in U.S. Pat. No. 5,461,697. Meanwhile, classifiers that use pitch and linear predictive coding (LPC) parameters (and usually other excitation information as well) for analyzing or encoding human speech signals are described in U.S. Pat. Nos. 6,427,137 and 6,463,415.
Many particular methods of voice analysis can be selected within the general framework of LPC modeling. For example, common analytes include pitch, which corresponds to the frequency at which the larynx modulates the air stream, and formant frequencies, which correspond to resonances of the vocal tract.
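As an illustration of how one such analyte might be obtained, the following minimal sketch estimates pitch from a short frame of samples. It uses plain autocorrelation peak picking, a common technique that is substituted here for simplicity and is not itself LPC analysis. The function name `estimate_pitch_hz` and the default search range of 50 to 400 Hz are assumptions made for the example.

```python
from typing import Sequence


def estimate_pitch_hz(
    samples: Sequence[float],
    sample_rate_hz: float,
    min_hz: float = 50.0,
    max_hz: float = 400.0,
) -> float:
    """Estimate pitch as the lag with the largest positive autocorrelation.

    Returns 0.0 when no positive autocorrelation peak is found
    (for example, for a silent frame).
    """
    n = len(samples)
    # Convert the pitch search range into a lag search range (in samples).
    min_lag = max(1, int(sample_rate_hz / max_hz))
    max_lag = min(n - 1, int(sample_rate_hz / min_hz))

    best_lag = 0
    best_value = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        value = sum(samples[t] * samples[t + lag] for t in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value

    return sample_rate_hz / best_lag if best_lag else 0.0
```

Real speech would normally need windowing, voicing detection, and protection against octave errors, none of which this sketch attempts.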
Alternatively, U.S. Pat. No. 4,093,821 describes an approach wherein formant frequency distribution patterns are analyzed to produce a first output indicative of the nulls in the FM demodulated signal, a second output representing the duration of the nulls, and a third output proportional to the ratio of the total duration of nulls during a word period to the total length of the word period. The ratio is used to discriminate between theatrical emphasis and stress.
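The three outputs described above can be sketched as follows, assuming the FM demodulated signal for one word period is already available as a list of samples. The definition of a null as a sample whose magnitude falls below a threshold, the threshold itself, and the name `null_statistics` are assumptions made for illustration; the cited patent's own definition of a null and its decision rule are not reproduced here.

```python
from typing import List, Sequence, Tuple


def null_statistics(
    demodulated: Sequence[float],
    sample_rate_hz: float,
    null_threshold: float,
) -> Tuple[List[Tuple[float, float]], float]:
    """Locate nulls in one word period and compute the null-duration ratio.

    Returns (nulls, ratio), where nulls is a list of
    (start_seconds, duration_seconds) pairs and ratio is the total
    null duration divided by the length of the word period.
    """
    nulls: List[Tuple[float, float]] = []
    null_samples = 0
    run_start = None  # index where the current run of null samples began

    for i, value in enumerate(demodulated):
        is_null = abs(value) < null_threshold  # assumed definition of a null
        if is_null:
            null_samples += 1
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A run of nulls just ended; record its start and duration.
            nulls.append(
                (run_start / sample_rate_hz, (i - run_start) / sample_rate_hz)
            )
            run_start = None

    # Close a run that extends to the end of the word period.
    if run_start is not None:
        nulls.append(
            (
                run_start / sample_rate_hz,
                (len(demodulated) - run_start) / sample_rate_hz,
            )
        )

    ratio = null_samples / len(demodulated) if demodulated else 0.0
    return nulls, ratio
```

The list of null positions corresponds to the first output, the durations to the second, and the ratio to the third. Any cut-off on the ratio used to separate theatrical emphasis from stress would have to come from the cited patent or from calibration data.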
Commercial vendors of voice stress analyzers in the United States include, but are not limited to, The National Institute for Truth Verification, CCS International, Diogenes Group, Risk Technologies, and Nemesysco, as well as Makh-Shevet in Israel. Other names used to refer to voice stress analysis (VSA) include, but are not limited to, Computerized Voice Stress Analyzer or Analysis (CVSA), Lie Detector, Truth Detector, Narrative Analysis, Emotional Analysis, Psychological Analysis, Psychological Stress Evaluation (PSE), Rich Psychological Analysis, and Credibility Assessment.
The disadvantage of the known approaches to voice credibility assessment is that large-volume applications are presently impractical and not economical. For example, known approaches require personnel to operate the systems on a one-to-one basis and/or these systems cannot process large volumes of voice samples or simultaneous voice samples. In addition, specialized equipment and software must be installed at the local computer for each person performing analysis.
Thus, there is a need for a speaker intent and credibility solution that is automated and capable of recording and analyzing the responses of persons located anywhere around the globe where a communications link can be established. It would further be desirable to have an automated system capable of simultaneously analyzing the responses of numerous persons. Additionally, it would be desirable if the automated system could be applied to various applications in a variety of fields, such as insurance, unemployment, disability, welfare, homeland security, parole management, call centers and customer relationship management, security in general, banking, legal, credit card fraud, general fraud prevention, employment screening, sales priority assessment, predictive analysis, etc. In addition, dynamic prompts could be generated from a business process based on the real-time analysis of the speaker. For example, a questionable answer would cause the system to prompt for further detail or solicit more information about the suspect response.
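The dynamic prompting idea can be sketched in a few lines. The example assumes that some upstream analysis has already assigned each response a risk score between 0.0 and 1.0. The `AnalyzedResponse` type, the `next_prompt` function, the 0.7 threshold, and the fallback wording are all hypothetical and serve only to illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class AnalyzedResponse:
    question_id: str
    risk_score: float  # hypothetical scale: 0.0 (no concern) to 1.0 (high concern)


# Hypothetical generic follow-up used when no question-specific prompt exists.
DEFAULT_FOLLOW_UP = "Please provide more detail about your previous answer."


def next_prompt(
    response: AnalyzedResponse,
    follow_ups: Mapping[str, str],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return a follow-up prompt for a questionable answer, else None."""
    if response.risk_score >= threshold:
        # Prefer a follow-up tailored to the question that was flagged.
        return follow_ups.get(response.question_id, DEFAULT_FOLLOW_UP)
    return None
```

A call flow would invoke `next_prompt` after each analyzed answer and continue with its normal script whenever the result is `None`.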