As the hardware and software needed to record conversations in digital form have become more affordable in recent years, recording and archiving of conversations such as customer service calls, teleclasses, business teleconferences, and calls made by prison inmates has become routine. As digital voice recorders have become more economical and easier to use, their use for dictation and note taking has steadily increased. It is expected that the availability of portable digital devices capable of audio recording (such as MP3 player/recorders, cell phones, and digital voice recorders) will continue to increase for many years to come, and that the uses of these devices will continue to expand. Indeed, we are approaching the time when audio recording of one's entire lifetime experience will be practical and economical. As the amount of monitored and stored audio data increases, there is an ever-increasing need for technological tools which can extract information from digital audio data. Background information on a number of conversation-recording market segments, as well as background information on speech recognition, voice verification, and voice identification, is presented below.
Prison Market
Modern correctional institutions face many challenges concerning phone calls made by inmates. One intent of correctional institutions is to restrict inmates from making phone calls to persons they should not be contacting. To help accomplish this aim, many modern prison phone systems require inmates to use an identifying PIN to use the phone system, and the phone system enforces limitations on numbers which can be directly dialed, based on the prisoner's individual PIN. Many prison phone systems, for instance, limit a given prisoner to dialing one of a pre-approved set of phone numbers when making calls. One example of the type of phone call inmates are not allowed to make is a call to another convicted criminal (either in another prison, or out on parole). Another example is a call which threatens, intimidates, or harasses someone. Another type of call which prisoners have occasionally made, and which has embarrassed officials, is a call to a public radio station, where the prisoner winds up speaking on the radio without authorization. One way in which inmates circumvent the limited-dialing restrictions of modern prison phone systems is to make a phone call to an “allowed party”, who has invited over to his or her residence a “disallowed party”, who then participates in the phone call after the phone is initially answered by the allowed party. Another way is to make a phone call to a friend who has three-way calling, and then have that person conference in a person at another telephone number (which is not approved for the prisoner to be dialing). A third way is to have someone at an allowed number set their phone to call-forward to another number (which is not approved for the prisoner to be dialing).
One brand of prison phone systems boasts a “third-party-call-indicating click detector” feature, which is designed to detect a set of supposedly telltale click sounds on the line when a third party is conferenced on to the line. Such detectors are unfortunately unreliable at best, because many modern telephone systems don't create any particular noises on the line when conferencing in a third party, or when forwarding a call to another number. Nonetheless, prison officials have been motivated enough by the promise of such systems to purchase phone system upgrades. Indeed, word of the existence of such systems has spread among inmates, along with the belief that if inmates in a conversation (where a third party is being conferenced in) are making enough noise at the time of the conferencing clicks, then the system will not detect the clicks.
To continue to market “conference call click detecting” systems to prisons in the face of such stories, manufacturers of such systems have utilized phone system hardware that separates the electronic signals produced by the phone the inmate is talking on at the prison, from the electronic signals coming in from the outside phone network. Telecommunications with the incoming and outgoing signals separated is sometimes referred to as four-wire telephony (in contrast to the two-wire telephony systems typically used in homes, where incoming and outgoing signals share the same pair of wires). We will also refer to this four-wire technique in this document as “telephonic directional separation”. When click detection algorithms are run on only the signals coming in from the outside phone network, clicks can be detected (if they exist) regardless of how much noise a prisoner makes on a phone at the prison. In addition to click detection methods, tone detection methods such as those described in U.S. Pat. No. 5,926,533 (which is herein incorporated by reference) are known in the art. However, if a given outside phone system accomplishes call conferencing without creating clicks or tones, the call conferencing cannot be detected through click or tone detection. There is a need for innovative technology which can detect conference calls in situations where no tell-tale clicks or tones are present.
A compounding problem facing corrections facilities today is that detecting and automatically disconnecting a call based on the fact that it is a conference call or a forwarded call may not be the right thing to do in some circumstances. For instance, if someone an inmate is allowed to call at home sets his home phone to forward to his cell phone if he doesn't answer at home, the call should be allowed to go through. Likewise, if one person an inmate is allowed to call wants to conference in another person that the inmate is allowed to call, such a call should not be automatically disconnected. There is a need for innovative technology which will not interrupt conference calls and forwarded calls which should be allowed to take place, while automatically disconnecting instances of call forwarding and conference calling which should not be allowed to take place.
The quantity of phone calls made on a daily basis from modern correctional institutions is large, and even though many correctional institutions record all phone calls made by inmates, it is financially infeasible to manually review, spot monitor, or manually spot review all phone calls made. Even if such manual monitoring were feasible, persons monitoring the calls would be unlikely to know if a given call was forwarded to someone at a different number than the number that was dialed, and the entire call might have to be listened to in order to detect an instance of conferencing in a third party. There is a need for more automated monitoring with innovative features which would statistically allow a high degree of accuracy in pinpointing phone calls which went to an un-allowed party.
Even when inmates are talking to allowed parties, it is desirable to prevent inmates from facilitating illegal activity via their phone calls. Techniques (such as described in U.S. Pat. No. 6,064,963, which is herein incorporated by reference) are known in the art for automatically spotting key words in conversations. Unfortunately, it can be difficult to know what key words to look for, because inmates know that all their calls are being recorded, so they may be unlikely to speak about prohibited subjects in a directly obvious manner. Even if prison officials reviewed every phone call in full, it would be challenging to figure out the meaning of what was being said if part or all of the conversation were essentially in code. There is a need for technological advances which can aid prison officials in detecting phone calls about prohibited subjects, and there is a need for technological advances which can provide prison officials with clues to help decipher conversations which are partly “in code”.
Correctional institutions are not only responsible for preventing inmates from engaging in illegal and/or harmful activities, they are also charged with rehabilitating inmates. One key factor which can aid in rehabilitating inmates is monitoring each prisoner's psychological state of mind. Monitoring inmates' phone calls can give excellent clues to inmates' states of mind, but prisons don't have the budget to have even unskilled personnel monitor the majority of phone calls made, and the level of training and attentiveness that would be required to monitor the majority of phone calls and keep psychological notes is not reasonably feasible for prisons to expend. There is a need for innovative technology and automated systems to help prison officials track the psychological states of mind of inmates.
Another challenge facing prison officials is the challenge of maintaining certainty about who is making which calls. Although many modern prison phone systems require a prisoner to enter a PIN to make calls, it is still possible for inmates to share PINs with each other, which gives them access to dialing numbers which are not on their “allowed phone number” list. There is a need for more reliable ways for prison officials to be able to detect when inmates are directly dialing non-allowed phone numbers by using identifying information of other inmates. It has been proposed to use digital signal processing Speaker Identification techniques (such as those described in U.S. Pat. No. 6,519,561, which is herein incorporated by reference) in place of PINs to identify which inmate is making a call, but speaker identification technology is nowhere near as reliable as fingerprinting, so such an identification system has not been deemed a viable substitute for PINs.
Speaker recognition technology relies on extracting from human speech certain characteristics of the fundamental vibration rate of the speaker's vocal cords, and certain information about the resonances of various parts of the vocal tract of the person speaking, which are indicative of the physiology of that particular person's vocal tract. There are two problems that lead voiceprints to be far less individuated than fingerprints. The first problem is that there is not as much variation in the physiology of typical people's vocal tracts to provide as rich a differentiation as fingerprints provide. The second problem is that each given person's vocal tract characteristics actually vary in a number of ways depending on time of day, how much and how loudly the person has been talking that day, whether or not the person has a cold, etc.
Some modern prison phone systems use voice verification in conjunction with PINs, to make it more difficult for one inmate to falsely identify himself as another inmate. Voice verification has less stringent requirements than voice identification. In voice verification, the system is typically simply ensuring that the person speaking a pass phrase has a voice that is “close enough” to the voice characteristics of the person whose PIN is used. Even with voice verification augmenting PIN usage, one inmate might “share his identity” with another inmate, by entering his PIN and speaking his pass phrase, and then handing the phone off to another inmate. Or an inmate may use a pocket dictation recorder to record another inmate's pass phrase, and then play it into the phone at the appropriate time. There is a need for more robust inmate identification technology which prevents one inmate from “handing off” his identity to another inmate in a way that would allow the other inmate to make calls to numbers which he would otherwise not be allowed to call.
The only data most modern prison phone systems keep track of and make easily searchable are the numbers dialed; the time, date, and duration of each call; the inmate who originated the call; the reason for call termination (regular termination, 3-way call termination, out-of-money, etc.); and the type of call (collect, prepaid, debit, etc.). There is a need for tracking innovative metrics which allow prison officials to more accurately pinpoint which call recordings are worthy of human review, and speech-to-text technologies only partially address this need. It may for instance be desirable to detect when an inmate is giving orders or threatening someone. This may be difficult to do from vocabulary alone, especially since the prisoner knows the call is being monitored, and may therefore speak “in code”. There is also a need for innovative technologies that offer real-time detection of prohibited calls (through detection of non-allowed call participants, and/or through the nature of the dialog between the inmate and the called party or parties), and there is a need for a system which offers prison officials the opportunity to quickly make a decision in real time as to whether a given call should be interrupted, and to interrupt the phone call if needed based on the real-time content of the call.
Customer Service Market
In the customer service industry, it is common for all calls to be recorded, and for a cross-section of calls to be monitored live and other calls to be reviewed later with the aim of furthering the training of customer service representatives, and increasing customer retention. The increased use of Interactive Voice Response (IVR) systems in the modern customer service industry has in many cases exacerbated the frustration that consumers experience, because one is often communicating with a computer rather than a person when initially calling a customer service department. Some companies have recently made available software designed to detect frustration on the part of consumers dealing with customer service departments. There is a further need for innovative technologies which can aid in real-time detection of live conversations (between customers and customer service agents) that are “not going well”, so that customer service agents have their situational awareness increased, and/or customer service supervisors can intervene, possibly saving valuable customer loyalty. There is also a need for innovative technologies which can give customer service agents feedback and coaching to help them deal more effectively with customers.
As in other industries where large numbers of phone calls are monitored, today's technology makes it easy for a company to record and archive all customer service phone calls, but technologies are lacking in the area of automatically sorting recorded calls and flagging which ones are good candidates to be listened to by persons aiming to glean critical information, or insights which could be used to improve customer service. One challenge facing customer service call center managers is finding a way to usefully keep easily searchable records containing relevant data about recorded phone calls. Companies such as CallMiner, Inc. have begun to make products and services available which use Large-Vocabulary Continuous Speech Recognition (LVCSR) speech-to-text conversion to convert archived audio to text. While today's large-vocabulary continuous speech recognition technology has achieved reasonable accuracy when trained for a particular user, it is far less accurate in converting speech for users on whose speech the system has not been trained, and further accuracy problems crop up when converting speech of more than one person in a multi-party conversation. Nevertheless, products and services such as those offered by CallMiner, Inc. have reached the point where their phrase and word searching functions have been deemed useful by many customer service groups.
In some customer service departments, recording of customer service phone calls also serves the purpose of legal documentation. This is true, for instance, in financial institutions such as banks and brokerages.
Teleclasses, Meetings, Lectures, etc.
Modern technologies such as cell phones and the internet have significantly increased people's ability to be “connected” in a variety of situations. Business meetings, conferences, and classrooms which historically took place only as groups of people coming together face to face are taking place in a variety of new forms, including teleconferences, internet group voice chat sessions, video conferences and classrooms, mixed voice and text group chat sessions, and combinations of these technologies. With ever-increasing pressures for flexibility in business, it has become commonplace for the audio portions of meetings (particularly teleconferences) to be recorded and archived, both for record-keeping purposes and so that persons not able to participate live in the meeting can listen to what transpired later (for instance, by cell phone, or by downloading an MP3-compressed recording of the meeting to a portable MP3 player and listening to the meeting while on the go, such as while traveling, commuting, or jogging). Distance learning organizations such as Coach University and Coachville routinely record teleclasses and make them available for download or streaming by students (both in RealAudio and MP3 format).
Even with the availability of techniques such as voice logging, and compressed audio streamable and downloadable formats such as RealAudio and MP3, there is a need for new and innovative technologies which will allow persons reviewing audio recordings of classroom sessions, teleconferences, meetings and the like to more rapidly find the portions of the recording that may be of most relevance or interest.
Speech Processing
Computational techniques for converting spoken words to text or phonemes (speech recognition), techniques for automatically identifying a person by voice (speaker identification), and techniques for automatically verifying that a particular person is speaking (speaker verification) typically employ techniques such as spectrographic analysis to extract key features of different people's voices. The following two paragraphs are included to familiarize the reader with some terms and graphical representations used in spectrographic analysis.
A black & white spectrogram of the utterance “phonetician” (the time-domain waveform of which is shown in FIG. 2) is shown in FIG. 3. The spectrogram may be thought of as being composed of a set of vertical stripes of varying lightness/darkness. Each vertical stripe may be thought of as representative of the frequency vs. amplitude spectrum resulting from a Fourier transform of a short time window of the time-domain waveform used to derive the spectrogram. For instance, the spectrum of a short time slice starting 0.15 seconds into the utterance whose spectrogram is depicted in FIG. 3 (representing the spectrum of the beginning of the “o” vowel in “phonetician”) may be represented either by the graph in FIG. 4 or by the vertical stripe 300 of the spectrogram in FIG. 3. The dark bands of vertical stripe 300 may be thought of as representing the peaks of the spectrum in FIG. 4. Thus a spectrogram represents a series of spectral snapshots across a span of time. An alternative way of representing a spectrogram is shown in FIG. 6, where the sequential time slices are assembled in a perspective view to appear as a three-dimensional landscape.
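The construction of a spectrogram from short time windows can be illustrated with a minimal sketch. It is an illustration only, not the method used to produce the figures: the function names, the Hann window, the 64-sample frame length, the 32-sample hop, and the synthetic 1000 Hz test tone are all assumptions chosen for brevity, and a naive discrete Fourier transform stands in for the fast transform a practical system would use.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude spectrum of one frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc))
    return magnitudes


def spectrogram(samples, frame_len=64, hop=32):
    """Return a list of spectra, one per short time window ("vertical stripe")."""
    # Hann window (an assumed choice) to reduce leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    stripes = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + t] * window[t] for t in range(frame_len)]
        stripes.append(dft_magnitudes(frame))
    return stripes


# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

stripes = spectrogram(samples)
# The strongest bin of the first stripe corresponds to the darkest band.
peak_bin = max(range(len(stripes[0])), key=lambda k: stripes[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(stripes), "stripes; strongest band near", peak_hz, "Hz")
```

Each element of `stripes` plays the role of one vertical stripe of the spectrogram, and plotting its values against frequency would give a graph of the kind described for a single time slice.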
The peaks in the spectrum in FIG. 4 (or equivalently, the dark bands in stripe 300) are referred to as the formants of speech. These peaks fall on harmonics of the fundamental vibration rate of the vocal cords as the speaker pronounces different utterances, and their relative heights and how those relative heights change throughout speech are indicative of the physiology of the particular speaker's vocal tract. Both the fundamental vibration rate of the vocal cords (shown in FIG. 3 over the time span of the utterance of FIGS. 2 and 6) and the relative amplitudes of the speech formants vary over time as any given speaker speaks. Speaker recognition and speaker verification utilize the differences between the spectral characteristics (including variations over time and different utterances) of different people's voices to determine the likelihood that a particular person is speaking. Various techniques are known in the art for extracting from a speech sample spectral data which may be viewed as indicative of identifying characteristics of the speaking person's vocal tract. Such data is commonly referred to as a voice print or voice signature. The fundamental vibration rate of a given person's vocal cords (and certain other geometric characteristics of that person's vocal tract) can and often do vary with time of day, length of time the person has been continuously talking, state of health, etc. Thus voiceprints are not as invariant as fingerprints.
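As an illustration of how a fundamental vibration rate might be estimated from a waveform, the following minimal sketch uses autocorrelation, which is one commonly used approach and is an assumption here rather than a technique named in this document. The function name, the 80 to 400 Hz search range, and the synthetic two-harmonic test signal are likewise hypothetical.

```python
import math


def estimate_fundamental_hz(samples, sample_rate, min_hz=80, max_hz=400):
    """Estimate the fundamental frequency by locating the autocorrelation peak."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = min_lag
    best_value = float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        value = sum(samples[t] * samples[t + lag] for t in range(len(samples) - lag))
        if value > best_value:
            best_value = value
            best_lag = lag
    return sample_rate / best_lag


# Hypothetical voiced sound: a 200 Hz fundamental plus a weaker second harmonic.
sample_rate = 8000
samples = [
    math.sin(2 * math.pi * 200 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * t / sample_rate)
    for t in range(400)
]

fundamental = estimate_fundamental_hz(samples, sample_rate)
print("estimated fundamental:", fundamental, "Hz")
```

Running the same estimate on successive short windows of real speech would give a track of the fundamental over time, which is the kind of time-varying quantity the paragraph above describes.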
Speech recognition technologies for use in such applications as speech-to-text conversion have been commercially available in products such as Dragon Naturally Speaking™ (made by Nuance Communications Inc.) and ViaVoice™ (made by IBM) for a number of years now, and recently researchers have also begun to develop software for recognizing the emotional content of speech. The word prosody (defined at Princeton University as “the patterns of stress and intonation in a language”) is often used in the field of affective computing (computing relating to emotion) to refer to emotion-indicating characteristics of speech. Prosody measurements may include detecting such speech characteristics as word rate within speech, perceived loudness, sadness, happiness, formality, excitement, calm, etc. Perceived loudness is distinguished here from absolute loudness by the way the character of someone's voice changes when he or she yells as opposed to talking normally. Even if someone used a “yelling voice” quietly, one would be able to understand that the voice had the character of “yelling”. Within this document, we will expand the meaning of the word prosody to include all non-vocabulary-based content of speech, including all emotional tonal indications within speech, all timing characteristics of speech (both within a given person's speech, and timing between one person in a conversation stopping speaking and another person in the conversation speaking), laughter, crying, accentuated inhalations and exhalations, and speaking methods such as singing and whispering. References in which the reader may learn more about the state of the art in prosody detection include:
1) MIT Media Lab Technical Report No. 585, January 2005, which appeared in Intelligent User Interfaces (IUI 05), 2005, San Diego, Calif., USA.
2) R. Cowie, D. Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. Emotion recognition in human computer interaction. IEEE Signal Processing Magazine, 2001.
3) P. J. Durston, M. Farell, D. Attwater, J. Allen, H.-K. J. Kuo, M. Afify, E. Fosler-Lussier, and L. C.-H. Oasis natural language call steering trial. In Proceedings Eurospeech, pages 1323-1326, Aalborg, Denmark, 2001.
4) R. Fernandez. A Computational Model for the Automatic Recognition of Affect In Speech. PhD thesis, MIT Media Lab, 2004.
5) H. Quast. Absolute perceived loudness of speech. Joint Symposium on Neural Computation, 2000.
6) M. Ringel and J. Hirschberg. Automated message prioritization: Making voicemail retrieval more efficient. CHI, 2002.
7) S. Whittaker, J. Hirschberg, and C. Nakatani. All talk and all action: Strategies for managing voicemail messages. CHI, 1998.
The above references are herein incorporated by reference.
Within this document, the terms “voice print”, “voice signature”, “voice print data”, and “voice signature data” may all be used interchangeably to refer to data derived from processing speech of a given person, where the derived data may be considered indicative of characteristics of the vocal tract of the person speaking. The terms “speaker identification” and “voice identification” may be used interchangeably in this document to refer to the process of identifying which person out of a number of people a particular speech segment comes from. The terms “voice verification” and “speaker verification” are used interchangeably in this document to refer to the process of processing a speech segment and determining the likelihood that that speech segment was spoken by a particular person. The terms “voice recognition” and “speaker recognition” may be used interchangeably within this document to refer to either voice identification or voice verification.
In order for the voice of a given person to be identified or verified in voice recognition processes, a sample of that person's speech must be used to create reference data. This process is commonly referred to as enrollment, and the first time a person provides a speech sample is commonly referred to as that person enrolling in the system.
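The relationship between enrollment, identification, and verification as defined above can be sketched as follows. This is a toy illustration under stated assumptions: the class name `VoiceRegistry`, the three-element feature vectors, the use of Euclidean distance, and the distance threshold are all hypothetical, and real voice print data would be derived from speech rather than typed in.

```python
import math


class VoiceRegistry:
    """Toy store of enrolled voice prints, keyed by speaker identifier."""

    def __init__(self):
        self._prints = {}

    def enroll(self, speaker_id, feature_vector):
        # Enrollment: keep reference data derived from the speaker's sample.
        self._prints[speaker_id] = list(feature_vector)

    def identify(self, feature_vector):
        # Identification: which enrolled speaker is the closest match?
        return min(
            self._prints,
            key=lambda sid: math.dist(self._prints[sid], feature_vector),
        )

    def verify(self, claimed_id, feature_vector, max_distance=1.0):
        # Verification: is the sample "close enough" to the claimed speaker?
        return math.dist(self._prints[claimed_id], feature_vector) <= max_distance


registry = VoiceRegistry()
registry.enroll("speaker_a", [1.0, 2.0, 3.0])
registry.enroll("speaker_b", [4.0, 1.0, 0.5])

sample = [1.1, 2.1, 2.8]  # hypothetical features from a new utterance
identified = registry.identify(sample)
accepted_a = registry.verify("speaker_a", sample)
accepted_b = registry.verify("speaker_b", sample)
print(identified, accepted_a, accepted_b)
```

Identification answers "who out of the enrolled group is speaking", while verification answers only "is this the claimed person", which is why the latter can operate with a single stored reference.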
There are several ways that voice recognition algorithms can be thought of as testing a given person's voice to see if it matches a previously stored voice print. The first way is that the voice print data can be thought of as a numerical vector derived from the reference speaker's voice. A second numerical vector can be derived in a like manner from the voice under test, and a numerical algorithm can be used to compare the two vectors in a way where the comparison produces a single number that has been found to be indicative of the likelihood of a correct match.
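A minimal sketch of this first approach follows. Cosine similarity is used here only as one plausible comparison that reduces two vectors to a single number; the document does not specify the algorithm, and the function name and the four-element example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Compare two feature vectors; 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical voice print vectors (real ones would be derived from speech).
reference_print = [1.0, 2.0, 3.0, 4.0]
same_speaker = [1.1, 1.9, 3.2, 3.9]
other_speaker = [4.0, 1.0, 0.5, 2.0]

score_same = cosine_similarity(reference_print, same_speaker)
score_other = cosine_similarity(reference_print, other_speaker)
print(round(score_same, 3), round(score_other, 3))
```

A higher score indicates a closer match; how a given score maps to a likelihood of a correct match would have to be established empirically.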
Since the absolute likelihood of a correct match is not independent of the voices of all the people who might be tested who are not a match, a second, more useful way compares the voice signature of the person being tested to voice signatures from a number of other individuals, or to an average voice signature derived from a number of people. The likelihood that the voice signature under test is the voice that was used to derive the reference voice signature is then derived from the extent to which the voice signature under test matches the reference voice signature better than it matches other individual voice signatures, or the extent to which the voice signature under test matches the reference voice signature better than it matches the “average” voice signature of a population.
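The idea of scoring a match relative to other voices can be sketched as follows. The sketch assumes cosine similarity as the underlying comparison and subtracts the mean similarity to a small comparison group; both choices, along with the function names and example vectors, are hypothetical and are not specified by this document.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors, in the range -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohort_normalized_score(test, reference, cohort):
    """How much better the test voice matches the reference than other voices."""
    target = cosine_similarity(test, reference)
    background = sum(cosine_similarity(test, other) for other in cohort) / len(cohort)
    return target - background


# Hypothetical voice signatures.
reference = [1.0, 2.0, 3.0, 4.0]
cohort = [[4.0, 1.0, 0.5, 2.0], [2.0, 2.0, 2.0, 2.0], [3.0, 0.5, 1.0, 1.0]]
genuine_test = [1.1, 1.9, 3.2, 3.9]
impostor_test = [2.1, 2.0, 1.9, 2.2]

genuine_score = cohort_normalized_score(genuine_test, reference, cohort)
impostor_score = cohort_normalized_score(impostor_test, reference, cohort)
print(round(genuine_score, 3), round(impostor_score, 3))
```

In this example the impostor's raw similarity to the reference is fairly high, but it is almost as similar to the comparison voices, so its normalized score stays near zero while the genuine speaker's score stands out.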
A third way that voice recognition algorithms can be thought of as testing a given person's voice to see if it matches a previously stored voice print is that the stored voice print may be thought of as a model which is repeatedly tested against over time using small samples of the voice under test, and the resulting test scores are averaged over time. This procedure may be used with one of the above methods to produce a likelihood score which has more certainty the longer the speech under test is listened to. This variable sample length method may have advantages in live monitoring applications and in applications where it is desirable not to waste computational resources once a desired certainty level has been attained.
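A minimal sketch of such a variable sample length procedure follows. The function name, the accept and reject thresholds, the minimum number of segments, and the example scores are all assumptions for illustration; the per-segment scores are taken as given and could come from any of the comparison methods described above.

```python
def sequential_decision(segment_scores, accept=0.8, reject=0.4, min_segments=3):
    """Average per-segment match scores and stop once the average is decisive."""
    total = 0.0
    count = 0
    for score in segment_scores:
        total += score
        count += 1
        mean = total / count
        # Only decide after enough segments have been heard (assumed minimum).
        if count >= min_segments:
            if mean >= accept:
                return "accept", count, mean
            if mean <= reject:
                return "reject", count, mean
    mean = total / count if count else 0.0
    return "undecided", count, mean


# Hypothetical per-segment scores for three different calls.
matching_call = sequential_decision([0.9, 0.85, 0.88, 0.2, 0.1])
ambiguous_call = sequential_decision([0.5, 0.6, 0.55, 0.6])
mismatched_call = sequential_decision([0.3, 0.2, 0.35, 0.9])
print(matching_call, ambiguous_call, mismatched_call)
```

The first and third calls are decided after three segments and the remaining audio is never scored, which reflects the computational saving mentioned above; the second call stays undecided, so a live monitor would keep listening.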