Telephone conversations are frequently recorded or otherwise monitored in controlled-environment facilities. In prisons, for instance, such monitoring is important because inmates have been known to orchestrate crimes over the telephone. Generally, a resident's conversations may be recorded or monitored for a number of reasons, including to maintain a record for future review, to determine where a resident may be hiding after escaping, to use as evidence in connection with crimes committed, or to detect a planned future crime or other activity of interest (e.g., a riot, escape attempt, or suicide attempt). Unfortunately, authorities are typically able to listen to only a fraction of the recordings made each day. Moreover, this extremely labor-intensive task is highly subject to human error. A human operator may misinterpret crucial elements or keywords spoken during a monitored conversation, or miss important conversations altogether. Further, the effectiveness of a human operator in recognizing keywords is limited by that operator's knowledge of which keywords to listen for, and because knowledge and experience vary from one operator to the next, the effectiveness of the monitoring varies as well. Also, the use of a human operator in monitoring conversations introduces the possibility that the operator's discrimination or biases, whether intentional or not, will affect the operator's analysis of the conversations.
Meanwhile, automatic speech processing systems have been developed in the art. Conventional speech processing systems commonly employ a speech recognition module, which transforms captured audio signals into discrete representations that are compared to stored digital representations of expected keywords or sounds. Words spoken during a conversation can be recognized by using statistical algorithms or phonetic-based algorithms that measure and detect matches to corresponding keywords. Nevertheless, because different speech recognition applications face different practical challenges, the design of such systems can vary widely according to vocabulary, syntax and, more importantly, the environment in which the system is used. Further, the accuracy of such systems may depend on many factors, and may degrade substantially in environments having low audio quality and/or a wide variety of speakers (e.g., with different dialects, accents, slang, etc.) who are not motivated to cooperate in the capture of their conversations. By contrast, in certain situations individuals are motivated to cooperate with a speech recognition system in order to improve its accuracy. For example, an individual who wishes to use a speech recognition system to transcribe dictation into a word processor document may “train” the system (e.g., by reading aloud certain words, phrases, documents, etc. that are known by the system) so that the system can adapt its operation to the individual's specific speaking patterns, dialect, accent, etc. In many other situations, users are motivated to speak clearly to improve the accuracy of the speech recognition system, such as when interacting with a voice response unit (VRU) to navigate a menu presented via a telephony system.
Of course, in many environments, such as prisons, individuals have no motivation to speak clearly or otherwise cooperate in having their conversations accurately processed by a speech recognition system, and may not even be aware that a conversation is being captured and processed by such a system. Further, prison inmates generally do not cooperate to “train” a speech recognition system. These and other factors substantially increase the difficulty of accurately detecting keywords with a speech processing system in many environments, such as prisons.
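
By way of illustration only, the keyword comparison performed by a conventional system as described above (discrete representations of recognized speech compared against stored representations of expected keywords) may be sketched as follows. The sketch assumes that the audio has already been converted into a sequence of text tokens, and it substitutes simple approximate string matching for the statistical or phonetic-based algorithms of an actual recognizer. The function name spot_keywords, the keyword list, and the 0.8 threshold are hypothetical and are not taken from any particular system.

```python
from difflib import SequenceMatcher

# Hypothetical watch list; an actual deployment would define its own keywords.
KEYWORDS = ["escape", "riot", "contraband"]


def spot_keywords(tokens, keywords=KEYWORDS, threshold=0.8):
    """Return (position, token, keyword, score) for every token that
    approximately matches a stored keyword.

    The threshold is an assumed tuning parameter: lower values tolerate
    more recognition errors but produce more false matches.
    """
    hits = []
    for position, token in enumerate(tokens):
        # Normalize case and strip surrounding punctuation before comparing.
        normalized = token.lower().strip(".,!?\"'")
        if not normalized:
            continue
        for keyword in keywords:
            # Similarity in [0, 1]; 1.0 means the strings are identical.
            score = SequenceMatcher(None, normalized, keyword).ratio()
            if score >= threshold:
                hits.append((position, token, keyword, round(score, 2)))
    return hits


if __name__ == "__main__":
    # "escap" stands in for an imperfectly recognized word.
    transcript = "they said the escap is planned after the riot".split()
    for hit in spot_keywords(transcript):
        print(hit)
```

Even in this simplified form, the sketch shows why the environment matters: the comparison can only succeed if the recognized tokens are close to the stored keywords, which is less likely when audio quality is low or speakers use unfamiliar dialects, accents, or slang.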