1. Field of the Invention
The present invention relates to systems and methods for reducing speech intelligibility while preserving environmental sounds, and more specifically to identifying and modifying vocalic regions of an audio signal using a vocal tract model from a prerecorded vocalic sound.
2. Background of the Invention
Audio communication can be an important component of many electronically mediated environments such as virtual environments, surveillance, and remote collaboration systems. In addition to providing a traditional verbal communication channel, audio can also provide useful contextual information without intelligible speech. In certain situations (elder care, surveillance, workplace collaboration and virtual collaboration spaces) audio monitoring that obfuscates spoken content to preserve privacy while allowing a remote listener to appreciate other aspects of the auditory scene may be valuable. By reducing the intelligibility of the speech, these applications can be enabled without an unacceptable loss of privacy.
In situations which involve remote monitoring such as security surveillance, home monitoring of the elderly, or always-on remote awareness and collaboration systems, people often raise privacy concerns. Elderly people have reported that video monitoring is intrusive. Kelly Caine, “Privacy Perceptions of Visual Sensing Devices: Effects of Users' Ability and Type of Sensing Device,” M.S. thesis, Georgia Institute of Technology, 2006. http://smartech.gatech.edu/dspace/handle/1853/11581. In the security scenario, sounds such as glass breaking, gunshots, or yelling are indicative of events that should be investigated. In the elder care scenario, examples of sounds which might indicate that intervention is needed are a tea kettle whistling for a long time, the sound of something falling, or the sound of someone crying. Therefore, it is desired to develop a system for monitoring audio signals that balances the privacy interests of the recorded speaker against the need for environmental and prosodic information in security and safety monitoring applications.
Remote workplace awareness is another scenario where an audio channel that gives the remote observer a sense of presence and knowledge of what activities are occurring without creating a complete loss of privacy can be valuable.
Cole et al. studied the influence of consonants and of vowels on word recognition using a subset of the sentences in the TIMIT corpus. R. A. Cole, Yonghong Yan, B. Mak, M. Fanty, T. Bailey, “The contribution of consonants versus vowels to word recognition in fluent speech,” Proc. ICASSP-96, vol. 2, pp. 853-856, 1996, and John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, Philadelphia, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1. They manually substituted noise for different types of sounds, such as consonants only and vowels only, and let subjects listen to each sentence up to five times. They found that when only vowels were replaced with noise, their subjects recognized 81.9% of the words and recognized all the words in a sentence 49.8% of the time. They found that when vowels plus weak sonorants (e.g., l, r, y, w, m, n, ng) were replaced with noise, their subjects recognized 14.4% of the words on average, and none of the sentences were understood completely correctly.
Kewley-Port et al. (2007) conducted a follow-on study to the first condition in Cole et al. (1996), in which only vowels are manually replaced with shaped noise. Diane Kewley-Port, T. Zachary Burkle, and Jae Hee Lee, “Contribution of consonant versus vowel information to sentence intelligibility for young normal-hearing and elderly hearing-impaired listeners,” The Journal of the Acoustical Society of America, vol. 122(4), pp. 2365-2375, 2007. In contrast to Cole et al., subjects were allowed to listen to each sentence only up to two times. Their subjects performed worse in identifying words in TIMIT sentences, with 33.99% of the words correctly identified per sentence, indicating that being able to listen to a sentence more than twice may improve intelligibility.
Kewley-Port and Cole both found that when only vowels are replaced by noise, intelligibility of words is reduced. Cole additionally found that replacing vowels plus weak sonorants by noise reduces intelligibility so that no sentences are completely recognized and only 14.4% of the words are recognized.
For audio privacy, it is desired to reduce the intelligibility of words to less than 14.4%, and ideally as close to 0% as possible, while still keeping most environmental sounds recognizable and keeping the speech sounding like speech.