Speech masking systems used to increase working comfort are well known in the art. However, such systems are ineffective at providing speech privacy. Most known systems are primarily intended to increase working comfort, and speech privacy is treated as secondary.
When only the acoustic scene reproduced by a telecommunication device is considered, the reproduction can also be restricted to the clear speech zone by means of beamforming or multi-zone reproduction. However, apart from the effort entailed by the large number of loudspeakers that may be used, such a system will never achieve speech privacy at a sufficient level, since the absolute sound pressure level achieved in the masked speech zone is still well above the human hearing threshold. The same holds for active noise cancellation/control approaches, which could potentially cancel not only any reproduced signal but also local human speakers. Moreover, those techniques may involve the use of multiple microphones, and the adaptive filtering that may be used is known to be a challenging task (Stephen J. Elliott and Philip A. Nelson: Active noise control. In: Signal Processing Magazine, IEEE, 10(4): 12-35, 1993). Ultimately, active noise control has only been used successfully for low-frequency sound sources or simple scenarios such as ventilation ducts (Stephen J. Elliott and Philip A. Nelson: Active noise control. In: Signal Processing Magazine, IEEE, 10(4): 12-35, 1993).
A widely used method is to generate a masking sound (masker) that cannot be distinguished (i.e. perceptually separated) from the speech (maskee), such that comprehension of the speech is inhibited in the presence of the masking sound. The term sound masking is often used for such systems, since usually some kind of masker sound is played back in a specified area. One approach is to reproduce air-conditioning-like background noise. This noise overlays the speech and helps to render it unintelligible. While such masking could be achieved by playing back very loud masking sounds, sound masking techniques aim to use an unobtrusive masker at a sound level as low as possible.
Often white noise or pink noise is used, which at low playback levels does not mask speech effectively enough to achieve speech privacy. Previously proposed methods to enhance the masking effect of the induced noise are summarized in the following.
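For illustration only, the two noise types mentioned above can be sketched in Python as follows. This sketch is not taken from any of the cited documents. The function names `white_noise`, `pink_noise` and `lag1_autocorrelation` are hypothetical, and the Voss-McCartney scheme is merely one assumed way of approximating pink noise.

```python
import random


def white_noise(n, rng):
    """Return n samples of uniform white noise in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def pink_noise(n, rng, rows=8):
    """Approximate pink (1/f) noise with the Voss-McCartney scheme.

    Row k is refreshed every 2**k samples, so the slowly changing rows
    contribute low-frequency energy. The output is the mean of all rows.
    """
    values = [rng.uniform(-1.0, 1.0) for _ in range(rows)]
    out = []
    for i in range(n):
        for k in range(rows):
            if i % (1 << k) == 0:
                values[k] = rng.uniform(-1.0, 1.0)
        out.append(sum(values) / rows)
    return out


def lag1_autocorrelation(x):
    """Correlation between neighbouring samples (near 0 for white noise)."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den
```

Under these assumptions, neighbouring samples of the white noise are essentially uncorrelated, whereas the pink noise approximation shows a clearly positive correlation between neighbouring samples, reflecting its stronger low-frequency content.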
In Bill G. Watters, Michael Nacey and Thomas R. Horrall: Process and apparatus for speech privacy improvement through incoherent masking noise sound generation in open-plan office spaces and the like. U.S. Pat. No. 4,059,726, 1977, incorporated by reference herein, the authors cite from the literature that sounds with an unobtrusive character and frequency spectrum, such as wind or wave sounds, are suited to achieving speech privacy. This document also states that a sound is more intrusive if its place of origin can be localized by the listener. A uniform, unlocalizable distribution of the masking noise has been found to be advantageous in some scenarios. Therefore, Bill G. Watters, Michael Nacey and Thomas R. Horrall: Process and apparatus for speech privacy improvement through incoherent masking noise sound generation in open-plan office spaces and the like. U.S. Pat. No. 4,059,726, 1977, incorporated by reference herein, proposes the use of multiple decorrelated noise sources to generate a diffuse, uniform, delocalized sound space.
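The notion of multiple decorrelated noise sources can be illustrated by the following minimal Python sketch. It is an assumption-laden example, not the process of U.S. Pat. No. 4,059,726: the names `decorrelated_sources` and `correlation` are hypothetical, and independent Gaussian noise per channel is just one assumed way of obtaining mutually decorrelated signals.

```python
import random


def decorrelated_sources(num_sources, n, seed=0):
    """Return num_sources noise channels made of independent random draws.

    Each channel would feed one loudspeaker (assumed setup).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_sources)]


def correlation(x, y):
    """Normalized cross-correlation coefficient of two channels at zero lag."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den
```

For sufficiently long channels, `correlation` evaluated on two different channels is close to zero, whereas the same channel fed to several loudspeakers would yield a coefficient of one.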
It has been found to be advantageous if the level of the masking sound varies adaptively in accordance with e.g. the characteristics of the surrounding environment, or the level of the speaker's voice that is to be masked (see e.g. Jeffrey Specht, Daniel Mapes-Riordan, and William DeKruif: Method and apparatus of overlapping and summing speech for an output that disrupts speech. U.S. Pat. No. 7,376,557, 2008, incorporated by reference herein; and Andre L. Esperance and Alex Boudreau: Auto-adjusting sound masking system and method. U.S. Pat. No. 7,460,675, 2008, incorporated by reference herein). The automatic adaptation of the masker's spectral characteristics, in addition to level adaptation, is also known to be beneficial (see e.g. Richard O. Thomalla: Automatic volume and frequency controlled sound masking system. U.S. Pat. No. 4,438,526, 1984, incorporated by reference herein; and Andre L. Esperance and Alex Boudreau: Auto-adjusting sound masking system and method. U.S. Pat. No. 7,460,675, 2008, incorporated by reference herein). Rafik Goubran and Radamis Botros: Adaptive sound masking system and method. United States Patent Application No.: US 2003/0103632, 2003, incorporated by reference herein, proposes in this respect: “An adaptive sound masking system and method portions undesired sound into time-blocks and estimates frequency spectrum and power level, and continuously generates white noise with a matching spectrum and power level to mask the undesired sound.”
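The block-wise level adaptation described above can be sketched as follows. This is a deliberately simplified Python illustration and not the implementation of any cited document: only the power level of each time block is tracked, spectral matching is omitted, and the names `block_rms` and `adaptive_masker` as well as the `gain` parameter are hypothetical.

```python
import math
import random


def block_rms(block):
    """Root-mean-square level of one block of samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))


def adaptive_masker(speech, block_size, rng, gain=1.0):
    """Generate noise whose per-block RMS follows the speech level.

    Assumption: only the power level is matched; the spectrum is not.
    """
    masker = []
    for start in range(0, len(speech), block_size):
        block = speech[start:start + block_size]
        target = gain * block_rms(block)
        noise = [rng.gauss(0.0, 1.0) for _ in block]
        noise_rms = block_rms(noise)
        # Scale the raw noise so that its RMS equals the target level.
        scale = target / noise_rms if noise_rms > 0.0 else 0.0
        masker.extend(scale * v for v in noise)
    return masker
```

In this sketch the masker for a block is derived from that same block, which is only possible offline. A real-time system would have to rely on level estimates from earlier blocks.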
Other applications generate specific noise shapes that are able to mask speech particularly well (Kenneth P. Roy, Thomas J. Johnson, Ronald Fuller and Steve Dove: Architectural sound enhancement with pre-filtered masking sound. U.S. Pat. No. 7,548,854, 2009, incorporated by reference herein), or produce masking noise that “closely matches the characteristics of the source (person speaking)” (Jeffrey Specht, Daniel Mapes-Riordan, and William DeKruif: Method and apparatus of overlapping and summing speech for an output that disrupts speech. U.S. Pat. No. 7,376,557, 2008, incorporated by reference herein). Methods of the latter kind, which specifically aim to render speech unintelligible, use a masking sound that closely resembles speech utterances, either by artificially generating similar sounds or by playing back random concatenations of utterances from a database (see e.g. Jeffrey Specht, Daniel Mapes-Riordan, and William DeKruif: Method and apparatus of overlapping and summing speech for an output that disrupts speech. U.S. Pat. No. 7,376,557, 2008, incorporated by reference herein; and Babak Arvanaghi and Joel Fechter: Method and apparatus for masking speech in a private environment. United States Patent Application No.: US 2013/0185061, 2013, incorporated by reference herein). Jeffrey Specht, Daniel Mapes-Riordan, and William DeKruif: Method and apparatus of overlapping and summing speech for an output that disrupts speech. U.S. Pat. No. 7,376,557, 2008, incorporated by reference herein, uses speech sounds to make the masking sound unobtrusive. However, this may still be distracting, e.g. for a driver who is exposed to that sound.
Other methods that have been proposed to achieve speech privacy include, e.g., the generation of cancelation signals that attempt to eliminate the target speech at an intended location. Nakamura Ikuya and Ogiwara Takashi: Speech privacy protective device. Japanese Patent Applications Nos.: JP 3377220 and JP 5011780, 1991, discloses such a speech privacy protection device for vehicle cabins. The conversation is captured, and a cancelation sound is fed to the position where the conversation should not be heard.
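The principle of a cancelation signal can be illustrated by the following idealized Python sketch. It ignores room acoustics and is not the method of the cited Japanese applications. The names `delayed_inverse` and `residual_rms` are hypothetical, and the delay parameter is an assumed stand-in for a timing mismatch between the speech and the cancelation sound at the target position.

```python
import math


def delayed_inverse(signal, delay):
    """Phase-inverted copy of the signal, delayed by `delay` samples."""
    inverted = [-v for v in signal]
    return [0.0] * delay + inverted[:len(inverted) - delay]


def residual_rms(signal, anti_signal):
    """RMS of the sum of a signal and its cancelation signal."""
    total = [a + b for a, b in zip(signal, anti_signal)]
    return math.sqrt(sum(v * v for v in total) / len(total))


# Example input: a 440 Hz tone sampled at 8 kHz (assumed values).
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
perfect = residual_rms(tone, delayed_inverse(tone, 0))
mismatched = residual_rms(tone, delayed_inverse(tone, 2))
```

With perfect alignment the residual vanishes, whereas a mismatch of only two samples already leaves a substantial residual in this example. This is consistent with cancelation being sensitive to the accuracy of the cancelation signal at the intended location.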
Depending on the application, the masking noise is often reproduced either in a large area around the talker or near the talker (see Jeffrey Specht, Daniel Mapes-Riordan, and William DeKruif: Method and apparatus of overlapping and summing speech for an output that disrupts speech. U.S. Pat. No. 7,376,557, 2008, incorporated by reference herein, and Robert Bailey, Lawrence Heyl, and Stephan Schell: Systems and methods for altering speech during cellular phone use. United States Patent Application No.: US 2009/0171670, 2009, incorporated by reference herein), or the zones are (additionally) separated by physical means (Mai Koike, Yasushi Shimizu, Masato Hata and Takashi Yamakawa: Masker sound generation apparatus and program. United States Patent Application No.: US 2011/0182438 A1, 2011, incorporated by reference herein). Chatter Blocker (see www.chatterblocker.com) is an application offering masking sounds from different categories (sound effects, music chatter voice), which can be played individually or in combination and adjusted in level by the user. It uses the built-in loudspeaker of the playback device (e.g. a tablet) or external loudspeakers connected to the playback device.