Transients, i.e., abrupt sounds such as keyboard typing and door knocking, often arise as interference in everyday applications involving audio signals, including hearing aids, hands-free accessories, mobile phones, and conference-room devices. Typically, a transient consists of an initial peak followed by decaying short-duration oscillations lasting 10 ms to 50 ms. Unfortunately, the widespread assumption of stationary noise poses a major limitation on traditional speech enhancement algorithms. In particular, it makes them inadequate in transient interference environments, as transients are characterized by a sudden burst of sound. Current speech enhancement algorithms fail to deal with transient interferences, since their noise estimation components are not designed to track the rapid variations characterizing such transients.
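By way of illustration only, the following Python sketch shows why a slowly adapting noise estimator cannot follow a short burst. It assumes a first-order recursive average of the per-frame noise power, which is one common form of stationary-noise tracker and is not taken from any of the cited works. The function name, the smoothing factor `alpha`, and the toy power values are all hypothetical.

```python
def recursive_noise_estimate(power, alpha=0.95):
    """First-order recursive average of per-frame power (assumed tracker)."""
    est = [power[0]]
    for p in power[1:]:
        # A value of alpha close to 1 means the estimate changes slowly.
        est.append(alpha * est[-1] + (1 - alpha) * p)
    return est

# Toy input: stationary noise of power 1.0 with a 3-frame burst of power 50.0.
power = [1.0] * 100
for t in (50, 51, 52):
    power[t] = 50.0

est = recursive_noise_estimate(power)

# During the burst the estimate stays far below the true power, and after
# the burst it stays above the stationary level while it decays back.
peak_during_burst = max(est[50:53])
first_after_burst = est[53]
```

Under these assumptions the estimate never approaches the burst power while the burst is present, and it overestimates the noise for many frames afterwards, which is the behavior described above.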
An algorithm has previously been proposed (Talmon, et al., 2011, IEEE Transactions on Audio, Speech and Language Processing, 19(6):1584-1599; Talmon, et al., 2010, Proc. 35th IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP-2010), Dallas, Tex., March 2010) that infers the geometric structure of the transient interference using nonlocal (NL) diffusion filtering (L. P. Yaroslavsky, Digital Picture Processing, Springer-Verlag, Berlin, 1985; Barash, 2002, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:844-847; Buades, et al., 2005, Multiscale Model. Simul., 4:490-530; Mahmoudi and Sapiro, 2005, IEEE Signal Processing Letters 12:839-842; Szlam, et al., 2008, J. Mach. Learn. Res. 9:1711-1739; Singer, et al., 2009, SIAM Journal on Imaging Sciences, 2(1):118-139). The key idea was to exploit the intrinsic transient structure instead of relying on estimates of noise statistics. It was noted that a distinct pattern appears multiple times in the measured signal. Specifically, the locations of the repeating pattern were implicitly identified, and the transient interference was extracted by averaging over all these instances. This work was improved and extended to support a wider variety of transient interferences (Talmon, et al., 2011, IEEE Trans. Audio, Speech Lang. Process. 21(1):132-144; Talmon, et al., 2011, Proc. 36th IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP-2011), Prague, Czech Republic, May 2011). A robust approach to distinguish between transients and speech was employed, based on the observation that speech components are slowly varying with respect to transient interferences, just as pseudo-stationary noise is slowly varying with respect to speech. In addition, a manifold learning approach termed diffusion maps was utilized to compute a robust intrinsic metric for comparison (Coifman and Lafon, 2006, Appl. Comput. Harmon. Anal., 21:5-30). The diffusion-maps metric enabled the clustering of different transient interference types, and when incorporated into the NL filter, it provided a better affinity metric for averaging over transient instances.
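The averaging over repeated transient instances described above can be sketched, under simplifying assumptions, as a kernel-weighted average over frames. The Python example below is a minimal illustration and not the cited authors' implementation: it uses a plain Gaussian kernel on Euclidean distance rather than a diffusion-maps metric, and the name `nl_filter`, the kernel scale `eps`, the four-sample `pattern`, and the random perturbation standing in for speech are all hypothetical.

```python
import math
import random

def nl_filter(frames, eps):
    """Replace each frame by an affinity-weighted average of all frames."""
    out = []
    for xi in frames:
        # Gaussian affinity: frames that resemble xi receive large weights.
        weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / eps)
                   for xj in frames]
        total = sum(weights)
        out.append([sum(w * xj[k] for w, xj in zip(weights, frames)) / total
                    for k in range(len(xi))])
    return out

random.seed(0)
pattern = [1.0, 0.6, 0.3, 0.1]   # hypothetical transient: peak, then decay
frames, transient_idx = [], []
for t in range(40):
    # Small random perturbation, a crude stand-in for non-transient content.
    noise = [random.gauss(0.0, 0.1) for _ in pattern]
    if t % 5 == 0:               # the pattern recurs in every fifth frame
        frames.append([p + n for p, n in zip(pattern, noise)])
        transient_idx.append(t)
    else:
        frames.append(noise)

filtered = nl_filter(frames, eps=0.2)

def mse(est):
    """Squared error to the clean pattern, averaged over transient frames."""
    return sum((est[t][k] - pattern[k]) ** 2
               for t in transient_idx
               for k in range(len(pattern))) / len(transient_idx)

err_before, err_after = mse(frames), mse(filtered)
```

In this toy setting the locations of the repeated pattern are never given to the filter. They are identified implicitly through the affinity weights, so the perturbations average out across instances while the common pattern is retained.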
Recently, several supervised speech enhancement algorithms, which rely on prior knowledge of the typical interference patterns, have been proposed (Smaragdis, 2007, IEEE Trans. on Audio, Speech and Language Processing, 14(1):1-12; Wilson, et al., 2008, Proc. 33rd IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP-2008), Las Vegas, Nev., 14:4029-4032; Mohammadiha, et al., 2011, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics pp. 45-48). In these algorithms, nonnegative matrix factorization (NMF) is employed to compute a basis for the interferences, which is then utilized to enhance the speech and suppress the noise. However, these algorithms suffer from several limitations. They require training recordings of both the interference and the speech (Wilson, et al., 2008, Proc. 33rd IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP-2008), Las Vegas, Nev., 14:4029-4032), which makes the algorithms speaker-dependent. In addition, NMF must be applied to every new measurement, and its computational burden is high. Finally, when applied to enhance speech and suppress noise (Mohammadiha, et al., 2011, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics pp. 45-48), temporal smoothing is applied, which makes the algorithm inadequate for transient interferences.
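For illustration, the factorization step on which such supervised methods rely can be sketched as follows. The Python example assumes the standard multiplicative update rules for the squared Euclidean cost; the cited works may use other cost functions or update schemes. The helper names and the small matrix `V`, which stands in for a magnitude spectrogram, are hypothetical.

```python
import random

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_err(v, w, h):
    """Squared Frobenius error between v and the product w h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, rank, iters=200, seed=0, tiny=1e-12):
    """Factor v (n x m, nonnegative) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # Multiplicative update of the activations h (stays nonnegative).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + tiny) for j in range(m)]
             for i in range(rank)]
        # Multiplicative update of the basis w (stays nonnegative).
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + tiny) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy "spectrogram": two spectral patterns (columns) with sparse activations.
W_true = [[1.0, 0.0], [0.8, 0.1], [0.3, 0.5], [0.1, 0.9], [0.0, 1.0]]
H_true = [[1.0, 0.0, 0.0, 1.0, 0.0, 0.5], [0.0, 1.0, 0.5, 0.0, 1.0, 0.0]]
V = matmul(W_true, H_true)

err_init = frob_err(V, *nmf(V, 2, iters=0))    # error of the random start
W, H = nmf(V, 2, iters=200)
err_final = frob_err(V, W, H)
```

Each iteration multiplies several matrices whose sizes grow with the number of frequency bins and time frames, which gives a sense of the computational burden noted above when the factorization must be repeated for every new measurement.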
Additionally, prior art systems for reducing or suppressing transients in an audio signal are described in the patent literature. For example, EP 1775719 describes a voice enhancement system for suppressing transient road noise; U.S. Pat. No. 7,869,994 describes a transient noise removal system using wavelets; and US 2012/0076315 describes a system for repetitive transient noise removal. Further, some patent literature discloses systems or methods for removing or reducing noise produced by keyboards, or other user-operated devices, from an audio signal. For example, EP 2294697 describes a method for reducing keyboard noise in conferencing equipment (also published as U.S. Pat. No. 8,295,502), and EP 2494550 describes a method for suppressing noise in an audio signal created by a user operating a computer. However, the methods in the above disclosures all define a model of potential transients or noise using information external to the noise-containing audio signal. For example, the prior art methods may build the model from previously analyzed signals that provide general characterizations of potential transients, or they may rely on side information, such as the timing of keystrokes or other user activity, to identify the noise.
Thus, there is a need in the art for a system and method of transient interference suppression for providing accurate and efficient speech enhancement, particularly when real-time online processing is desired. The present invention satisfies this need.