This invention relates to a telephone employing circuitry for echo cancellation and noise reduction and, in particular, to such circuitry that includes a music detector.
As used herein, “telephone” is a generic term for a communication device that utilizes, directly or indirectly, a dial tone from a licensed service provider. As such, “telephone” includes desk telephones (see FIG. 1), cordless telephones (see FIG. 2), speakerphones (see FIG. 3), hands-free kits (see FIG. 4), and cellular telephones (see FIG. 5), among others. For the sake of simplicity, the invention is described in the context of telephones but has broader utility, e.g., in communication devices that do not utilize a dial tone, such as radio frequency transceivers. Although described in the context of telephones, the invention has broader application in the analysis of audio signals.
While not universally followed, the prior art generally associates noise “suppression” with subtracting a signal from the signal of interest and associates noise “reduction” with attenuation or reduced gain. Noise reduction circuitry is generally part of a non-linear processor.
There are many sources of noise in a telephone system. Some noise is acoustic in origin while other noise is electronic, from the telephone network, for example. As used herein, “noise” refers to any unwanted sound, whether the unwanted sound is periodic, purely random, or somewhere in-between. As such, noise includes background music, voices of people other than the desired speaker, tire noise, wind noise, and so on. As thus broadly defined, noise could include an echo of the speaker's voice. However, echo cancellation is treated separately in a telephone.
There are two kinds of echoes in telephones: an acoustic echo from the path between an earphone or a speaker and a microphone, and a line echo generated in the switched network for routing a call between stations. Echo cancellation involves subtracting a simulated echo from an input signal. The simulated echo is created by filtering an output signal with an adaptive filter. The adaptive filter is programmed to represent either the near-end path (speaker to microphone) or the far-end path (line out to line in) to create the simulated echo.
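By way of illustration only, the subtraction of a simulated echo can be sketched as follows. The text above does not name a particular adaptation rule; the normalized least-mean-squares (NLMS) update used here is an assumption chosen for brevity, and the function name, tap count, step size, and echo path are hypothetical.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract a simulated echo of far_end from mic.

    NLMS is an assumed adaptation rule; taps, mu, and eps are
    illustrative values, not taken from the description.
    Returns the echo-cancelled signal and the final coefficients.
    """
    w = [0.0] * taps   # adaptive filter coefficients (echo path estimate)
    x = [0.0] * taps   # most recent output-signal samples, newest first
    residual = []
    for ref, d in zip(far_end, mic):
        x = [ref] + x[:-1]
        simulated_echo = sum(wi * xi for wi, xi in zip(w, x))
        e = d - simulated_echo              # input minus simulated echo
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w

if __name__ == "__main__":
    random.seed(0)
    far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
    path = [0.6, 0.3, -0.2]  # hypothetical echo path
    mic = [sum(path[k] * far[n - k] for k in range(len(path)) if n - k >= 0)
           for n in range(len(far))]
    residual, w = nlms_echo_cancel(far, mic)
    print([round(c, 3) for c in w[:3]])
```

In this sketch the microphone signal contains only echo, so the residual decays toward zero as the coefficients approach the hypothetical path. With a near-end talker or background sound present, adaptation would ordinarily have to be controlled, which is outside the scope of the sketch.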
Noise is subjective, somewhat like a weed. It depends upon what one wants or does not want. In this description, noise is unwanted sound from the perspective of a person trying to converse on a telephone. For example, in a vehicle, noise includes road noise, music from a radio, background conversation, and the sound from the speaker element in a hands-free kit. The desired signal is usually only the voice of the person speaking.
If there is a significant amount of background noise, it is usually desirable to reduce the background noise to improve intelligibility. On the other hand, a person may be at a musical concert and it may be desirable to allow the music to pass through the telephone network unaffected. To satisfy these contradictory conditions, one needs a special algorithm to distinguish between noise and music.
It is known in the art to distinguish music from speech; see, for example, Carey, Michael J. et al., Comparison of Features for Speech, Music Discrimination, IEEE publication 0-7803-5041-3/99 © 1999. It is also known to distinguish music, speech, and noise; see, for example, G. Lu & T. Hankinson, “A Technique towards Automatic Audio Classification and Retrieval,” 1998 Fourth Signal International Conference on Signal Processing Proceedings (ISCP-98), Beijing, China 1998. Spectral flatness measure (SFM) is known in the art; see, for example, U.S. Pat. No. 5,648,921 (Bayya et al.) and U.S. Pat. No. 6,477,489 (Lockwood et al.). As used herein, SFM is defined differently from these two patents, which define SFM differently from each other. The differences are in form, not substance.
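For orientation only, one commonly used textbook form of spectral flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum of a frame. The sketch below implements that textbook form. It is not asserted to be the definition used herein or in either of the patents cited above. The function names, the direct DFT, and the numerical floor are assumptions made for the example.

```python
import cmath
import math

def power_spectrum(frame):
    """Power in bins 0..N/2 of a real frame, using a direct DFT for clarity."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def spectral_flatness(frame, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Textbook form only; `floor` is an assumed guard against log(0).
    """
    p = [max(v, floor) for v in power_spectrum(frame)]
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic

if __name__ == "__main__":
    n = 64
    tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
    impulse = [1.0] + [0.0] * (n - 1)   # flat spectrum
    print(spectral_flatness(tone), spectral_flatness(impulse))
```

Under this form, a frame with a flat spectrum yields a value near one and a single tone yields a value near zero.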
One of the main challenges in distinguishing music from noise is that the envelopes of both types of signal are relatively constant. Most known voice activity detectors measure the energy content of the envelope, which means that a voice activity detector will detect music as noise and will cause the noise reduction circuitry to reduce the background music, distorting the signal. It will also cause the non-linear processor to suppress the residual echo and then insert comfort noise. This insertion of comfort noise can annoy a listener because the music will become intermittent. A similar effect can occur in echo canceling systems.
Music is generally characterized by a finite amount of energy at all times, some music having a relatively constant envelope and some not. Most of the acoustic energy in music is below 8 kHz, although rock and hard rock are almost like white noise. The spectral content of music changes frequently, depending upon the rhythm of the music. Based on these characteristics, certain features are selected and several different algorithms are being investigated in the art for classifying sound. Examples are in the literature identified above.
Possible methods for classifying audio signals include envelope detection, linear prediction analysis, zero crossing detection, Bark band spectral analysis, auto-correlation, silence ratio, tracking spectral peaks, and differential spectrum (changes in spectral content from instant to instant). Silence ratio is really an amplitude comparison. A signal is divided into time segments. A segment having an amplitude less than a threshold is silence. The ratio is the number of silent segments divided by the total number of segments. Speech signals have a higher silence ratio than music. Noise and non-speech are problems, as is picking the correct time interval.
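The silence ratio described above can be sketched as follows. The description does not say how the amplitude of a segment is measured; taking the peak absolute sample value is an assumption made here, and the function name, segment length, and threshold are hypothetical.

```python
def silence_ratio(signal, segment_len, threshold):
    """Number of silent segments divided by the total number of segments.

    A segment counts as silent when its peak absolute amplitude is below
    `threshold` (the peak measure is an assumption of this sketch).
    Trailing samples that do not fill a whole segment are ignored.
    """
    segments = [signal[i:i + segment_len]
                for i in range(0, len(signal) - segment_len + 1, segment_len)]
    if not segments:
        return 0.0
    silent = sum(1 for seg in segments
                 if max(abs(s) for s in seg) < threshold)
    return silent / len(segments)

if __name__ == "__main__":
    # Synthetic stand-ins: bursts separated by pauses versus a steady signal.
    speech_like = ([0.5] * 100 + [0.0] * 100) * 2
    music_like = [0.5] * 400
    print(silence_ratio(speech_like, 50, 0.1))  # half the segments are silent
    print(silence_ratio(music_like, 50, 0.1))   # no silent segments
```

The result depends strongly on the segment length and the threshold, which is consistent with the difficulty noted above in picking the correct time interval.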
Many of these methods are not robust enough to distinguish different genres of music unambiguously from noise. Some of the methods are not meant to be done in real time because of large computational requirements; e.g., requiring a wide data bus, large amounts of storage, or long execution time for analysis. Hence, it is desirable to provide a method that can unambiguously distinguish mainstream music genres from noise with small computational requirements.
In view of the foregoing, it is therefore an object of the invention to provide a method for unambiguously distinguishing mainstream music genres from noise.
Another object of the invention is to provide a method for unambiguously distinguishing mainstream music genres from noise while requiring little computational power.
A further object of the invention is to provide a method for unambiguously distinguishing mainstream music genres from noise in real time.