Voice communication devices such as cell phones, wireless phones, Bluetooth headsets, hands-free devices, automatic speech recognition (ASR) devices, and Music on Hold (MoH) devices have become ubiquitous; they show up in almost every environment. These systems and devices and their associated communication methods are referred to by a variety of names, such as, but not limited to, cellular telephones, cell phones, mobile phones, wireless telephones in the home and the office, and devices such as Personal Data Assistants (PDAs) that include a wireless or cellular telephone communication capability. They are used at home, in the office, inside a car or a train, at the airport, at the beach, in restaurants and bars, on the street, and in almost any other venue. As might be expected, these diverse environments give rise to different kinds of signals, which include, but are not limited to, speech only, speech with background noise, music only, speech with background music, as well as other combinations of sounds.
A primary objective is to provide a means to efficiently retrieve information from the global network of digital media, which includes mobile phones, the Internet, television, radio, and other systems.
As the communication network grows, consumers will demand specific multimedia material stored in the digital media servers. Data mining tools may be used to browse the servers and download specific speech or music, hence the desire to classify speech and music.
Humans can easily discriminate between speech and music by listening to a short segment of a signal. A real-time speech/music discriminator proposed by Saunders [1] is used in radio receivers for the automatic monitoring of the audio content of FM radio channels. In conference bridge and Music on Hold applications, it is necessary to disable noise reduction while music is playing. Another area of application is ASR, where it is important to disable the speech recognizer during non-speech and music segments; this can save power on mobile devices.
Speech/music classifiers have been studied extensively, and many solutions have been proposed for cell phone, Bluetooth headset, ASR, MoH, and conference bridge applications.
Depending upon the particular application, speech/music classification can be done offline or in real time. For real-time applications, such as Music on Hold and conference bridge applications, the method must have low latency and low memory requirements. For offline applications, the constraints on processing speed and memory requirements can be relaxed.
Current speech/music classifier solutions use data from multiple features of an audio signal as input to a classifier. Some data is extracted from individual frames, while other data is extracted from the variation of a particular feature over several frames. A classifier is effective only if speech and music can be detected reliably, consistently, and with low error rates.
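The distinction between per-frame data and data derived from the variation of a feature over several frames can be illustrated with a minimal sketch. The feature chosen here (short-time energy), the frame length of 160 samples, and all function names are assumptions made for illustration only; the text does not name the specific features used by any particular classifier.

```python
import math

def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames, dropping any remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Per-frame feature: mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def extract_features(samples, frame_len=160):
    """Return the per-frame energies and their variance across frames.

    frame_len=160 is an illustrative choice (20 ms at 8 kHz), not a value
    taken from the text.
    """
    frames = frame_signal(samples, frame_len)
    energies = [short_time_energy(f) for f in frames]
    return {
        "frame_energy": energies,              # one value per frame
        "energy_variance": variance(energies), # variation over several frames
    }
```

A steady tone yields nearly identical energy in every frame and therefore a variance close to zero, whereas a signal that alternates between sound and silence yields a much larger variance. Both kinds of values would then be passed to a classifier.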
Several different kinds of speech/music classifiers are known in the related art, including classifiers that use the nearest-neighbor approach, for example with a k-d tree spatial partitioning technique.
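As a rough illustration of the nearest-neighbor approach, the following sketch labels a feature vector by majority vote among its k closest training vectors. It uses a brute-force search rather than a k-d tree, which would only change how the neighbors are found, not the result. The function name, the value of k, and the two-dimensional feature vectors are hypothetical and are not taken from any cited classifier.

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    training_set is a list of (feature_vector, label) pairs. Brute-force
    search is used here; a k-d tree would speed up the neighbor lookup.
    """
    by_distance = sorted(training_set,
                         key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training data with made-up two-dimensional feature vectors.
training_set = [
    ((0.10, 0.90), "speech"), ((0.20, 0.80), "speech"), ((0.15, 0.85), "speech"),
    ((0.80, 0.10), "music"),  ((0.90, 0.20), "music"),  ((0.85, 0.15), "music"),
]
```

Calling `knn_classify((0.12, 0.88), training_set)` returns the label of the cluster the query falls into, which is "speech" for this toy data.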
U.S. Pat. No. 2,761,897 by Jones discloses a discriminator system in which rapid drops in the level of an audio signal are measured. If the number of such drops per unit frame crosses a particular threshold, the audio signal is labeled as speech. However, the system relies on a hardware approach to discriminate between speech and music.
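The decision rule described for the Jones system can be approximated in software as follows. This is only a sketch of the counting logic under stated assumptions: the original is a hardware circuit, and the definition of a "rapid drop" used here (the level falling below a fixed fraction of the previous level), the parameter names, and the "music" fallback label are all hypothetical.

```python
def count_level_drops(levels, drop_ratio=0.5):
    """Count positions where the level falls below drop_ratio * previous level.

    `levels` is a sequence of signal-level measurements; drop_ratio=0.5 is an
    assumed definition of a rapid drop, not a value from the patent.
    """
    drops = 0
    for prev, cur in zip(levels, levels[1:]):
        if prev > 0 and cur < drop_ratio * prev:
            drops += 1
    return drops

def label_by_level_drops(levels, drop_threshold, drop_ratio=0.5):
    """Label as speech when the drop count exceeds drop_threshold.

    Returning "music" otherwise is an assumption for this two-class sketch.
    """
    if count_level_drops(levels, drop_ratio) > drop_threshold:
        return "speech"
    return "music"
```

For example, the level sequence `[1.0, 0.2, 1.0, 0.1, 0.9, 0.3]` contains three rapid drops under this definition, while `[1.0, 0.9, 1.0, 0.95]` contains none.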
U.S. Pat. No. 4,542,525 by Hopf discloses a logic circuit that uses the number of pauses and the time span of the simultaneous or alternating appearance of signal pauses derived from two different pulse sequences. The Hopf invention also employs a hardware solution.
Software solutions such as U.S. Patent Application Publication No. 2005/0091066 A1 by Singhal use a zero-crossing counter to classify speech and music. If the number of zero crossings exceeds a predetermined threshold value, the incoming signal is considered music. However, this technique is not suitable for windy conditions, which have high zero-crossing rates.
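The zero-crossing rule can be sketched in a few lines. The function names and the way the threshold is passed in are assumptions for illustration; this is not the implementation disclosed by Singhal.

```python
def count_zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings

def classify_by_zero_crossings(frame, threshold):
    """Label the frame as music when its zero-crossing count exceeds threshold.

    The threshold value is application-specific and is an assumed parameter.
    """
    if count_zero_crossings(frame) > threshold:
        return "music"
    return "speech"
```

Because the rule looks only at the crossing count, any input whose count exceeds the threshold is labeled music regardless of its actual content, which is the weakness noted for windy conditions.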
It is an objective of the present invention to provide methods and devices that overcome the disadvantages of prior schemes. Hence there is a need in the art for a speech/music discrimination method that is robust, suitable for mobile use, and computationally inexpensive, and that can be integrated or manufactured with new or existing technologies.