Detecting the presence of speech in an audio recording is useful for a variety of applications, such as recording systems, Voice over Internet Protocol (VoIP) applications, speech-to-text applications and others. For example, a speech detection mechanism may be used in recording systems to avoid recording and archiving silent audio streams and to alert users if speech is not present in a recording. In VoIP applications, detection of human speech may help avoid unnecessary processing and transmission of silent packets. Speech-to-text algorithms are usually very processing-intensive, so when the speech detector determines that there is no speech in a recording, transcription may be skipped, which may save a considerable amount of unnecessary processing.
Detecting the presence of speech in an audio recording is particularly important for a recording system that must provide proof that all conversations are recorded, as required by compliance regulations. On trading floors, recording functionality has the highest priority because trading is not allowed when the recording functionality has failed or has been compromised. Absent the ability to detect the presence of speech, systems may unknowingly record noise or silence, and therefore violate compliance regulations without informing the user.
Current speech detection algorithms are either not accurate or require complex analysis of the audio signal. Speech detection algorithms that require relatively low computational power are not very flexible or fault-tolerant. These algorithms may be sensitive to the audio quality: changes in noise level, bandwidth, DC offset (e.g., changes in the mean value of the audio signal), dynamic range, clipping and distortion may affect speech detection results. These algorithms may also provide only a Boolean output, indicating whether or not speech is present, without giving any indication of the amount of speech in the audio stream. On the other hand, the more accurate and robust algorithms are computationally intensive since they require complex frequency analysis, phonetic comparison, or other computationally intensive calculations.
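By way of illustration only, the sensitivity to DC offset noted above may be sketched as follows. The sketch assumes a hypothetical low-cost detector that compares the mean squared amplitude of a frame against a fixed threshold; the function names, the frame length of 160 samples and the threshold value are assumptions made for this example and do not describe any particular existing algorithm.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in frame) / len(frame)


def naive_is_speech(frame, threshold=0.01):
    """Hypothetical Boolean detector: True if frame energy exceeds a fixed threshold."""
    return frame_energy(frame) > threshold


def remove_dc(frame):
    """Subtract the frame mean (the DC offset) from every sample."""
    mean = sum(frame) / len(frame)
    return [s - mean for s in frame]


# A silent frame, and the same frame shifted by a constant DC offset of 0.2.
silent = [0.0] * 160
offset_silent = [s + 0.2 for s in silent]

print(naive_is_speech(silent))                    # silence, no offset
print(naive_is_speech(offset_silent))             # silence with DC offset
print(naive_is_speech(remove_dc(offset_silent)))  # offset removed first
```

In this sketch, the silent frame with a DC offset exceeds the energy threshold and is reported as speech, whereas the same frame is reported as silence once its mean value is subtracted. The output is also purely Boolean and gives no indication of the amount of speech in the frame.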
Thus, current accurate speech detection algorithms are typically very computationally intensive, which may limit their wide implementation in systems that have limited computing power. For example, recording systems may be required to analyze and record thousands of channels concurrently. As a result, either the detection mechanism cannot be executed in real time alongside the audio stream recording, or, when in use, it significantly reduces the number of possible concurrent recordings.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.