Spoken language is the most natural mode of communication for mankind. The dream of voice interaction between man and machine appeared very soon after the automation of man-machine communication.
With this aim in view, research into automatic speech recognition (voice recognition) systems began as early as the 1950s, and many technical applications now use such systems, for example direct voice-to-text dictation and interactive telephone voice services. Since the outset, the technical problems associated with voice recognition have continually evolved, in particular with the expansion of telephony.
A voice recognition system conventionally comprises a speech detection module and a speech recognition module. The function of the detection module is to detect periods of speech in an input audio signal, so that the recognition module does not attempt to recognize speech in periods of the input signal corresponding to silence. The speech detection module therefore improves performance and also reduces the cost of the voice recognition system.
The operation of a module for detecting speech in an audio signal, usually implemented in the form of software, is conventionally represented by a finite state machine, also known as an automaton.
A change of state of a detection module is typically conditioned by a criterion that is based on obtaining and processing information relating to the energy of the audio signal. A speech detection module of this kind is described in the doctoral thesis “Amélioration des performances des serveurs vocaux interactifs” [“Improving performance of interactive voice servers”] by L. Mauuary, Université de Rennes 1, 1994.
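By way of illustration only, the general principle of an energy-conditioned automaton can be sketched as follows. This is a minimal two-state sketch and is not the detection module described in the thesis cited above; the names (State, frame_energy_db, detect_speech), the decibel threshold and the frame counts are hypothetical values chosen for the example.

```python
import math
from enum import Enum


class State(Enum):
    SILENCE = 0
    SPEECH = 1


def frame_energy_db(frame):
    """Mean-square energy of one frame in decibels (floored to avoid log(0))."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def detect_speech(frames, threshold_db=-30.0, min_speech=2, min_silence=3):
    """Return one State per frame.

    Assumed illustrative rule: the automaton leaves SILENCE after
    `min_speech` consecutive frames above the energy threshold, and
    leaves SPEECH after `min_silence` consecutive frames below it.
    """
    state = State.SILENCE
    run = 0  # consecutive frames that argue for leaving the current state
    states = []
    for frame in frames:
        loud = frame_energy_db(frame) > threshold_db
        if state is State.SILENCE:
            run = run + 1 if loud else 0
            if run >= min_speech:
                state, run = State.SPEECH, 0
        else:
            run = run + 1 if not loud else 0
            if run >= min_silence:
                state, run = State.SILENCE, 0
        states.append(state)
    return states


# Usage: synthetic frames of 160 samples, quiet (about -60 dB) or loud (about -6 dB).
quiet = [0.001] * 160
loud = [0.5] * 160
signal = [quiet, loud, quiet, loud, loud, loud, quiet, quiet, quiet, quiet]
labels = detect_speech(signal)
```

In this sketch the isolated loud frame in second position does not trigger a transition, because a single frame above the threshold is not enough to leave the silence state. A criterion of this kind depends directly on the chosen threshold and frame counts.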
In the particular context of voice recognition for telephone applications, attention is focused at present on recognizing a large number of isolated words (for a voice directory, for example), recognizing continuous speech (i.e. phrases of everyday language), and signal transmission/reception in a noisy environment, for example in mobile telephony.
However, in this context, the performance of current detection systems remains highly inadequate, particularly when the background noise is of short duration, in which case speech detection errors can lead to voice recognition errors that are very disturbing for the user. Also, the settings of existing detection systems are highly sensitive to the conditions and the nature of the telephone call (fixed telephony, mobile telephony, etc.).