When seeking to determine the actual start and end of speech, various solutions can be envisioned:
(1) It is possible to work with the instantaneous amplitude by reference to an experimentally determined threshold, and to confirm the speech detection by a detection of voicing (see the article "Speech-noise discrimination and its applications" by V. Petit/F. Dumont, which appeared in the THOMSON-CSF Technical Magazine, Vol. 12, No. 4, Dec. 1980).
(2) It is also possible to work with the energy of the total signal over a time slice of duration T, by thresholding this energy, still experimentally (with the aid of local histograms, for example), and then to confirm the detection with the aid of a voicing detection or of the calculation of the minimum energy of a vowel. The use of the minimum energy of a vowel is a technique described in the report "AMADEUS Version 1.0" by J. L. GAUVAIN of the LIMSI laboratory of the CNRS.
(3) The preceding systems allow detection of voicing, but not of the actual start and end of speech, that is to say the detection of unvoiced fricative sounds (/F/, /S/, /CH/) and unvoiced plosive sounds (/P/, /T/, /Q/). It is therefore necessary to supplement them by an algorithm for detecting these sounds. A first technique may consist in the use of local histograms, as recommended by "Problem of detection of the boundaries of words in the presence of additive noise" by P. WACRENIER, a PhD thesis from the PARIS-SUD university, Centre d'Orsay.
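As an illustration only, the energy-thresholding idea of approach (2) can be sketched as follows. This is a minimal sketch, not the method of any of the cited works: the function names, the use of non-overlapping frames, and the fixed threshold value are assumptions made for the example, and no voicing confirmation is performed.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame of frame_len samples."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_endpoints(signal, frame_len, threshold):
    """Return (start, end) sample indices spanning the frames whose energy
    exceeds the threshold, or None if no frame does.

    The threshold is assumed to have been chosen experimentally.
    """
    energies = frame_energies(signal, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Hypothetical toy signal: low-level background, a loud segment, background again.
toy = [0.01] * 40 + [1.0] * 30 + [0.01] * 30
endpoints = detect_endpoints(toy, frame_len=10, threshold=0.1)
```

A detector of this kind responds only to high-energy segments, which is why low-energy unvoiced fricatives and plosives at word boundaries tend to be missed, as noted in (3).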
Other techniques close to the preceding ones and relatively close to that set out here have been presented in the article "A Study of Endpoint Detection Algorithms in Adverse Conditions: Incidence on a DTW and HMM Recognizer" by J. C. JUNQUA/B. REAVES/B. MAK, during the EUROSPEECH Congress, 1991.
In all these approaches, a large part is done heuristically, and few powerful theoretical tools are used.
Work on noise removal from speech, similar to that presented here, is much more abundant. Mention will be made in particular of the book "Speech Enhancement" by J. S. LIM in the Prentice-Hall Signal Processing Series, and of the publications "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" by S. F. BOLL, which appeared in the magazine IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, April 1979, and "Noise Reduction for Speech Enhancement in Cars: Non-Linear Spectral Subtraction/Kalman Filtering" by P. LOCKWOOD, C. BAILLARGEAT, J. M. GILLOT, J. BOUDY, G. FAUCON, which appeared in the EUROSPEECH 91 proceedings. Only techniques for noise removal in the spectral domain are considered, and the rest of the text refers to this simply as "spectral" noise removal.
In all this work, the close relationship between detection and noise removal is never really made explicit, except in the article "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", which proposes an empirical solution to this problem.
However, removing noise from speech when two recording channels are not available clearly requires frames of "pure" noise, uncontaminated by speech. This in turn makes it necessary to define a detection tool capable of distinguishing between noise and noise+speech.
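The dependence of spectral noise removal on noise-only frames can be illustrated with a minimal sketch of basic magnitude spectral subtraction. This is a textbook simplification written for this example, not the procedure of any cited work: the helper names are hypothetical, a naive DFT is used to keep the code self-contained, and the frames passed to noise_magnitude are assumed to have been labelled as noise-only by some detector.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each time sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def noise_magnitude(noise_frames):
    """Average magnitude spectrum over frames assumed to contain noise only."""
    spectra = [[abs(c) for c in dft(frame)] for frame in noise_frames]
    bins = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(bins)]


def spectral_subtract(frame, noise_mag):
    """Subtract the noise magnitude in each bin, floor the result at zero,
    and resynthesize using the phase of the noisy frame."""
    cleaned = []
    for c, nm in zip(dft(frame), noise_mag):
        mag = max(abs(c) - nm, 0.0)
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return idft(cleaned)


# Hypothetical toy data: the "noise" occupies a single frequency bin and
# the "speech" is a pure tone in other bins.
n = 16
noise = [0.5 * (-1) ** t for t in range(n)]
tone = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [a + b for a, b in zip(tone, noise)]

noise_mag = noise_magnitude([noise, noise])  # frames labelled noise-only
denoised = spectral_subtract(noisy, noise_mag)
```

If a frame containing speech were mistakenly included in the noise estimate, part of the speech spectrum would be subtracted along with the noise, which is the coupling between detection and noise removal described above.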