In the field of environment surveillance and analysis, the conventional systems known from the prior art are mainly based on image and video technologies.
In the field of surveillance using audio, the technical problems involved are notably as follows:
1) how can you detect specific and/or abnormal sound events?
2) how can you have solutions that are robust to the background noise and its variabilities, that is to say solutions that are reliable and that do not generate alarm signals continually or in an untimely manner?
3) how do you classify the various events recorded?
In the field of sound event analysis, the prior art distinguishes two processes. The first is a detection process. The second is a process for classifying the detected events.
The conventional detection methods for sound events are generally based on the extraction of parameters characteristic of the signals that are to be detected. These parameters are usually time-related, frequency-related or time/frequency-related.
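By way of illustration only, the kinds of parameters mentioned above can be sketched as follows. This is a minimal sketch and not a method taken from the prior art cited here: the choice of short-time energy, zero-crossing rate and spectral centroid, the function names, and the frame length and hop values are all assumptions made for the example.

```python
import cmath
import math


def short_time_energy(frame):
    """Time-related parameter: mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Time-related parameter: fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Frequency-related parameter: magnitude-weighted mean frequency in Hz."""
    n = len(frame)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):  # naive DFT over the non-negative frequency bins
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mag = abs(coeff)
        weighted += (k * sample_rate / n) * mag
        total += mag
    return weighted / total if total > 0 else 0.0


def frame_features(signal, sample_rate, frame_len=256, hop=128):
    """Time/frequency view: one parameter vector per short overlapping frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append((
            short_time_energy(frame),
            zero_crossing_rate(frame),
            spectral_centroid(frame, sample_rate),
        ))
    return feats
```

A detector of this kind would then typically compare such per-frame parameters against thresholds or a reference model; that decision step is deliberately left out of the sketch.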
The classification methods known from the prior art are usually based on so-called supervised approaches, in which a model for each event to be classified is obtained from segmented and labeled learning data. These solutions are based, for example, on classification algorithms known by the abbreviations HMM for Hidden Markov Model, GMM for Gaussian Mixture Model, SVM for Support Vector Machine, or NN for Neural Network. These models are known to those skilled in the art and will not be detailed. The performance levels of these classification systems depend on how close the real test data are to the learning data.
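The supervised principle described above, one model per event class learned from labeled data, can be illustrated with a deliberately simplified sketch. It assumes a single diagonal Gaussian per class, which amounts to a one-component GMM and is far simpler than the HMM, GMM, SVM or NN systems named above; the function names and the variance floor are hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def fit_class_models(features, labels):
    """Fit one diagonal Gaussian (per-dimension mean and variance) per labeled class."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    models = {}
    for label, vecs in grouped.items():
        dims = len(vecs[0])
        means = [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]
        # Assumed floor of 1e-6 keeps every variance strictly positive.
        variances = [
            max(sum((v[d] - means[d]) ** 2 for v in vecs) / len(vecs), 1e-6)
            for d in range(dims)
        ]
        models[label] = (means, variances)
    return models


def log_likelihood(vec, model):
    """Log-density of vec under a diagonal Gaussian model."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(vec, means, variances)
    )


def classify(vec, models):
    """Return the label whose model gives vec the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(vec, models[label]))
```

Note that `classify` always returns one of the trained labels: an event of a kind absent from the labeled data is necessarily assigned to some known class.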
The major drawbacks of the supervised approach stem from the need to specify the abnormal events first, and to collect a sufficient and statistically representative quantity of these events. Specifying the events is not always possible, nor is collecting a sufficient number of occurrences to enrich a database. Also, a new supervised learning phase is necessary for each new configuration. The supervision task requires human intervention (manual or semi-automatic segmentation, labeling, etc.). The flexibility of these solutions is therefore limited in terms of usage, and adaptation to new environments is difficult to implement. Finally, the learned event models incorporate the background noise and its variability as present in the learning data, so they may in certain cases not be robust. These approaches can be regarded as non-automated approaches, that is to say approaches that require human intervention.
Despite the results that these systems give, the solutions from the prior art do not make it possible to correctly process audio events that are not predefined. Their robustness to the environment and its variability is limited.