Acoustic Scene Classification (ASC) is the term given to technology that aims to recognise the type of an environment solely from sounds recorded at that place. Those sounds may be sounds occurring in the environment and/or sounds that the environment itself produces. ASC can be considered the task of associating an audio stream with a semantic label that identifies a particular environment. Examples of commonly used labels include car, office, street, home and restaurant.
The ASC process is typically divided into a training phase and a classification phase. In the training phase, a feature vector is derived from each audio instance in a training set, each instance representing a specific acoustic scene. These feature vectors are used to train a statistical model that summarises the properties of soundscapes belonging to the same category (as shown in FIG. 1). The classification phase then involves extracting the same features from an unknown audio sample. Based on these two inputs, the statistical model and the feature vector, the unknown audio sample is classified into the category that matches it best (as shown in FIG. 2).
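The two phases described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any particular ASC system: it assumes that each audio instance has already been reduced to a fixed-length feature vector, and it uses one diagonal Gaussian per category as a stand-in for the statistical model. The function names `train`, `log_likelihood` and `classify`, and the `var_floor` parameter, are hypothetical names introduced for this example.

```python
import math
from collections import defaultdict


def train(features, labels, var_floor=1e-6):
    """Training phase: fit one diagonal Gaussian per scene label.

    features: list of equal-length feature vectors (lists of floats).
    labels:   list of scene labels, one per feature vector.
    Returns a dict mapping label -> (mean vector, variance vector).
    """
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)

    models = {}
    for label, vecs in grouped.items():
        n = len(vecs)
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        # Floor the variance so a constant feature cannot cause division by zero.
        var = [
            max(sum((v[d] - mean[d]) ** 2 for v in vecs) / n, var_floor)
            for d in range(dim)
        ]
        models[label] = (mean, var)
    return models


def log_likelihood(vec, model):
    """Log-density of vec under a diagonal Gaussian (mean, var)."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(vec, mean, var)
    )


def classify(vec, models):
    """Classification phase: return the label whose model best matches vec."""
    return max(models, key=lambda label: log_likelihood(vec, models[label]))


# Toy data (assumed, for illustration): two-dimensional feature vectors.
train_x = [[0.9, 0.1], [1.0, 0.2], [0.1, 0.8], [0.2, 1.0]]
train_y = ["car", "car", "office", "office"]
scene_models = train(train_x, train_y)
predicted = classify([0.95, 0.15], scene_models)
```

In this sketch `train` corresponds to the training phase and `classify` to the classification phase: the unknown sample's feature vector and the trained models are the two inputs, and the best-matching category is returned. A practical system would typically use a richer model, but the division of work between the two phases is the same.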
An important part of ASC is defining and extracting audio features that characterise a signal as being of a type acquired from a particular environment. Current state-of-the-art ASC systems exploit several categories of audio features, including frequency-band energy features, voicing features and detected events, to classify recorded sounds. A problem with this approach is that it relies on the right sounds being made at the right time. If the types of sound (acoustic events) that usually occur in a specific environment are absent for some reason when the recording is made, or are drowned out by other sounds, there is a risk that the ASC process may wrongly classify the environment.
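As an illustration of one of the feature categories mentioned above, the following sketch computes simple frequency-band energy features for a single frame of audio samples. It is a simplified example under stated assumptions, not the feature extraction of any specific system: the function name `band_energies`, the choice of equal-width linear bands and the normalisation to unit sum are all assumptions made for this example, and a naive discrete Fourier transform is used so that the code needs only the Python standard library.

```python
import cmath
import math


def band_energies(frame, num_bands=4):
    """Return the fraction of spectral energy in each of num_bands bands.

    frame: list of real-valued audio samples.
    The bands are equal-width groups of DFT bins below the Nyquist
    frequency; leftover bins that do not fill a whole band are ignored.
    """
    n = len(frame)
    half = n // 2

    # Naive DFT power spectrum for bins 0 .. half-1 (O(n^2), fine for a sketch).
    power = []
    for k in range(half):
        acc = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power.append(abs(acc) ** 2)

    width = half // num_bands
    energies = [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]

    # Normalise so the features describe spectral shape rather than loudness.
    total = sum(energies) or 1.0
    return [e / total for e in energies]


# Toy frames (assumed): pure tones falling in different bands.
n_samples = 64
low_tone = [math.sin(2 * math.pi * 3 * t / n_samples) for t in range(n_samples)]
high_tone = [math.sin(2 * math.pi * 20 * t / n_samples) for t in range(n_samples)]
low_features = band_energies(low_tone)
high_features = band_energies(high_tone)
```

The two toy frames yield feature vectors dominated by different bands, which shows why such features depend entirely on which sounds happen to be present in the recording: if the characteristic sounds of an environment are missing or masked, the feature vector changes and the classification may change with it.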