This application, and the innovations and related subject matter disclosed herein (collectively referred to as the “disclosure”), generally concern systems for interpreting acoustic scenes and associated techniques. More particularly, but not exclusively, disclosed acoustic-scene classification systems and associated processing techniques can be incorporated in acoustic-scene interpretation systems. For example, a disclosed acoustic-scene interpretation system can have a module configured to classify an observed acoustic scene according to one or more selected classes of acoustic scenes, with speech, music, vehicle traffic, and animal activity being but particular, non-limiting examples of acoustic-scene classes. Some disclosed interpretation systems that classify observed acoustic scenes can select an acoustic recognition engine (or module) suitable for interpreting the observed class of acoustic scene, improving computational efficiency. In addition, some disclosed classification modules incorporate event-duration models to inform an assessment of an observed acoustic scene. Such duration models can improve the accuracy and computational efficiency of classification systems and acoustic-scene interpretation systems as compared to previously proposed systems.
As used herein, the phrase “acoustic scene” means an event or a sequence of events giving rise to an acoustic signal diffused in an environment associated with the event or sequence of events. Generally speaking, acoustic-scene interpretation concerns assessing or determining, or otherwise inferring, information about the event or sequence of events and/or the associated environment forming a given acoustic scene from observations of the diffused acoustic signal. Such information, which can vary in time, can be used to associate the acoustic scene with one or more selected classes of acoustic scenes.
Acoustic-scene classification generally involves three aspects. First, signal processing techniques can be applied to a sequence of frames representing an acoustic signal to derive several statistical measures (sometimes referred to in the art as “acoustic features”) of each frame. Second, tuned acoustic models (e.g., machine learning systems), which serve as an inference module, can determine one or more activity hypotheses that explain the observed combinations of acoustic features. Third, a heuristic layer can be applied to the activity hypotheses in an attempt to resolve or otherwise “clean up” apparent mistakes made by the inference module.
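By way of illustration only, the first two aspects described above can be sketched as follows. The sketch assumes two simple acoustic features (root-mean-square energy and zero-crossing rate) and substitutes a hypothetical threshold rule for a tuned acoustic model; the function names, class labels, frame sizes, and threshold values are assumptions made for this example and are not taken from the disclosure.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into (possibly overlapping) frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_features(frame):
    """Return (rms_energy, zero_crossing_rate) for a single frame."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return rms, zcr


def classify_frame(features, rms_floor=0.05, zcr_split=0.25):
    """Toy stand-in for a tuned acoustic model (hypothetical thresholds)."""
    rms, zcr = features
    if rms < rms_floor:
        return "silence"
    return "noise" if zcr > zcr_split else "tonal"


def classify_signal(samples, frame_len=64, hop=32):
    """Produce one activity hypothesis per frame."""
    return [classify_frame(frame_features(frame))
            for frame in frame_signal(samples, frame_len, hop)]
```

In practice, the threshold rule would be replaced by a trained model, and the per-frame hypotheses returned by `classify_signal` would then be passed to any subsequent post-processing stage.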
A common problem faced by previously known scene classification systems is that the inference module often makes mistakes, which manifest as spurious, short-lived classifications (sometimes referred to as “false detections” or “gaps”). A heuristic layer can remove such mistakes or fill in gaps in the classifications made by the inference module, but this processing can impose substantial computational overhead and render speech-recognition or other acoustic-scene interpretation systems inefficient.
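For illustration, a conventional heuristic layer of the kind described above might be sketched as a minimum-duration filter over per-frame labels. This is a minimal sketch of the prior approach, not of the disclosed duration models; the function names and the `min_frames` parameter are hypothetical.

```python
from itertools import groupby


def runs(labels):
    """Collapse a label sequence into [label, length] runs."""
    return [[label, len(list(group))] for label, group in groupby(labels)]


def smooth_labels(labels, min_frames=3):
    """Absorb runs shorter than min_frames into the preceding run."""
    out = []
    for label, length in runs(labels):
        if out and length < min_frames:
            # Treat a short-lived run as spurious and relabel its frames.
            out[-1][1] += length
        elif out and out[-1][0] == label:
            # Merge with an identical neighbor left behind by a removed run.
            out[-1][1] += length
        else:
            out.append([label, length])
    # Expand the runs back into one label per frame.
    return [label for label, length in out for _ in range(length)]
```

For example, a single spurious “music” frame inside a stretch of “speech” frames is relabeled as “speech”, and the two surrounding “speech” runs are merged into one. Note that this extra pass over every hypothesis is part of the overhead attributed to heuristic layers above.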
Thus, a need remains for computationally efficient techniques for classifying acoustic scenes, as well as for computationally efficient scene classifiers. A need also remains for relatively more accurate inference techniques and modules. Improved acoustic-scene interpretation systems, including real-time interpretation systems, are needed as well.