Many applications benefit from knowledge of the user's location, for instance, tagging of uploaded consumer videos based on geo-location. Because people spend most of their time indoors, it is often desirable to also identify the room environment of a user. Under ideal conditions, GPS technology can estimate an outdoor geo-location to within a few meters. Inside buildings, however, this technology is known to fail. Attempts to additionally use the strength of WiFi signals to improve accuracy are known. However, if WiFi coverage is insufficient, or the capturing device does not support this technology, the indoor location cannot be estimated.
In general, people spend most of their time indoors and, as such, in reverberant environments. The human auditory system is well adapted to extracting information from a reverberant audio stream. Based on accumulated perceptual experiences in different rooms, a person can often recognize a specific environment just by listening to the audio content of a recording; e.g., a person can distinguish a recording made in a reverberant church from a recording captured in a conference room.
With the emerging trend of location-based multimedia applications, such as automatic tagging of uploaded user videos, knowledge about the room environment is an important source of information. GPS data may only provide a rough location estimate and tends to fail inside buildings. Attempts to use the strength of WiFi signals to improve accuracy were presented, e.g., in E. Martin, O. Vinyals, G. Friedland, and R. Bajcsy, “Precise indoor localization using smart phones,” In Proceedings of the international conference on Multimedia, pages 787-790. ACM, 2010. However, in these approaches, the location must be estimated and stored as metadata at the time of the capturing process. If either GPS or WiFi coverage is insufficient, or the capturing device does not support location identification technology, the location cannot be estimated. In A. Ulges and C. Schulze, “Scene-based image retrieval by transitive matching,” In Proc. of the ICMR, pages 47:1-47:8, Trento, Italy, 2011, ACM, an alternative method predicts common locations by identifying visual similarities (landmarks or similar interior objects). This approach does not account for changes in spatial configuration that may occur, such as when new tenants or home owners move furniture or redesign their rooms. In H. Malik and H. Zhao, “Recording environment identification using acoustic reverberation,” In Proc. of the ICASSP, pages 1833-1836, Kyoto, 2012, IEEE, a method is described to estimate the recording environment using a two-fold process. First, a de-reverberation process is applied to an audio recording to estimate the reverberant part of the signal. In other words, the reverberant component has to be filtered out from the audio recording. This process (also known as blind de-reverberation) is computationally demanding and may not be suitable for low-power mobile devices such as smart phones, hearing aids, etc.
Second, 48 audio features are extracted from the estimated reverberant part and used to train room models using a support vector machine (SVM) learning concept for identifying the acoustic environment. Thus, to identify a recording environment in Malik, the reverberant components within an unknown audio recording have to be estimated first using the blind de-reverberation step. Then, the acoustic features can be extracted and used in the SVM to estimate the recording environment. The Malik system was tested only on speech from two people. The present invention is different in that it does not perform blind de-reverberation. Instead, the present invention extracts acoustic features directly from the audio recording, which can contain speech or musical signals. Also, the machine learning of the present invention uses a different approach, i.e., a Gaussian mixture model (GMM)-universal background model (UBM).
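By way of illustration only, the GMM-UBM concept named above can be sketched as follows. The sketch assumes one common GMM-UBM recipe: a diagonal-covariance UBM trained by expectation-maximization on pooled features, mean-only MAP adaptation per room with a relevance factor of 16, and scoring by the log-likelihood ratio against the UBM. None of these details, nor the function names, are taken from the text above. The two-dimensional synthetic features are hypothetical stand-ins for acoustic features extracted directly from recordings: the first dimension mimics content variation shared by all rooms, and the second carries a room-dependent shift.

```python
import math
import random

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(vals):
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))

def component_logs(gmm, x):
    """log(weight * density) of x for every mixture component."""
    return [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]

def posteriors(gmm, x):
    """Responsibility of each component for frame x."""
    logs = component_logs(gmm, x)
    norm = logsumexp(logs)
    return [math.exp(v - norm) for v in logs]

def avg_loglik(gmm, frames):
    """Average per-frame log-likelihood of the frames under the GMM."""
    return sum(logsumexp(component_logs(gmm, x)) for x in frames) / len(frames)

def weighted_mean(resp, frames, c, n_c, dim):
    return [sum(r[c] * x[d] for r, x in zip(resp, frames)) / n_c for d in range(dim)]

def train_ubm(frames, k, iters=20, seed=0):
    """Fit a k-component diagonal GMM to pooled frames with plain EM."""
    n, dim = len(frames), len(frames[0])
    g_mean = [sum(x[d] for x in frames) / n for d in range(dim)]
    g_var = [max(sum((x[d] - g_mean[d]) ** 2 for x in frames) / n, 1e-3)
             for d in range(dim)]
    starts = random.Random(seed).sample(frames, k)
    gmm = [(1.0 / k, list(s), list(g_var)) for s in starts]
    for _ in range(iters):
        resp = [posteriors(gmm, x) for x in frames]           # E-step
        updated = []
        for c in range(k):                                    # M-step
            n_c = max(sum(r[c] for r in resp), 1e-10)
            mean = weighted_mean(resp, frames, c, n_c, dim)
            var = [max(sum(r[c] * (x[d] - mean[d]) ** 2
                           for r, x in zip(resp, frames)) / n_c, 1e-3)
                   for d in range(dim)]
            updated.append((max(n_c / n, 1e-6), mean, var))   # floored weight
        gmm = updated
    return gmm

def map_adapt(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of the UBM toward one room's frames."""
    dim = len(frames[0])
    resp = [posteriors(ubm, x) for x in frames]
    adapted = []
    for c, (w, mean, var) in enumerate(ubm):
        n_c = sum(r[c] for r in resp)
        alpha = n_c / (n_c + relevance)        # more data -> trust the data more
        data_mean = weighted_mean(resp, frames, c, max(n_c, 1e-10), dim)
        new_mean = [alpha * e + (1 - alpha) * m for e, m in zip(data_mean, mean)]
        adapted.append((w, new_mean, var))
    return adapted

def identify_room(room_models, ubm, frames):
    """Return the room whose adapted model best beats the UBM, plus all scores."""
    base = avg_loglik(ubm, frames)
    scores = {room: avg_loglik(g, frames) - base for room, g in room_models.items()}
    return max(scores, key=scores.get), scores

def make_frames(room_shift, n, rng):
    """Synthetic 2-D stand-ins for per-frame acoustic features (hypothetical)."""
    return [[rng.gauss(rng.choice((-5.0, 5.0)), 1.0), rng.gauss(room_shift, 1.0)]
            for _ in range(n)]

rng = random.Random(1)
train = {"church": make_frames(1.0, 200, rng),
         "conference room": make_frames(-1.0, 200, rng)}
ubm = train_ubm(train["church"] + train["conference room"], k=2)
rooms = {name: map_adapt(ubm, frames) for name, frames in train.items()}
best, scores = identify_room(rooms, ubm, make_frames(1.0, 50, rng))
print(best, scores)
```

In this sketch no de-reverberation step precedes feature extraction: the room models are derived from the shared UBM, and an unknown recording is assigned to the room whose adapted model yields the highest log-likelihood ratio.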