There are many scenarios where being able to recognize audio environments and/or events can prove to be especially beneficial. This is because audio often provides a common thread that ties other sensory events together. Being able to exploit this audio characteristic would allow for products and services that can facilitate such things as security, surveillance, audio indexing and browsing, context awareness, video indexing, games, interactive environments, and movies and the like.
For example, workloads for security personnel can be lessened by reducing demands that would otherwise overwhelm a worker. Consider a security guard who must watch 16 monitors at a time, but does not monitor the audio because listening to the 16 audio streams would be impossible and/or might violate privacy. If sound events like footsteps, doors opening, and voices and the like can be recognized, they could be shown visually along with the video to enable the worker to have a better sense of what's going on at each location watched by the 16 monitors. Likewise, surveillance could be enhanced by distinguishing between sound events. For example, baby monitors are currently triggered by sound energy alone, creating false alarms for worried parents. If a monitor could differentiate between crying, gurgling, lightning, and footsteps and the like and trigger a baby alarm only when necessary, this would increase the safety of the baby through a much more reliable monitoring system, easing parents' concerns.
Sometimes because an audio recording is extremely long and contains a lot of information, it is very time consuming for an audio editor to review it. Current technology often just displays an audio waveform on a timeline, making it very difficult to browse visually to a desired spot in the recording. If it were possible to recognize and label different events (e.g., voices, music, cars, etc.) and environments (e.g., café, office, street, mall, etc.), it would be far easier to browse through the recording visually and find a desired spot to review. This would save both time and money for a business that provided such editing services.
Occasionally, it is also beneficial to be able to easily discern what type of environment a device is currently located in. With this type of “contextual awareness,” the device could adjust parameters to compensate for such things as noise levels (e.g., noisy, quiet), and/or appropriateness (e.g., church, funeral) for a particular action and the like. For example, the loudness of a cell phone ring could be adapted to respond based on whether a user was in a café, office, and/or lecture hall and the like.
It is also desirable to be able to synthesize auditory environments effectively with high accuracy. A film sound engineer might want to recreate an office meeting environment to utilize in a new film. If the engineer can create or synthesize an office environment, a discussion on a multi-million dollar controversial condominium development can be dubbed onto the recording so that the audience believes the conversation takes place in an office. As another example of environmental interest, a recording of the ‘great outdoors’ can be made. The recording might have the sweet sound of bird chirps and morning crickets. Parts of the environmental sounds could be synthesized into a gaming environment for children. Thus, sound synthesizing is highly desirable for interactive environments, games, and movies and the like.
Video indexing is also an area that could benefit substantially by recognizing auditory events and environments. There are a variety of current techniques that break a video up into shots, but often the visual scene changes drastically as a camera pans from, for example, a café to a window, and the techniques incorrectly create a new shot. However, during the panning, oftentimes the audio remains similar. Thus, if an auditory environment could be reliably recognized as being similar, it could be determined that a visual scene has not changed. Additionally, this would allow the ability to retrieve particular kinds of scenes (e.g., all beach scenes) which are very similar in terms of auditory environments (e.g., same types of beach sounds), though quite different visually (e.g., different weather, backgrounds, people, etc.).
Thus, being able to efficiently and reliably recognize auditory events and environments is extremely desirable. Techniques that could accomplish this could benefit a wide range of products and industries, even those that are not typically thought of as being driven by audio related functions, easing workloads, increasing safety, increasing customer satisfaction, and allowing products that would not otherwise be possible. It would even be able to enhance and extend an existing product's usefulness and flexibility.