The specification relates to guiding computational perception through a shared auditory space.
Blindness is frequently associated with aging. As the world's population continues to get older, the number of blind and visually impaired individuals will likely increase. These individuals often want to know about what they hear in the world around them. They want to know what other people can “see”. Existing systems fail to provide an effective solution that can help these individuals learn about arbitrary objects that they can only hear about, whose exact location they do not know, or whose uniquely identifying traits they do not know.
In some cases, guide dogs, which are probably the most well-known aid outside the blind community, are used to help these individuals. However, these dogs are expensive to train, require a lot of work to keep, and are capable of serving these individuals only for a limited number of years. While guide dogs provide valuable services, providing them to all of these individuals is not realistic as the blind population grows.
Today, a robot is capable of observing many objects and/or actions in its surroundings, including people, cars, advertisements, etc. Sizable online databases even allow real-time training for computational perception, so that new classifiers can be created on the fly as needed. However, such a robot generally cannot run an endless number of classifiers all the time. Doing so is too computationally intensive, would likely generate too many false positives with even the best of classifiers, and would overwhelm a human user associated with the corresponding system. Some existing robotic solutions have demonstrated how a human can guide the system using gesture or speech, and some include robots that are configured to localize sound sources using onboard microphone arrays. However, these solutions generally utilize only what a single agent can detect about an object of interest. For instance, these solutions take into consideration either what a human has detected or what a robot has detected about the object of interest, but generally not both. As a result, these solutions often suffer from poor accuracy and ambiguity, and can provide poor guidance for other computational perception systems.
Some existing solutions can find objects of interest in a visual space and then guide a computer vision system to the right target. For audible objects, sound source localization has been used to guide other sensors, but not generally in conjunction with what a human can hear. For instance, in some popular solutions, GPS location is used to query a mobile computer about its surroundings. However, these solutions have difficulty identifying transient sources of noise. In particular, these solutions often fail to query about sources that move or are short in duration because such sources do not persist long enough to be queried by GPS proximity. Other existing solutions use pointing gestures in human-robot interaction to guide a robot to a target in multiple perceptual domains. However, these solutions generally require that a shared visual space exist between a human and a computer and are therefore inflexible. In another solution, multi-array sound source localization is used to identify audible objects. Although this solution can identify auditory objects of interest, it suffers from significant limitations. For instance, this solution assumes that all sensors are microphone arrays having similar capabilities and limitations, which is impractical. In addition, a user of this solution would have to wear additional hardware in order to triangulate accurately on the source location. Further, the solution accounts for only one type of stimulus, which is often not unique enough to be of interest to a user. As a result, any other sounds that the user might be interested in would have to be pre-specified and trained, which is time-consuming and computationally expensive.