For a mobile robot to operate autonomously, it should be able to learn about, locate, and possibly avoid objects as it moves within its environment. For example, a ground, aerial, or underwater mobile robot may acquire images of its environment, process them to identify and locate objects, and then plot a path around the objects identified in the images. Additionally, such learned objects may be located in a map (e.g., a world-centric, or allocentric, human-readable map) for later retrieval, or to provide the user with additional information about what is present in the environment. In some cases, a mobile robot may include multiple cameras, e.g., to acquire stereoscopic image data that can be used to estimate the range to certain items within its field of view. A mobile robot may also use other sensors, such as RADAR or LIDAR, to acquire additional data about its environment. RADAR is particularly useful for peering through smoke or haze, and LIDAR returns can sometimes be used to determine the composition of objects within the environment.
A mobile robot may fuse LIDAR, RADAR, IR, ultrasound, and/or other data with visible image data in order to more accurately identify and locate obstacles in its environment. To date, however, sensory processing of visual, auditory, and other sensor information (e.g., LIDAR, RADAR) has conventionally been based on “stovepiped,” or isolated, processing, with little interaction between modules. For this reason, continuous fusion and learning of pertinent information has been a challenge. Additionally, learning has been treated mostly as an off-line process, which happens in a separate time frame from the robot's performance of its tasks.
In contrast, animals learn and perform simultaneously, effortlessly segmenting sensory space into coherent packets to be fused into unique object representations. An example is a conversation between two people at a crowded party, where the signal-to-noise ratio (S/N) of the speaker's voice is extremely low. Humans are able to focus visual attention on the speaker, enhance S/N, bind the pitch of the voice to the appropriate person speaking, and learn the joint “object” (visual appearance and speaker identity) so that recognition of that person is possible with one modality alone.