The present disclosure relates to autonomous human-centric place recognition.
Today many autonomous computer systems rely on detection and recognition techniques for a variety of applications. In place or environment classification, systems are rapidly improving through the use of complex auditory, visual, or multimodal learners. The challenge, however, is not in the classification of good data, but rather in overcoming poor sensor positioning at the time knowledge is required. For instance, a robot interacting with a person on a couch may see a large wall behind the person, but that wall may not contain adequate scene complexity to correctly classify the environment. Alternatively, even with a relatively open view of the environment, some rooms are multi-purpose, defying simple classification strategies. Further, when a robot is crossing from one room into the next, it often has difficulty identifying that transition and determining the correct context. When an autonomous agent, such as a robot, needs to make a decision based on the classification results, these “boundary conditions” become a significant barrier to deployment on a mobile sensor.
Place recognition or labelling is not a new field. It is also commonly called scene recognition or place categorization. A variety of approaches and sensors can now be used to identify the type of scene currently being observed. Some existing methods categorize the types of objects in the environment and then learn the semantic place label associated with those objects, such as that described by Shrihari Vasudevan, Stefan Gächter, Marc Berger & Roland Siegwart, “Cognitive Maps for Mobile Robots—An Object based Approach”, Intelligent Robots and Systems (IROS), San Diego, USA, 2007.
Other existing methods perform direct image-based classification. New work in deep learning, for instance, utilizes large image databases now available online for single-viewpoint classification, as described by Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva, “Learning Deep Features for Scene Recognition using Places Database”, NIPS 2014.
The foregoing single-observation classification methods, however, are unable to identify the place label by themselves when the camera or sensor is poorly positioned. They also commonly fail when categorizing places that contain more than one environment, returning the label of one environment or the other, and sometimes neither.
To correct for these errors in sensor positioning, the robotics community has focused on fusing sensor data over physical space. One possibility discussed is to use commonly available location sensors, e.g. GPS, to compare the picture location to a previously labeled map. Then, the combination of the GPS-predicted location and the classified place label is used to estimate the place. This approach, which is discussed in U.S. Pat. No. 8,798,378 by Boris Babenko, Hartwig Adam, John Flynn, and Hartmut Neven, titled Scene Classification for Place Recognition, mitigates problems with poorly positioned sensors, but is designed for larger place categories such as a city or tourist attraction. It does not solve labelling challenges with small indoor environments, transition regions, or multi-purpose spaces.
Another form of sensor fusion is to construct a topological map of the environment. The idea, as described by Aravindhan K Krishnan and K Madhava Krishna, “A Visual Exploration Algorithm using Semantic Cues that Constructs Image based Hybrid Maps”, Intelligent Robots and Systems (IROS), Taipei, Taiwan, 2010, is to take advantage of a video stream from a mobile sensor, rather than the single-image approach, by seeking images that are significantly different from those that came before them. While mapping the environment, the robot clusters regions of similar class and self-identifies change points between one room and the next. The resulting map is more of a topological graph. Although this method shows improvement for poor sensor positioning, it assumes that each room has a homogeneous purpose and that transitions are well defined, which is often not true in real environments. A similar approach for improving a classification algorithm without generating the actual map is described in U.S. Pat. No. 8,565,538 by Ananth Ranganathan, titled Detecting and Labeling Places using Runtime Change-point Detection.
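By way of illustration only, the general idea of self-identifying change points in a stream of observations can be sketched as follows. This is a minimal heuristic written for this discussion and is not the algorithm of either cited reference. The function name `segment_stream`, the use of Euclidean distance between per-frame descriptor vectors, and the fixed `threshold` are all assumptions. The input is assumed to be a non-empty sequence of equal-length numeric descriptors.

```python
import numpy as np

def segment_stream(descriptors, threshold):
    """Return the frame indices at which a new segment (candidate room) starts.

    Hypothetical heuristic: a frame opens a new segment when its descriptor
    lies farther than `threshold` (Euclidean distance) from the running mean
    of the descriptors in the current segment.
    """
    change_points = [0]                      # the first frame always opens a segment
    mean = np.array(descriptors[0], dtype=float)  # copy, so the input is not mutated
    count = 1
    for i in range(1, len(descriptors)):
        d = np.asarray(descriptors[i], dtype=float)
        if np.linalg.norm(d - mean) > threshold:
            # Observation differs markedly from the current segment: change point.
            change_points.append(i)
            mean, count = d.copy(), 1
        else:
            # Fold the observation into the running mean of the current segment.
            count += 1
            mean += (d - mean) / count
    return change_points

# Toy stream: three frames near the origin, then three frames near (5, 5).
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 4.9]]
print(segment_stream(frames, threshold=1.0))
```

A sketch of this kind shares the limitation noted above: it presumes that consecutive rooms produce clearly separated observations, so a gradual transition or a multi-purpose room may yield no change point or several spurious ones.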
An alternative fusion method is the occupancy grid. Ananth Ranganathan and Jongwoo Lim, in their work titled “Visual Place Categorization in Maps”, Intelligent Robots and Systems (IROS), San Francisco, USA, 2011, describe using each measurement from a place recognition algorithm to update an occupancy grid as the robot moves through the space. Importantly, each measurement update reflects the region of view observed by the camera, attempting to learn a classification for both obstacles and empty space in the occupancy grid. As with topological maps, this sensor fusion strategy helps overcome basic directionality problems, particularly from cameras, but it also introduces additional problems. First, the map does not directly answer the place recognition question. Given a map, how does a robot identify the place label for use in its application? This group does not apply the map to any application, so it does not address how to best utilize the resulting fused representation in human-robot interaction or any other domain. The second problem is that this map is a static representation focused on point cloud data. It is difficult to make changes in real time to a map representation, or to incorporate non-point cloud data, either of which may aid in reducing ambiguity in multi-purpose environments.
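For illustration, a per-cell update of the kind described above might be sketched as follows. This is a simplified sketch under stated assumptions and not the method of the cited work: the label set `LABELS`, the wedge-shaped field-of-view model, the grid coordinates, and every function name are hypothetical. The classifier scores passed as `likelihood` are assumed to be strictly positive.

```python
import numpy as np

LABELS = ["kitchen", "living_room", "office"]  # hypothetical label set

def make_grid(rows, cols, n_labels):
    """Start every cell with a uniform belief over the place labels."""
    return np.full((rows, cols, n_labels), 1.0 / n_labels)

def cells_in_view(shape, robot_rc, heading_rad, fov_rad, max_range):
    """Boolean mask of cells inside a simple wedge-shaped field of view.

    Heading 0 points toward increasing column index (an assumed convention).
    """
    rows, cols = shape
    r, c = np.mgrid[0:rows, 0:cols]
    dr, dc = r - robot_rc[0], c - robot_rc[1]
    dist = np.hypot(dr, dc)
    bearing = np.arctan2(dr, dc)
    # Wrap the angular difference into [-pi, pi].
    diff = np.angle(np.exp(1j * (bearing - heading_rad)))
    return (dist <= max_range) & (np.abs(diff) <= fov_rad / 2)

def update(grid, mask, likelihood):
    """Bayes update of the observed cells only, then renormalize each cell."""
    posterior = grid[mask] * likelihood
    grid[mask] = posterior / posterior.sum(axis=1, keepdims=True)
    return grid

def label_at(grid, rc):
    """Naive query: the most probable label stored in a single cell."""
    return LABELS[int(np.argmax(grid[rc]))]

grid = make_grid(10, 10, len(LABELS))
mask = cells_in_view((10, 10), robot_rc=(5, 5), heading_rad=0.0,
                     fov_rad=np.pi / 2, max_range=3)
update(grid, mask, likelihood=np.array([0.7, 0.2, 0.1]))
print(label_at(grid, (5, 7)))   # a cell in front of the robot
print(grid[5, 3])               # a cell behind the robot keeps its uniform prior
```

The single-cell lookup in `label_at` is only one possible way to query such a grid. It does not resolve the question raised above of which cell, or which combination of cells, an application should consult to obtain a place label.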