Computer vision provides computers or automated machines with visual abilities. It is therefore desirable in computer vision to give such systems the ability to reason about the physical world by understanding what is being seen in 3D, for example from images captured by cameras. In other words, applications in robotics, virtual reality (VR), augmented reality (AR), and merged reality (MR) may need to understand the world around the robot or person providing the point of view in the applications. For example, a robot needs to understand what it sees in order to manipulate (grasp, move, etc.) objects. VR, AR, or MR applications need to understand the world around the person providing the point of view so that, for example, when the person moves in such a world, the person is shown avoiding obstacles in that world. This ability also permits such computer vision systems to add semantically plausible virtual objects to the world environment. Thus, a system that understands it is seeing a lamp can understand the purpose and operation of the lamp. For these purposes, a 3D semantic representation of the world in the form of a semantic segmentation model (or just semantic model) may be formed by using 3D semantic segmentation techniques.
Such semantic segmentation techniques often involve constructing a 3D geometric model, and then constructing a 3D semantic model based on the geometric model. The 3D semantic model is formed of voxels that are each assigned a definition of the object that the voxel is part of in a 3D space, such as furniture like “chair”, “sofa”, or “table”, or parts of the room such as “floor” or “wall”, and so forth. The 3D semantic model is updated over time by segmenting a current frame to form a segmented frame, and registering the segmented frame to the model based on the current camera pose used to form the current frame and on either heuristic rules or a Bayesian update. The semantic model then may be used by different applications, such as computer vision, to perform tasks or analysis of the 3D space as described above.
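By way of illustration only, the per-frame Bayesian update mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: each voxel is assumed to hold a probability distribution over a fixed label set, the frame segmenter is assumed to output per-pixel class probabilities, and the names `make_model`, `bayesian_update`, `voxel_label`, and the `pixel_to_voxel` lookup are hypothetical. The lookup stands in for the projection that would be derived from the current camera pose.

```python
# Example labels taken from the text; a real label set would be larger.
LABELS = ["chair", "sofa", "table", "floor", "wall"]


def make_model(num_voxels):
    """Start every voxel with a uniform distribution over LABELS."""
    uniform = 1.0 / len(LABELS)
    return [[uniform] * len(LABELS) for _ in range(num_voxels)]


def bayesian_update(model, frame_probs, pixel_to_voxel):
    """Fuse one segmented frame into the semantic model.

    frame_probs: dict mapping pixel index -> class probabilities from the
        frame segmenter (assumed input format).
    pixel_to_voxel: hypothetical dict mapping pixel index -> voxel index,
        standing in for the camera-pose-based registration.
    """
    for pixel, voxel in pixel_to_voxel.items():
        prior = model[voxel]
        # Posterior is proportional to prior times the per-frame likelihood.
        unnormalized = [p * q for p, q in zip(prior, frame_probs[pixel])]
        total = sum(unnormalized)
        if total > 0.0:  # skip degenerate observations
            model[voxel] = [u / total for u in unnormalized]
    return model


def voxel_label(model, voxel):
    """Return the most probable label currently stored for a voxel."""
    probs = model[voxel]
    return LABELS[probs.index(max(probs))]


if __name__ == "__main__":
    model = make_model(num_voxels=2)
    # Two consecutive frames both observe voxel 0 through pixel 0.
    bayesian_update(model, {0: [0.6, 0.1, 0.1, 0.1, 0.1]}, {0: 0})
    bayesian_update(model, {0: [0.2, 0.5, 0.1, 0.1, 0.1]}, {0: 0})
    print(voxel_label(model, 0), model[0])
```

In this sketch, each call to `bayesian_update` uses only the class probabilities of the current segmented frame and the distribution already stored in each voxel, so the frame itself is segmented independently of any earlier frames.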
Such updating of the semantic segmentation model, however, is often inaccurate and results in low performance because it does not adequately factor in the history of the semantic updating. In other words, the semantic segmentation is often updated a frame at a time. A current frame is semantically segmented to form a segmented or label frame, and this is repeated for individual current frames in a video sequence. Each semantically segmented frame is then used to update the semantic model, depending on the camera (or sensor) pose used to form the current frame. The semantic segmentation of the current frame is typically performed without factoring in the sequence or history of semantic updating that occurred previously during the video sequence. This results in a significantly less accurate analysis and in errors in the semantic assignments to vertices or voxels in the semantic model.