Video segmentation, that is, segmentation of RGBt volumes, has many applications, including but not limited to security, data summarization, entertainment, and video conferencing. Such algorithms often must process video quickly or in real time, which constrains their complexity. When the results are presented directly to human observers, segmentation algorithms face additional performance requirements, because humans are very sensitive to visual artifacts such as noisy boundaries and flickering, and to semantic errors such as missing parts of semantically consistent entities and objects.
Many of these human-observer-targeted applications emphasize immersive experiences, in which subjects are extracted from their surroundings and placed on virtual backgrounds, either for entertainment or for remote communication and collaboration. For such applications to be truly immersive and functional, users and other objects of interest in the local scene must be segmented in real time. Furthermore, segmentation boundaries should be precise, and the extracted RGBt volumes should be smooth, coherent, and semantically cohesive to provide a pleasant viewing experience.
However, traditional segmentation processes used in these immersive applications do not perform consistently under certain background conditions. For instance, in a video-conferencing scenario, participants may be located in different environments; each participant is segmented from his or her surrounding environment, or background, and placed on a virtual background for viewing by the other participants. Typically, when segmentation is performed only in the red, green, and blue (RGB) domain, extracting a subject from the background relies on assumptions that severely limit performance. These assumptions either impose a static, unchanging background or require the background to follow a limited statistical model that distinguishes background from foreground. As a result, these processes have trouble segmenting subjects in common environments, which results in missing parts of the subject. Common environments that cause segmentation problems include a dynamically changing background such as a street scene, a background that closely matches the subject's clothing, and shadows that weaken the edge boundaries between the subject and the background.
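To make the static-background assumption concrete, the following is a minimal sketch, not the method of this disclosure, of per-pixel background subtraction in the RGB domain. It assumes a fixed reference background image and a hypothetical Euclidean color-distance threshold of 30; the function name `static_background_mask` is illustrative only. On a toy four-pixel row it reproduces two of the failure modes described above: a clothing pixel whose color nearly matches the wall is missed, and a background pixel that changed (for example, a passing vehicle on a street) is wrongly labeled foreground.

```python
import numpy as np

def static_background_mask(frame, background, threshold=30.0):
    """Hypothetical sketch: label a pixel as foreground when its RGB
    Euclidean distance from a fixed background image exceeds `threshold`
    (an assumed value, not taken from the source)."""
    diff = frame.astype(np.float64) - background.astype(np.float64)
    distance = np.sqrt((diff ** 2).sum(axis=-1))  # per-pixel RGB distance
    return distance > threshold

# Static reference background: a uniform gray wall, 1 row x 4 pixels.
background = np.full((1, 4, 3), 120, dtype=np.uint8)

# Current frame: the subject occupies pixels 1 and 2.
# Pixel 2 is clothing that is nearly the same gray as the wall.
# Pixel 3 is background that changed (e.g., a passing vehicle), not the subject.
frame = np.array([[[120, 120, 120],
                   [200, 40, 40],
                   [125, 118, 122],
                   [30, 30, 200]]], dtype=np.uint8)

mask = static_background_mask(frame, background)
print(mask.tolist())  # [[False, True, False, True]]
```

The correct mask for this row would be `[[False, True, True, False]]`. Because pixel 2 lies about 5.7 color units from the background while pixel 3 lies about 150, no single threshold recovers the clothing pixel without also keeping the changed-background pixel, which illustrates why such assumptions limit RGB-only segmentation.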
What is desired is a segmentation process that reduces segmentation errors and unwanted artifacts such as flickering, non-smooth boundaries, and the omission of areas of an object of interest.