As computer vision and multimedia analysis technologies evolve, multimedia information resource, which are characterized by being intuitive and vivid, are becoming increasingly accessible, diversified, and publicly popular. As a result, the means to efficiently locate and separate needed video resource from video information has also become a hot research topic.
The video semantic object segmentation is a technology that, for a specific semantic category, locates and separates video pixels of objects that fall under such category from within an input video. The technology is applicable to massive Internet analysis, video editing for film making, and video-based 3D modeling, etc. Existing video semantic object segmentation methods are mostly parameterized methods which involve labeling each single location where an object is located in a video, collecting mass image video in which object locations or contours have been labeled, learning from the collection a coherent vision model represented by parameters, and applying the vision model to an input test video by performing object segmentation for target objects in the input test video based on the vision model thus learned. For example, Kundu et al. from Georgia Institute of Technology proposed a method based on feature space optimization for semantic video segmentation, which is a parameterized method that obtains a vision model by feeding massive quantities of accurately labeled video frames into a learning convolutional neural network. Lei et al. from University of Michigan proposed, in 2016, a machine learning model recurrent temporal deep field, and applied the same to video semantic object segmentation. However, such parameterized methods are disadvantageous in that, on one hand, the use of parameterized methods requires accurately labeling mass images to obtain training samples, which can be a difficult and time consuming process; and on another hand, it is difficult to efficiently update and iterate such parameterized models obtained through such training according to newly added images, a fact that suggests suboptimal adaptation to dynamic growth in vision resources. For example, if new training samples or semantic categories are added to an existing vision system, a parameterized method will have to re-train its vision model, which will be a laborious and tedious procedure that, for modern machine learning models, could take days or even weeks.