Today there is a trend towards creating and delivering richer media experiences to consumers. To go beyond the abilities of either sample-based (video) or model-based (CGI) methods, novel representations for digital media are required. One such representation is the SCENE media representation (http://3d-scene.eu). Tools therefore need to be developed for generating such media representations, enabling the capture of 3D video that is seamlessly combined with CGI.
The SCENE media representation will allow the manipulation and delivery of SCENE media to either 2D or 3D platforms, in either linear or interactive form, by enhancing the whole chain of multidimensional media production. A special focus is on spatio-temporally consistent scene representations. The project also evaluates the possibilities for standardizing a SCENE Representation Architecture (SRA).
A fundamental tool used for establishing the SCENE media representation is the over-segmentation of video. See, for example, R. Achanta et al.: “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34 (2012), pp. 2274-2282. The generated segments, also known as superpixels or patches, help to generate metadata representing a higher abstraction layer that goes beyond pure object detection. Subsequent processing steps applied to the generated superpixels allow the description of objects in the video scene and are thus closely linked to the model-based CGI representation.
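The over-segmentation step can be illustrated by a minimal SLIC-like sketch: cluster centers are seeded on a regular grid and pixels are assigned by a combined color and spatial distance. The function name and parameters below are illustrative assumptions for this sketch, not the implementation of the SCENE project or the cited work:

```python
import numpy as np

def slic_like_superpixels(image, grid_step=4, compactness=10.0, iterations=5):
    """Minimal SLIC-style over-segmentation sketch: k-means-like assignment
    in a joint color+space feature, with centers seeded on a regular grid."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Seed cluster centers on a regular grid.
    cy = np.arange(grid_step // 2, h, grid_step)
    cx = np.arange(grid_step // 2, w, grid_step)
    centers = np.array([(y, x) for y in cy for x in cx], dtype=float)
    colors = np.array([image[int(y), int(x)] for y, x in centers], dtype=float)
    labels = np.zeros((h, w), dtype=int)
    for _ in range(iterations):
        # Distance of every pixel to every center: color term plus a
        # spatial term weighted by the compactness parameter.
        d_color = np.linalg.norm(image[None, :, :, :] - colors[:, None, None, :], axis=-1)
        d_space = np.sqrt((ys[None] - centers[:, 0, None, None]) ** 2 +
                          (xs[None] - centers[:, 1, None, None]) ** 2)
        labels = np.argmin(d_color + (compactness / grid_step) * d_space, axis=0)
        # Update each center's position and mean color from its members.
        for k in range(len(centers)):
            mask = labels == k
            if mask.any():
                centers[k] = (ys[mask].mean(), xs[mask].mean())
                colors[k] = image[mask].mean(axis=0)
    return labels
```

Each resulting label region is one superpixel; the per-superpixel mean colors and positions computed here are exactly the kind of metadata the subsequent clustering steps operate on.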
An application evolving from the availability of superpixels is the generation of superpixel clusters, which creates a higher abstraction layer representing a patch-based object description of a scene. The process of superpixel cluster generation requires an analysis of different superpixel connectivity attributes, for example color similarity, depth/disparity similarity, and the temporal consistency of superpixels. The cluster generation is usually done semi-automatically: an operator selects a single initial superpixel in the scene to start with, and the cluster is then generated automatically.
A well-known clustering method for image segmentation is based on color analysis. The color similarity of different picture areas is quantified with a color distance, which is used to decide on the inclusion or exclusion of a candidate area in the cluster. The final cluster forms the connected superpixels. However, this method does not work reliably in cases where scene objects are indistinguishable by their colors. In such cases clustering based on color information will combine superpixels belonging to different objects, e.g. a person and the background, into the same cluster of connected superpixels and thus violate the object association. This weakness can be addressed by additionally analyzing depth information available in the image. Incorporating depth distance measures between superpixels into the cluster forming improves the results of connected superpixels. Evaluating both color and depth allows the detection of object borders and helps to avoid the presence of foreground and background elements within a connected superpixel.
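The color-based cluster forming with a fixed reference can be sketched as a region growing over superpixel mean colors and an adjacency graph. All names, the distance metric, and the threshold below are illustrative assumptions, not a specific published implementation:

```python
def grow_cluster_fixed_reference(mean_colors, adjacency, seed, threshold):
    """Color-based cluster forming with a fixed reference (illustrative
    sketch): every candidate's color distance is measured against the
    initially selected seed superpixel."""
    ref = mean_colors[seed]
    cluster = {seed}
    frontier = [seed]
    while frontier:
        current = frontier.pop()
        for neighbor in adjacency[current]:
            if neighbor not in cluster:
                # Fixed reference: compare the candidate's mean color
                # against the seed's color, not against 'current'.
                dist = sum((a - b) ** 2 for a, b in zip(mean_colors[neighbor], ref)) ** 0.5
                if dist <= threshold:
                    cluster.add(neighbor)
                    frontier.append(neighbor)
    return cluster
```

A candidate is included only if its color stays within the threshold of the seed, which keeps the cluster anchored to the operator's initial selection.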
A further difficulty arises from the different properties of features like color and depth when generating the connected superpixels. While color clustering evaluates the color distance between the initially selected superpixel and any candidate superpixel for cluster affiliation, the depth evaluation has to consider only pairs of directly neighboring superpixels. This is necessary because the depth information represents a three-dimensional surface in the scene, and a depth distance measured between the initial superpixel and a far-away candidate superpixel can become very large, undermining any threshold criterion. Therefore, cluster forming based on depth requires a propagating cluster reference, where the reference moves with the border of the growing cluster of superpixels. This is different from cluster forming based on color, which requires a fixed cluster reference.
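The propagating-reference behavior described above can be sketched in the same region-growing style, with the reference moving to the cluster border: depth is compared only between directly neighboring superpixels. The names and threshold are again illustrative assumptions:

```python
def grow_cluster_propagating_reference(depths, adjacency, seed, threshold):
    """Depth-based cluster forming with a propagating reference
    (illustrative sketch): depth distances are evaluated only between
    directly neighboring superpixels."""
    cluster = {seed}
    frontier = [seed]
    while frontier:
        current = frontier.pop()
        for neighbor in adjacency[current]:
            # Propagating reference: compare against the adjacent cluster
            # member 'current', never against the far-away initial seed.
            if neighbor not in cluster and abs(depths[neighbor] - depths[current]) <= threshold:
                cluster.add(neighbor)
                frontier.append(neighbor)
    return cluster
```

On a sloped surface the accumulated depth distance to the seed can exceed any threshold even though each neighboring pair differs only slightly; the propagating reference accepts such surfaces, while a fixed reference would not.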
The deviating properties, namely the fixed cluster reference needed for color information and the propagating cluster reference needed for depth information, impede a homogeneous cluster forming. Thus the cluster forming for color and depth is typically separated. In a first step, the individual clusters for depth and color are generated independently of each other, ignoring any cross information. In a second step, the two cluster results are combined by intersecting the sets. A disadvantage of this solution is that disrupted shapes of the superpixel clusters may appear, consisting of isolated areas. Such a result violates the connectivity objective.
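The two-step combination and its connectivity problem can be made concrete with a small sketch (function name and flood-fill check are illustrative assumptions): the color and depth clusters are intersected as sets, and the result is then tested for being a single connected region.

```python
def intersect_and_check_connected(cluster_a, cluster_b, adjacency):
    """Combine independently formed color and depth clusters by set
    intersection, then check via flood fill on the adjacency graph
    whether the result is still one connected region."""
    combined = cluster_a & cluster_b
    if not combined:
        return combined, True
    # Flood fill from an arbitrary member, restricted to the intersection.
    seen = set()
    frontier = [next(iter(combined))]
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(n for n in adjacency[node] if n in combined and n not in seen)
    return combined, seen == combined
```

When the intersection removes a superpixel that bridged two parts of the cluster, the flood fill no longer reaches all members, exposing exactly the disrupted, isolated-area shapes described above.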