This invention relates to three-dimensional data analysis and more particularly to object segmentation for pattern recognition with application to computer vision.
Object segmentation has been a key technique for semantic object extraction and is useful in digital video processing, pattern recognition, and computer vision. The task of segmenting/tracking a three-dimensional image in the form of a video object emerges in many applications, such as video monitoring and surveillance, video summarization and indexing, and digital entertainment. A sampling of applications includes:
Video surveillance, where the segmentation result is used to allow the identification of an intruder or of an anomalous situation and helps to anticipate and reveal patterns of actions and interactions in an environment to determine when “alerts” should be posted to a security unit.
Content-based video summarization, such as sports event summary, video skimming, and video pattern mining, namely, tasks that require the segmented semantic objects to perform the content classification, representation or understanding.
Content-based coding applications in which each frame of a video sequence is segmented into semantically meaningful objects with arbitrary shape.
Computer vision, such as video matting, video “tooning” and rendering, where segmented two-dimensional objects from the input image or video sequences can be used for 3-D scene reconstruction.
Videoconferencing and video telephony applications, in which segmentation can achieve better quality by coding the most relevant objects at higher quality.
Digital entertainment, such as video games, where specific objects can be replaced by means of segmentation.
Other possible applications include industrial inspection, environmental monitoring, and the association of metadata with the segmented objects.
Human image object segmentation is generally considered a crucial step for human recognition, behavior analysis or human-to-machine communication. The dataset and characteristics obtained from an image or the like as a so-called human object can be applied in many fields, such as video surveillance, computer vision, and video entertainment. For example, the extracted human object can be used to allow the identification of suspicious behavior, and it may help to detect problematic actions and alert a security center to possible dangers.
Generally, object segmentation can be divided into two stages: detection of the desired object, which is concerned with pattern recognition, and object extraction, which is concerned with clustering techniques. In detection mode, object segmentation can be performed in two ways, supervised and unsupervised. However, it is usually difficult to find the desired object automatically (unsupervised) because object features such as color, intensity, shape, and contour vary widely. To avoid false detection of an object of interest, many interactive methods have been developed that require the user to define the desired object in advance. Since the complicated step of object detection is avoided at the cost of interactive effort on the part of the user, these methods usually provide much better segmentation performance than automatic methods.
To satisfy future content-based multimedia services, the segmentation of meaningful objects in real-world scenes in an unsupervised manner is urgently required.
Many video segmentation approaches can be found in the literature, and they generally make use of both spatial and temporal information. The spatial segmentation method partitions each frame into homogeneous regions with respect to color or intensity. Typical partition approaches can be generally divided into region-, boundary-, and classification-based approaches.
The region-based approach, which involves region growing, splitting, and merging, relies on the homogeneity of localized features such as color, texture, motion, and other pixel statistics. The boundary-based approach primarily employs gradient information to locate object boundaries. In the classification-based approach, a partition of the feature space is first created and then translated into the video signal. This method enables a combination of cues, such as texture, color, motion, and depth. The spatial segmentation approach can yield relatively accurate object boundaries. However, the computational complexity is high enough to limit its use to non-real-time applications, since the segmentation has to be done on the whole image for every frame. In addition, a main issue of the spatial-based approaches is the lack of robustness for ‘corrupted’ cases, such as a noisy or blurry video image, where the boundaries of a region are usually missed or blended with other regions.
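By way of illustration only, the region growing mentioned above can be sketched as follows. This is a minimal sketch and not a method taken from the literature discussed here: the function name `region_grow`, the 4-connected neighborhood, the intensity-only homogeneity test, and the `tolerance` parameter are all assumptions chosen for brevity.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Collect the pixels connected to `seed` whose intensity is within
    `tolerance` of the seed intensity (hypothetical homogeneity test)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        # Visit the 4-connected neighbors of the current pixel.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    frontier.append((nr, nc))
    return region


# Toy grayscale frame: a dark area at the left, a bright area at the right,
# and one isolated dark pixel at the bottom-right corner.
frame = [
    [10, 12, 90, 91],
    [11, 13, 92, 90],
    [10, 95, 93, 12],
]
dark_region = region_grow(frame, (0, 0), tolerance=5)
```

In this toy frame the isolated dark pixel at the bottom-right corner is similar in intensity to the seed but is not reached, because region growing follows connectivity as well as homogeneity. Every pixel of the frame may have to be visited, which reflects the per-frame cost noted above.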
Temporal segmentation, on the other hand, utilizes motion rather than spatial information to obtain the initial position and boundary of objects. So-called change detection masks are the most common forms of motion information incorporated into the segmentation process. Because the objects of interest are usually moving, change detection can be done on an inter-frame or background-frame basis. Due to image noise, object boundaries are often irregular and must be refined using the spatial information of the image. As the boundary fine-tuning procedure involves only the segmented moving region instead of the whole frame, higher efficiency is achieved. However, shadow effects, reflections and noise might be incorrectly assigned to foreground objects. Moreover, it is usually difficult to distinguish between changes due to true object motion and changes due to noise, shadow effects, etc.
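By way of illustration only, an inter-frame change detection mask of the kind described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `change_detection_mask`, the fixed global `threshold`, and the grayscale frames represented as nested lists are all hypothetical choices, and practical systems typically add noise filtering and shadow handling.

```python
def change_detection_mask(previous, current, threshold):
    """Mark a pixel as changed (1) when its absolute intensity difference
    between two frames exceeds `threshold`, otherwise unchanged (0)."""
    return [
        [1 if abs(cur - prev) > threshold else 0
         for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


# Toy frames: a static background of intensity 50, then a bright object
# entering the scene along with small noise-like fluctuations.
previous_frame = [
    [50, 50, 50, 50],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]
current_frame = [
    [50, 52, 50, 50],
    [50, 200, 210, 50],
    [49, 50, 205, 50],
]
mask = change_detection_mask(previous_frame, current_frame, threshold=20)
```

The same function yields a background-frame mask if `previous` is replaced by a reference background image. The small fluctuations in the toy frames fall below the threshold, but a shadow or reflection that changes intensity by more than the threshold would be marked as foreground, which is the weakness noted above.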
Most existing video image segmentation techniques fail to automatically extract the objects in the image, since objects of interest usually correspond to multiple regions that may have large spatial-temporal variations. It is difficult to segment these objects automatically without any primary criteria for segmentation. An intrinsic problem of the “blind-segmentation” algorithms, which make no contextual knowledge assumption regarding the object being segmented, is that objects of interest may not be homogeneous with respect to low-level features, or the objects may change with environmental factors such as lighting conditions.
For these and other reasons, there is a need for improved object segmentation adapted to the dynamic human form.