(1) Field of the Invention
The invention relates to a method for automatic object detection and subsequent object tracking in accordance with the object's shape. The invention also relates to a system therefor.
(2) Description of Related Art
The automatic detection and tracking of moving objects is of central importance not only in video surveillance but also in many other areas of video technology and image processing. A large number of so-called object tracking methods exist, but these are usually limited to determining the object's current position. For many applications, however, the shape and orientation of the object are of interest in addition to its current position.
A variety of tracking methods exists for object tracking. Among the best known and most widely used are Kalman filter tracking, mean-shift tracking and particle filter tracking, as well as their extensions and variations. For example, U.S. Pat. No. 6,590,999 B1 describes a method and an apparatus for object tracking in accordance with mean-shift tracking, namely real-time mean-shift tracking of a target object whose shape is variable, such as a human. The object tracking is based on visually recognizable features, for example color or structures, wherein the statistical distribution of these features characterizes the target. The degree of similarity between a given target and a candidate target is calculated, this degree being expressed by a metric derived from the Bhattacharyya coefficient. A gradient vector corresponding to the maximization of the Bhattacharyya coefficient is then used to determine the most probable location of the target in the subsequent frame.
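The similarity measure mentioned above can be illustrated with a minimal sketch. It assumes that target and candidate are given as normalized feature histograms and uses the distance sqrt(1 - rho) commonly derived from the Bhattacharyya coefficient rho; the function names and the example histograms are hypothetical and are not taken from U.S. Pat. No. 6,590,999 B1.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Return sum_u sqrt(p_u * q_u) for two normalized histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))

def bhattacharyya_distance(p, q):
    """Distance sqrt(1 - rho); the max() guards against rounding above 1."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

# Hypothetical three-bin histograms (assumed example data).
target = [0.5, 0.3, 0.2]
candidate = [0.4, 0.4, 0.2]
rho = bhattacharyya_coefficient(target, candidate)
```

A coefficient close to 1 indicates nearly identical distributions; mean-shift tracking moves the candidate window in the direction that increases this value.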
All the traditional methods can identify the position of an object reasonably robustly, and some can also partially determine the size of the object. A determination of the actual shape and orientation of the object is, however, not possible using the traditional methods.
Tracking the shape of the object is possible only through extensions and improvements of the original procedures. Above all, the particle filter and the mean-shift methods discussed above have been developed further in this direction.
Such a particle filter version is described in the conference paper “Particle filtering for geometric active contours with application to tracking moving and deforming objects” by Rathi, Y., Vaswani, N., Tannenbaum, A. and Yezzi, A., IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, 2005. Although the shape of the object can be tracked quite well, this approach has some drawbacks.
For example, some information on the object shape is provided to the algorithm so that the object's shape can still be described even when the object is largely occluded. This in turn means that, for very large deformations, the shape cannot be tracked very accurately.
The performance of the method also degrades severely if the object is completely hidden for a long time.
A further development of the mean-shift procedure for tracking the shape of the object was presented in the conference paper “Object Tracking by Asymmetric Kernel Mean Shift with Automatic Scale and Orientation Selection” by A. Yilmaz, in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2007, pp. 1-6. Instead of a symmetric filter kernel, a filter kernel defined by level-set functions is used, which is adapted to the shape of the object. Furthermore, the search space is expanded by a scale dimension and an orientation dimension. Thus, in addition to the object position, the size and the orientation of the object or its contour can also be determined. However, since the orientation of the object is only calculated within the 2D image plane, the object shape cannot be adapted to the actual movement of the object in three-dimensional space.
Another tracking algorithm, which cannot be assigned to any of the three basic methods discussed above, is based on the so-called machine learning approach. In this approach, both Hidden Markov Models and geometric object features can be considered to calculate the object's shape. Since the method determines the contour points of the object by classification, the classifier must first be trained using a training set (certain features). Thus, of course, a training set must be present or generated. Because each pixel must be considered in the classification, a particularly large number of features and thus a relatively large training set is required.
A further characteristic of most tracking methods is that they cannot automatically detect the objects to be tracked. Many tracking algorithms therefore depend either on user input or on the results of a previously performed object recognition. In general, a system for object tracking therefore comprises a component for object recognition and the actual tracking algorithm.
FIG. 9 shows the schematic procedure of automated object tracking in such a prior-art system, consisting of object recognition based on Gaussian Mixture Models and mean-shift tracking.
Adaptive Gaussian Mixture Models (GMM) are a widely used background subtraction method. As proposed in C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 1999, each pixel of a scene can be modeled by a mixture consisting of K different Gaussian functions. The modeling is based on an estimate of the probability density of the color values of each pixel. It is assumed that the color value of a pixel is determined by the surface of the object that is imaged onto the pixel under consideration. In the case of an ideal, static scene without noise, the probability density of the color value of a pixel can be described by a Dirac delta function. Due to camera noise and illumination changes, however, the color value of a pixel changes over time even in a real static scene.
In non-stationary scenes, it can moreover be observed that up to K different objects k = 1 ... K can be mapped to a pixel. Therefore, for monochromatic video sequences, the probability density of a pixel color value X caused by an object k can be modeled by the following Gaussian function with mean $\mu_k$ and standard deviation $\sigma_k$:
$$\eta(X, \mu_k, \sigma_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{1}{2}\left(\frac{X - \mu_k}{\sigma_k}\right)^{2}} \qquad (1)$$

$$\eta(X, \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{\frac{n}{2}}\,|\Sigma_k|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(X - \mu_k)^{T}\,\Sigma_k^{-1}\,(X - \mu_k)} \qquad (2)$$

where $\Sigma_k$ denotes an n-by-n covariance matrix of the form $\Sigma_k = \sigma_k^2 I$, since it is assumed that the RGB color channels are independent and possess the same standard deviation. This assumption does not correspond to the facts, but it avoids a computationally very intensive matrix inversion. The probability that a pixel x in the image t has the color value X corresponds to the weighted mixture of the probability density functions of the k = 1 ... K objects that can be mapped to the pixel:
$$P(X_t) = \sum_{k=1}^{K} \omega_{k,t} \cdot \eta(X_t, \mu_{k,t}, \Sigma_{k,t}) \qquad (3)$$

with weighting factor $\omega_k$. In practice, K is often restricted to values from 3 to 5.
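As an illustration of Eqs. (1) and (3), the following minimal sketch evaluates the mixture density for a single pixel of a monochromatic sequence. The representation of each component as a (weight, mean, sigma) triple and the numerical values of the example model are assumptions made only for this sketch.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Eq. (1): one-dimensional Gaussian density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_pdf(x, components):
    """Eq. (3): weighted sum over (weight, mean, sigma) triples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

# Hypothetical pixel model with K = 3 components (weights sum to 1).
pixel_model = [(0.6, 100.0, 5.0), (0.3, 180.0, 10.0), (0.1, 30.0, 20.0)]
p_near = mixture_pdf(102.0, pixel_model)   # close to the dominant mean
p_far = mixture_pdf(250.0, pixel_model)    # far from every component
```

A gray value near the mean of a heavily weighted component yields a much higher density than a value that none of the components explains.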
The GMM algorithm can now be divided into two steps. First, for each new image of the video sequence, the existing model must be updated. Using the model, a current picture of the background is then formed; subsequently, the current image can be divided into foreground and background. To update the model, it is checked whether the current color value can be assigned to one of the existing K Gaussian functions.
A pixel is assigned to a Gaussian function k if

$$\lVert X_t - \mu_{k,t-1} \rVert < d \cdot \sigma_{k,t-1} \qquad (4)$$

where d denotes a user-defined parameter. This means that all color values that differ by less than $d \cdot \sigma_{k,t-1}$ from the mean are assigned to the k-th Gaussian function. Alternatively, the condition can be interpreted to mean that all color values lying within the area that corresponds to the probability $p_0$ are assigned to the Gaussian function:
$$\int_{\mu_{k,t-1} - d \cdot \sigma_{k,t-1}}^{\mu_{k,t-1} + d \cdot \sigma_{k,t-1}} \eta(X_t, \mu_{k,t-1}, \Sigma_{k,t-1})\, dX_t = p_0 \qquad (5)$$
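For the monochromatic case, the relationship between the parameter d and the probability p0 can be made concrete with a short sketch. It assumes a one-dimensional Gaussian, for which the mass within d standard deviations of the mean equals erf(d/√2); the value d = 2.5 is only an assumed example.

```python
import math

def match_probability(d):
    """Probability mass of a 1-D Gaussian within d standard deviations of its mean."""
    return math.erf(d / math.sqrt(2.0))

p0 = match_probability(2.5)  # d = 2.5 is an assumed example value
```

Under these assumptions, choosing d fixes the fraction of values generated by a component that will also be matched to it.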
If X can be assigned to a distribution, the model parameters are adjusted as follows:

$$\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha \qquad (6)$$

$$\mu_{k,t} = (1 - \rho_{k,t})\,\mu_{k,t-1} + \rho_{k,t}\,X_t \qquad (7)$$

$$\sigma_{k,t} = \sqrt{(1 - \rho_{k,t})\,\sigma_{k,t-1}^{2} + \rho_{k,t}\,(\lVert X_t - \mu_{k,t} \rVert)^{2}} \qquad (8)$$

where $\rho_{k,t} = \alpha / \omega_{k,t}$, following P. W. Power and J. A. Schoonees, “Understanding background mixture models for foreground segmentation,” in Proc. Image and Vision Computing, 2002, pp. 267-271. For the other distributions, to which X cannot be assigned, only the value of $\omega_{k,t}$ is calculated, according to Eq. (9):

$$\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} \qquad (9)$$

while the other parameters remain unchanged.
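The update rules of Eqs. (4) and (6) to (9) can be sketched for a single grayscale pixel as follows. The list-of-lists data layout, the default values for alpha and d, and the assumption that the components are already ordered by decreasing reliability (so that the first match is the most reliable one) are choices made only for this illustration.

```python
import math

def update_pixel_model(x, components, alpha=0.01, d=2.5):
    """Update a list of [weight, mean, sigma] components with gray value x.

    Returns the index of the matched component, or None if nothing matched.
    """
    matched = None
    for k, (w, mu, sigma) in enumerate(components):
        if matched is None and abs(x - mu) < d * sigma:     # Eq. (4)
            w = (1.0 - alpha) * w + alpha                     # Eq. (6)
            rho = alpha / w                                   # rho_{k,t}
            mu = (1.0 - rho) * mu + rho * x                   # Eq. (7)
            sigma = math.sqrt((1.0 - rho) * sigma ** 2
                              + rho * (x - mu) ** 2)          # Eq. (8)
            matched = k
        else:
            w = (1.0 - alpha) * w                             # Eq. (9)
        components[k] = [w, mu, sigma]
    return matched
```

If the function returns None, no component explains the value; replacing the least reliable component and rescaling the weights would then follow as a separate step that is not shown here.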
The Gaussian functions are sorted according to a confidence measure $\omega_{k,t}/\sigma_{k,t}$, so that the reliability decreases with increasing index k. If a pixel can be assigned to more than one Gaussian distribution, it is allocated to the one with the highest reliability. If the condition in Eq. (4) does not apply, so that the color value cannot be assigned to any of the Gaussian distributions, the least reliable Gaussian function is replaced by a new Gaussian distribution with the current pixel value as its mean. This new Gaussian function is initialized with a small probability of occurrence and a large standard deviation. Subsequently, all $\omega_{k,t}$ are scaled. A color value is more likely to be considered background (lower k) if it shows up frequently (high $\omega_k$) and does not change much (low $\sigma_k$). In order to determine the B distributions that model the background, a user-defined prior probability T is used as a threshold:
$$B = \underset{b}{\operatorname{argmin}} \left( \sum_{k=1}^{b} \omega_k > T \right) \qquad (10)$$
The remaining K − B distributions are assigned to the foreground.
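The selection of the background distributions according to Eq. (10) can be sketched as follows. The component triples and the threshold value are assumed example data, and the fallback for the case in which the threshold is never exceeded is an assumption of this sketch.

```python
def select_background(components, threshold):
    """Sort (weight, mean, sigma) triples by weight/sigma and return (sorted list, B)."""
    ranked = sorted(components, key=lambda c: c[0] / c[2], reverse=True)
    cumulative = 0.0
    for b, (weight, _mean, _sigma) in enumerate(ranked, start=1):
        cumulative += weight
        if cumulative > threshold:   # Eq. (10): smallest b whose weight sum exceeds T
            return ranked, b
    return ranked, len(ranked)       # assumed fallback: treat all as background

# Hypothetical pixel model and threshold T = 0.7.
model = [(0.1, 30.0, 20.0), (0.6, 100.0, 5.0), (0.3, 180.0, 10.0)]
ranked, B = select_background(model, threshold=0.7)
```

With these values, the first two ranked components model the background and the third is treated as foreground.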
The GMM algorithm for object detection (see 1) initially forms a model of the current background. By subtracting (see 2) the current background model from the current frame, changing image regions are detected. A binary mask BM, which contains the moving image regions, is then determined from the difference between the background and the current image by thresholding (see 3). Simple morphological operations (see 4) remove small deviations, often caused by noise and false detections, from the binary mask BM and thereby refine it. To determine contiguous object regions, the binary mask is subsequently subjected to a so-called connected component analysis (see 5).
When recognized regions appear in successive images, the object is considered to be reliably detected (see 6). Through a simple comparison of the detected objects with objects that are already being tracked, newly detected objects can be identified (see 7 and 7a: no new object tracking).
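The thresholding and connected component steps described above can be sketched in a few lines. This is a simplified illustration on nested lists: 4-connectivity is assumed, and the morphological cleanup is replaced here by a plain size filter applied after labeling, which is a stand-in and not the morphological operation itself. All names and values are hypothetical.

```python
from collections import deque

def binary_mask(frame, background, threshold):
    """Threshold the absolute difference between frame and background model."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def connected_components(mask):
    """Label 4-connected foreground regions; return one pixel list per region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                queue, region = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions

def remove_small_regions(regions, min_size):
    """Stand-in for morphological cleanup: drop regions below min_size pixels."""
    return [reg for reg in regions if len(reg) >= min_size]

# Assumed example: uniform background, one noise pixel, one 2x2 object.
background = [[10] * 5 for _ in range(4)]
frame = [row[:] for row in background]
frame[0][0] = 90
for y, x in ((2, 2), (2, 3), (3, 2), (3, 3)):
    frame[y][x] = 200
mask = binary_mask(frame, background, threshold=50)
objects = remove_small_regions(connected_components(mask), min_size=2)
```

In this example the isolated noise pixel is discarded and a single contiguous object region remains.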
If a new object is detected, a bounding box in the shape of a simple rectangle is determined around the object. Within the bounding box, an ellipse is in turn defined (see 8), whose size is determined by the size of the bounding box. Subsequently, a histogram of the typical object features (such as color) is formed on the basis of the pixels located within the ellipse. For histogram formation (see 9), an Epanechnikov filter kernel is used, which assigns a lower weight to the features of pixels at the edge of the ellipse. Thus, the influence of background pixels that can appear at the edge of the ellipse is reduced in the histogram.
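The kernel-weighted histogram formation can be illustrated with a minimal sketch for a grayscale patch. The kernel profile 1 − r² inside the ellipse (constant factor omitted), the choice of eight bins, and the example patch are assumptions made only for this illustration.

```python
def epanechnikov_weight(dx, dy, half_width, half_height):
    """Kernel profile 1 - r^2 inside the ellipse, 0 outside."""
    r2 = (dx / half_width) ** 2 + (dy / half_height) ** 2
    return 1.0 - r2 if r2 < 1.0 else 0.0

def target_model(patch, bins=8, max_value=256):
    """Kernel-weighted, normalized histogram of an integer-valued patch."""
    rows, cols = len(patch), len(patch[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    half_h, half_w = rows / 2.0, cols / 2.0
    hist = [0.0] * bins
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            weight = epanechnikov_weight(x - cx, y - cy, half_w, half_h)
            hist[value * bins // max_value] += weight
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

# Assumed example: bright 3x3 object (200) surrounded by dark background (20).
patch = [[20] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        patch[y][x] = 200
model = target_model(patch)
```

Because the border pixels receive low weights, the object bin carries a larger share of the histogram than its plain pixel count (9 of 25) would suggest.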
The weighted histogram of the object is also known as the so-called target model, since the aim of mean-shift tracking (see 10) is to find a histogram or model of the object in the next image that is as similar as possible. This target model is now used to initialize the traditional mean-shift tracking, which then tracks the object and provides the object position OP together with the video signal VS at the output of the camera K in the control room KR.
A method, a device and a computer program for detecting and/or tracking moving objects in a surveillance scene in which, besides the moving objects, interfering objects and/or interference areas may occur are known from DE 10 2007 041 893 A1 for video surveillance systems. Video surveillance systems typically comprise a plurality of surveillance cameras and are used to monitor public or commercial areas. In accordance with the subject matter disclosed in DE 10 2007 041 893 A1, this is done by an image-based method for detecting and/or tracking moving objects in a surveillance scene, which is preferably implemented by means of digital image processing. In this connection, detection comprises the initial recognition of the moving objects, and tracking comprises the re-recognition of the moving objects in subsequent images of the surveillance scene. The method is adapted to detect or track one or more moving objects. For this purpose, several regions are defined in the surveillance scene; they can have any desired shape, for example round, rectangular or square, and may be arranged without overlap or with overlap. Regions are defined as image sections of the surveillance scene that are preferably stationary over a monitoring period. The regions are divided into different region classes, including a first region class comprising sensitive regions in which no interferers and/or only negligible interferers are arranged and/or to be expected. The division into the region classes can be carried out, for example, manually by a user and/or automatically by a first, for example image-based, content analysis of the surveillance scene. In the sensitive regions, a sensitive content analysis, in particular video content analysis, is carried out for detecting and/or tracking moving objects. The sensitive content analysis includes, for example, the steps of forming or acquiring a scene reference image, segmenting objects, and detecting and/or tracking the segmented objects over time.
It is also proposed to use a second region class, into which semi-sensitive regions are classified and/or can be classified, wherein in particular stationary and/or permanent interferers are arranged and/or to be expected in the semi-sensitive regions. For the detection and/or tracking of moving objects in the semi-sensitive regions, a semi-sensitive content analysis is performed, which is restricted and/or modified with respect to the sensitive content analysis in terms of the image processing algorithms used. It is also proposed to supplement and/or replace insensitive regions by semi-sensitive regions, wherein at least a limited content analysis of the surveillance scene is carried out in the semi-sensitive regions. On the one hand, this limited content analysis can be implemented through the use of simplified image processing algorithms; on the other hand, information about moving objects that was obtained in the sensitive regions can be reused in the semi-sensitive regions, so that the detection and/or tracking of moving objects in the semi-sensitive regions is supported by information transfer. Although areas that are difficult to detect still remain in the video surveillance with this method, regions that form blind spots are excluded or at least minimized. In a preferred embodiment of that invention, regions can optionally be assigned to a third region class, which includes insensitive regions in which, for example, interferers are located, and no content analysis for detecting and/or tracking moving objects is carried out in these insensitive regions. In this preferred embodiment, the several regions of the surveillance scene are therefore divided into exactly three region classes, namely sensitive, semi-sensitive and insensitive regions.
In an extended embodiment of that invention, a plurality of semi-sensitive region classes is provided, wherein the different semi-sensitive region classes differ in the type of content analysis. In one possible implementation, moving objects are detected and/or tracked by determining a movement pattern that is unusual for the semi-sensitive region. An example of an unusual movement pattern occurs if an object in the semi-sensitive region moves against a general direction of movement in this semi-sensitive region. This occurs in practice, for example, if a person or a vehicle moves against a general walking or driving direction. Another example of an unusual movement pattern occurs if an object moves in the semi-sensitive region with a directed motion, whereas otherwise only undirected movements are detected in this semi-sensitive region. Preferably, the movement patterns are detected through analysis of the optical flow in the semi-sensitive regions. The optical flow denotes a vector field that specifies the 2D movement direction and speed of image points, pixels or areas of an image sequence. The device according to DE 10 2007 041 893 A1 comprises a classification module, which is designed to define regions in the surveillance scene and to divide the regions into different region classes. A first region class relates to sensitive regions in which no interferers and/or only negligible interferers are arranged and/or to be expected, and a second region class relates to semi-sensitive regions in which interferers are arranged and/or to be expected.
The device comprises at least a first and a second analysis module, wherein the first analysis module is adapted to carry out a sensitive content analysis for detecting and/or tracking moving objects in the sensitive regions, and the second analysis module is configured to carry out, in the semi-sensitive regions, a semi-sensitive content analysis that is limited and/or modified compared to the sensitive content analysis. The content analysis is designed in particular as video content analysis (VCA) and is preferably carried out via digital image processing.
Furthermore, DE 10 2008 006 709 A1 discloses video-based surveillance, in particular for the detection of a stationary object in a video-based surveillance system, wherein, for improved detection of stationary objects, the monitoring system comprises:
    an image sensing module for capturing a video recording that has an image area of interest;
    a motion detection module that is adapted to recognize the presence of a moving object in the image area of interest of the captured video recording; and
    a standstill detection module that is adapted to recognize the presence of a stationary object in the image area of interest and that is active if the motion detection module does not recognize a moving object in the image area of interest of a current video image of the captured video recording, wherein said standstill detection module further comprises:
        a pixel comparison module that is adapted to compare the pixel value of a pixel in the image area of interest of the current video image with the pixel value of a corresponding pixel in an immediately preceding video image, so as to determine the number of pixels in the image area of interest of the current video image whose pixel values match those of the corresponding pixels in the immediately preceding video image;
        a background identification module that is adapted to identify, based on a comparison of their pixel values with a background pixel value, those pixels in the image area of interest of the current video image that are part of a background; and
        a signal generating means for generating an output signal to indicate the detection of a stationary object if the number of matches between the current video image and the immediately preceding video image exceeds a threshold after those pixels of the current video image that have been identified as part of the background have been subtracted.
The monitoring method described in DE 10 2008 006 709 A1 includes the following steps:
    capturing a video recording that has an image area of interest;
    determining, based on a background subtraction, whether a moving object is present in the image area of interest of a current video image; and
    if no moving object is detected in the image area of interest of a current video image of the captured video recording based on said background subtraction, performing a test to determine whether a stationary object is present in the image area of interest, wherein the test comprises the following additional steps:
        comparing the pixel value of a pixel in the image area of interest of the current video image with the pixel value of a corresponding pixel in an immediately preceding video image, to determine the number of pixels in the image area of interest of the current video image whose pixel values correspond to those of the corresponding pixels in the immediately preceding video image;
        identifying, based on a comparison of their pixel values with a background pixel value, those pixels in the image area of interest of the current video image that are part of a background; and
        generating an output signal to indicate the detection of a stationary object if the number of matches between the current video image and the immediately preceding video image exceeds a threshold after those pixels of the current video image that were identified as part of the background have been subtracted.
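The test steps listed above can be sketched as follows. This is only one reading of the description: the image area of interest is passed as flat lists of gray values, a `tolerance` parameter for deciding when two pixel values match is introduced as an assumption, and all names are hypothetical rather than taken from DE 10 2008 006 709 A1.

```python
def stationary_object_detected(current, previous, background_value,
                               match_threshold, tolerance=0):
    """Return True if unchanged pixels, minus background pixels, exceed the threshold.

    current, previous: flat lists of pixel values for the same image area.
    """
    # Pixels whose values match between the two consecutive images.
    matches = sum(1 for cur, prev in zip(current, previous)
                  if abs(cur - prev) <= tolerance)
    # Pixels of the current image identified as background.
    background = sum(1 for cur in current
                     if abs(cur - background_value) <= tolerance)
    return (matches - background) > match_threshold

# Assumed example: background value 50, a resting object with value 120.
previous = [50, 50, 120, 120, 120, 50]
current = [50, 50, 120, 120, 120, 50]
detected = stationary_object_detected(current, previous, 50, match_threshold=2)
```

An area showing only unchanged background yields a difference of zero and therefore no detection, whereas unchanged non-background pixels raise the count above the threshold.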
The idea described in DE 10 2008 006 709 A1 is to provide a method by which the sustained detection of a stationary object is achieved with minimal processing power. The proposed method comes into play as soon as the stationary object can no longer be detected by the background subtraction owing to the inherent limitation of the background algorithm. In one embodiment, to improve response time, the standstill detection module is activated only if the motion detection module cannot detect a moving object in the image area of interest of a current video image of the captured video recording after a moving object has been detected in the image area of interest of the immediately preceding video image.
Furthermore, the background pixel value is computed by generating an image histogram of the image area of interest containing only the background and determining the pixel value that corresponds to a mode of the histogram. This feature offers the advantage that only a single background pixel value is needed to determine whether a pixel in the current video image is part of the background or of a stationary object. Said motion detection module includes a background subtraction algorithm based on the adaptive multiple Gaussian method. This method of background subtraction is particularly useful for multi-modal background distributions.
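Determining a single background pixel value as the mode of a histogram can be sketched in a few lines. The flat list of gray values and the function name are assumptions for illustration; ties between equally frequent values are resolved here by first occurrence.

```python
from collections import Counter

def background_pixel_value(background_region):
    """Return the most frequent pixel value (histogram mode) of a background-only region."""
    counts = Counter(background_region)
    value, _count = counts.most_common(1)[0]
    return value

# Assumed example: background-only region with slight sensor noise.
bg_value = background_pixel_value([50, 51, 50, 49, 50, 52])
```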
Finally, WO 2004/081875 A1 discloses a system and a method for tracking a global shape of a moving object, wherein one or more reference points along an initial contour of the global shape are defined, each of said reference points is tracked while the object is in motion, and the uncertainty of the location of a moving reference point is estimated. One form of representing the uncertainty is a covariance matrix. When a subspace shape constraint model is used, the uncertainty is exploited using a non-orthogonal projection and/or information fusion, and each subsequent contour is displayed. In the system known from WO 2004/081875 A1 for optically tracking the movement of a shape of an object, one or more first color vectors are generated to represent contraction of reference points along the contour of the shape, one or more second color vectors are generated to represent dilation of reference points along the contour of the shape, and the first and second color vectors are displayed periodically, thereby indicating movement of the shape.
As the above discussion of the prior art shows, a variety of tracking methods and further developments for object tracking, including tracking of the object shape, are known. In these methods, however, the orientation is calculated only within the image plane and thus only in two dimensions, so that the shape of the object cannot be adapted to the actual movement of the object in three-dimensional space.