To detect events in videos, it is necessary to interpret “semantically meaningful object actions,” A. Ekinci and A. M. Tekalp, “Generic event detection in sports video using cinematic features,” Proc. IEEE Workshop on Detection and Recognizing Events in Video, 2003. To perform ‘action’ or event detection, the gap between numerical features of objects and symbolic descriptions of meaningful activities needs to be bridged. Prior art event detection methods generally extract trajectories of features from a video and then apply supervised learning.
For example, one method is based on view-dependent template matching, J. Davis and A. Bobick, “Representation and recognition of human movement using temporal templates,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997. There, an action is represented by a temporal template, which is determined from accumulated motion properties at each pixel in a video.
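As an illustration only, the accumulation of motion at each pixel can be sketched with a motion-history update in which a pixel that moved recently holds a high value that decays over time. The function name, the use of simple frame differencing to decide whether a pixel moved, and the parameters tau and threshold are assumptions made for this sketch and are not taken from the cited work.

```python
def update_motion_history(history, prev_frame, frame, tau, threshold):
    """Return an updated motion-history image (list of rows).

    A pixel whose intensity changed by more than `threshold` between
    `prev_frame` and `frame` is set to `tau`; every other pixel decays
    by one step, never dropping below zero.
    """
    updated = []
    for h_row, p_row, f_row in zip(history, prev_frame, frame):
        updated.append([
            tau if abs(f - p) > threshold else max(0, h - 1)
            for h, p, f in zip(h_row, p_row, f_row)
        ])
    return updated


# Hypothetical 1x2 grayscale frames: the left pixel moves once, then stays still.
history = [[0, 0]]
history = update_motion_history(history, [[0, 0]], [[10, 0]], tau=3, threshold=5)
history = update_motion_history(history, [[10, 0]], [[10, 0]], tau=3, threshold=5)
```

In such a scheme, the resulting image would be compared against stored templates for each view, which is why the approach is view-dependent.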
Another method detects simple periodic events, e.g., walking, by constructing dynamic models of periodic patterns of human movements, L. Davis, R. Chellappa, A. Rosenfeld, D. Harwood, I. Haritaoglu, and R. Cutler, “Visual Surveillance and Monitoring,” Proc. DARPA Image Understanding Workshop, pp. 73-76, 1998.
Distributions of object trajectories can also be clustered, N. Johnson and D. Hogg, “Learning the distribution of object trajectories for event recognition,” Proc. British Machine Vision Conference, pp. 583-592, 1995. A hierarchy of similar distributions of activity can also be estimated using co-occurrence feature clustering, C. Stauffer and W. E. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8), pp. 747-757, 2000.
Events can be defined as temporal stochastic processes to provide a segmentation of a video, L. Zelnik-Manor and M. Irani, “Event-Based Video Analysis,” IEEE Conf. Computer Vision and Pattern Recognition, December 2001. Their dissimilarity measure is based on a sum of χ² divergences of empirical distributions, which requires off-line training, and the number of clusters is preset.
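For illustration, a dissimilarity of this kind can be sketched as follows. The sketch assumes one common symmetric form of the χ² divergence between two normalized histograms, and assumes that each video segment is described by a list of histograms, one per feature. The function names and data layout are hypothetical and are not taken from the cited work.

```python
def chi2_divergence(p, q):
    """Symmetric chi-square divergence between two histograms.

    Both histograms are normalized to sum to one before comparison.
    Bins that are empty in both histograms contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    divergence = 0.0
    for a, b in zip(p, q):
        a, b = a / total_p, b / total_q
        if a + b > 0:
            divergence += (a - b) ** 2 / (a + b)
    return divergence


def segment_dissimilarity(histograms_a, histograms_b):
    """Sum of per-feature chi-square divergences between two segments."""
    return sum(chi2_divergence(p, q)
               for p, q in zip(histograms_a, histograms_b))
```

With this form, identical distributions yield zero and distributions with no overlapping bins yield the maximum value of two per feature.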
A hidden Markov model (HMM) can represent a simple event and recognize the event by determining the probability that the model produces a visual observation sequence, T. Starner and A. Pentland, “Visual recognition of American sign language using hidden Markov models,” Proc. Int'l Workshop Automatic Face- and Gesture-Recognition, 1995.
An HMM can also be used for detecting intruders, V. Kettnaker, “Time-dependent HMMs for visual intrusion detection,” Proc. IEEE Workshop on Detection and Recognizing Events in Video, 2003.
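The probability that a model produces an observation sequence is conventionally computed with the forward algorithm. The following is a minimal sketch for a discrete HMM, given as an illustration of that standard technique rather than of the cited implementation. The function name, the argument layout, and the numerical values in the usage lines are assumptions.

```python
def sequence_likelihood(initial, transition, emission, observations):
    """Forward algorithm: probability that a discrete HMM emits `observations`.

    initial[s]        probability of starting in state s
    transition[r][s]  probability of moving from state r to state s
    emission[s][o]    probability of emitting symbol o while in state s
    observations      non-empty list of symbol indices
    """
    if not observations:
        raise ValueError("observation sequence must not be empty")
    n = len(initial)
    # alpha[s] is the joint probability of the symbols seen so far and state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n)) * emission[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state model with two observation symbols.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
likelihood = sequence_likelihood(initial, transition, emission, [0, 1])
```

Recognition then amounts to evaluating the sequence under each trained event model and selecting the model with the highest likelihood, which presupposes that the models were trained beforehand.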
Prior art HMM-based methods generally require off-line training with known events before the events themselves can be detected. However, not every possible event can be foreseen. Furthermore, the same event can appear differently in different applications. Thus, modeling and detecting events is a difficult problem.
A number of other event detection methods are known: A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” Proc. of Neural Information Processing Systems, 2001; M. Meila and J. Shi, “Learning Segmentation by Random Walks,” Proc. Advances in Neural Information Processing Systems, 2000; Z. Marx, I. Dagan, and J. Buhmann, “Coupled Clustering: a Method for Detecting Structural Correspondence,” Proc. International Conference on Machine Learning, pp. 353-360, 2001; S. Kamvar, D. Klein, and C. Manning, “Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based Approach,” Proc. ICML, 2002; and M. Fiedler, “A property of eigenvectors of non-negative symmetric matrices and its application to graph theory,” Czechoslovak Mathematical Journal, 25: pp. 619-672, 1975.
However, those methods address different issues. For instance, Ng et al. use K-means clustering. They do not consider a relation between an optimal number of clusters and a number of largest eigenvectors. Meila et al. extend the method of Ng et al. to a generalized eigenvalue representation. Although they use multiple eigenvectors, the number of eigenvectors is fixed. Kamvar et al. require supervisory information, which is not always available. Marx et al. use coupled clustering with a fixed number of clusters. A major disadvantage of these methods is that they are all limited to trajectories of equal duration because they depend on correspondences between coordinates.
The extraction of trajectories of objects from videos is well known. However, very little work has been done on investigating secondary outputs of a tracker. One method uses eight constant features, which include height, width, speed, motion direction, and the distance to a reference object, G. Medioni, I. Cohen, F. Bremond, S. Hongeng, and R. Nevatia, “Event detection and analysis from video streams,” IEEE Trans. on PAMI, 23(8), 873-889, 2001. Visual features can also be considered, see Zelnik-Manor et al. and Stauffer et al. Zelnik-Manor et al. use spatiotemporal intensity gradients at different temporal scales. Stauffer et al. use co-occurrence statistics of coordinate, speed, and size. However, prior art trajectory-based features are insufficiently expressive to detect many events.
Therefore, it is desired to provide more expressive features, which can be used to detect events that normally cannot be detected using conventional features. Furthermore, it is desired to provide an event detection method that uses unsupervised learning.