A. Field of the Invention
The invention is a novel method for recognition of human motions. The invention also applies this method to vector sequence recognition and to speech recognition.
B. Description of Prior Art
Recognition of human motion, and especially recognition of detailed human activities, is a relatively new research area with very few published works. A paper by Ben-Arie, J., Wang, Z., Pandit, P. and Rajaram, S., “Human Activity Recognition Using Multidimensional Indexing,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 24, No. 8, pp. 1091-1105, August 2002, is among these few. An additional excellent reference, which presents an intensive survey of the different methodologies for visual analysis of human movement, is D. M. Gavrila, “The visual analysis of human movement: A survey,” Computer Vision and Image Understanding, 73(1):82-98, 1999. Gavrila groups the methodologies into three approaches: 2D approaches without explicit shape models, 2D approaches with explicit shape models, and 3D approaches. The first approach describes human movement in terms of simple low-level 2D features instead of recovering the pose. The second approach, which is view-based, uses explicit shape models to segment, track and label body parts. The third approach attempts to recover the 3D poses over time. More recently, a survey by T. B. Moeslund and E. Granum, “A survey of computer vision-based human motion capture,” Computer Vision and Image Understanding, 81(3):231-268, March 2001, described various computer vision-based human motion capture methods. They elaborate on the categories of human motion capture, namely Initialization, Tracking, Pose Estimation and Recognition. Human motion recognition is classified into static and dynamic recognition. Static recognition uses spatial data, one frame at a time, whereas dynamic recognition uses the temporal characteristics of the action. Our method of Recognition Indexing & Sequencing (RISq) is based on a novel approach that differs from all the methods surveyed above.
Unlike our method, which can classify many different activities, past works focused on recognition of only a few activity classes. H. Fujiyoshi and Alan J. Lipton in “Real-time human motion analysis by image skeletonization,” Proc. of the Workshop on Application of Computer Vision, October 1998, use skeletonization to extract internal human motion features and to classify human motion into “running” or “walking” based on frequency analysis of the motion features.
M-H. Yang and N. Ahuja in “Recognizing hand gesture using motion trajectories,” IEEE Conference on Computer Vision and Pattern Recognition, pages 466-472, June 1999, apply a Time-Delay Neural Network (TDNN) to hand gesture recognition and achieve a quite high recognition rate with a method akin to Dynamic Time Warping (DTW). DTW was also used by works such as Sakoe H., Chiba S., “Dynamic Programming Optimization for Spoken Word Recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1):43-49, 1978. J. Schlenzig, E. Hunter, and R. Jain in “Vision based hand gesture interpretation using recursive estimation,” Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers, 1994, use a Hidden Markov Model (HMM) and a rotation-invariant imaging representation to recognize visual gestures such as “hello” and “good-bye”. HMMs are used by J. Yamato, J. Ohya, and K. Ishii in “Recognizing human action in time-sequential images using hidden Markov model,” in Proceedings Conference on Computer Vision and Pattern Recognition, pages 379-385, June 1992, for recognizing human action in time-sequential images. HMMs were also utilized by Starner and Pentland to recognize American Sign Language (ASL). Darrell and Pentland applied dynamic time warping to model correlation for recognizing hand gestures from video. R. Polana and R. Nelson in “Recognizing activities,” Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 815-818, 1994, use template matching techniques to recognize human activity.
Motion Energy Images are used for recognition by A. F. Bobick and J. W. Davis in “An appearance based representation of action,” in Proc. of Thirteenth International Conference on Pattern Recognition, August 1996. I. Haritaoglu, D. Harwood, and L. S. Davis in “W4: Real-time surveillance of people and their activities,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809-830, August 2000, implemented a system for human tracking and activity recognition, in which the activity recognition part is mainly based on analysis of the projected histograms of detected human silhouettes. This system classifies human poses in each frame into one of four main poses (standing, sitting, crawling/bending, lying) and one of three view-based appearances (front/back, left-side and right-side), and activities are monitored by checking the pose changes over time.
In another work, Y. A. Ivanov and A. F. Bobick in “Recognition of visual activities and interactions by stochastic parsing,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852-871, August 2000, recognize generic activities using HMMs and stochastic parsing. These activities are first detected as a stream of low-level action primitives represented using HMMs and are then recognized by parsing the stream of primitive representations using a context-free grammar. A. F. Bobick and J. W. Davis in “The recognition of human movement using temporal templates,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), March 2001, recognized human activity by matching temporal templates against stored instances of views of known actions.
More recently, A. Galata, N. Johnson, and D. Hogg in “Learning variable-length Markov models of behavior,” in Computer Vision and Image Understanding, 81(3):398-413, March 2001, use Variable Length Markov Models (VLMMs) for modeling human behavior. They use VLMMs because of their more powerful encoding of temporal dependencies. Our indexing-based RISq recognition approach differs from all the above-mentioned works, since it determines the best matching activity in a single indexing operation, is invariant to the activity's speed, and requires only a few very sparse samples of the activity for complete recognition.
String matching techniques that recognize an input string as similar to model strings stored in a database are also described in several patents, such as U.S. Pat. No. 5,577,249 and U.S. Patent Application number 20020181711. However, their methods differ significantly from ours. U.S. Pat. No. 5,577,249 employs random partitioning for the recognition, and U.S. Patent Application number 20020181711 finds similarity by generating K-means cluster signatures.
Other patents that relate to human motion recognition are mostly adapted to recognition of simple actions, such as hand gestures used as an input to a computer, and not to recognition of articulated human activity. Their recognition methods also differ substantially from our method. Such are U.S. Pat. Nos. 6,256,033, 5,581,276, 5,714,698 and 6,222,465. U.S. patents and applications that relate to indexing and speech recognition, such as U.S. Pat. Nos. 5,386,492, 5,502,774, 5,621,809, 6,292,779, 6,542,869, 6,371,711, and U.S. patent applications 20020052742 and 20020062302, apply methods that are different from our invention.
The prior art fails to provide satisfactory solutions to a cardinal problem in the recognition of human motion: human motion exhibits noticeable variations in speed that occur all the time. The speed varies from one person to the next and even within the performance of the same person. All the prior methods sample both the input and the model motions at an equally fast rate. Prior recognition algorithms take into account the temporal properties of the motion, since they rely on temporal correlation as the basic principle of matching, whether the algorithm is continuous or discrete. The only prior method that tried to cope with speed variations was Dynamic Time Warping (DTW), a dynamic programming technique used to create a nonlinear warping function between the input time axis and the model time axis. DTW is substantially different from our method: it requires a lot of computations that do not always converge, and it is still quite sensitive to large variations of speed. In contrast, our invention proposes a method that entirely eliminates the time factor from the recognition and replaces it with sequencing. Another innovation of our approach, namely the use of substantially different rates of sampling, enables recognition of human motions from just a very few samples. We verified this result experimentally. Similar advantages are expected from the application of the method to speech recognition.
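For illustration only, the prior-art DTW computation referred to above can be sketched as follows. This is a minimal textbook formulation and not the method of the invention. The function name dtw, the Euclidean local distance, and the sample sequences are assumptions introduced for this example and are not taken from the cited works.

```python
import math


def dtw(input_seq, model_seq):
    """Textbook DTW between two sequences of equal-dimension vectors.

    Returns the accumulated distance and the warping path, a list of
    (input_index, model_index) pairs. Illustrative sketch only.
    """
    n, m = len(input_seq), len(model_seq)
    # cost[i][j] is the best accumulated distance aligning the first i
    # input samples with the first j model samples.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(input_seq[i - 1], model_seq[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # input advances alone
                                 cost[i][j - 1],      # model advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the nonlinear warping function.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical data: the same one-dimensional motion at two speeds.
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]
distance, path = dtw(slow, fast)
```

In this sketch the table has one entry per pair of input and model samples, so the amount of computation grows with the product of the two sequence lengths, and the whole table must be recomputed for every model sequence against which the input is compared.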