Automatically identifying the locations of objects and their parts in video is important for many tasks. For example, automatically identifying the locations of human body parts is important for tasks such as automated action recognition and human pose estimation. Body parsing is a term used to describe the computerized localization of individual body parts in video. Current methods for body parsing in video estimate only the locations of parts such as the head, legs, and arms. See, e.g., “Strike a Pose: Tracking People by Finding Stylized Poses,” Ramanan et al., Computer Vision and Pattern Recognition (CVPR), San Diego, Calif., June 2005; and “Pictorial Structures for Object Recognition,” Felzenszwalb et al., International Journal of Computer Vision (IJCV), January 2005.
Most previous methods in fact perform only syntactic object parsing, i.e., they estimate only the locations of object parts (e.g., arms, legs, face, etc.) without efficiently estimating semantic attributes associated with the object parts.
In view of the foregoing, there is a need for a method and system for effectively identifying semantic attributes of objects from images.