Driven by recent advances in digital video camera design, video management systems, and efficient archiving, security at large publicly accessible facilities and urban sites increasingly deploys comprehensive video surveillance systems. Such systems support both the ability to monitor wide areas in real time and the ability to conduct forensic reviews after an incident or a tip. Most analysis of surveillance video for security purposes requires one or more users (e.g., operators or investigators) to search for particular types of video content (e.g., specific actions or behaviors, or vehicles, objects, or persons fitting a given description). While the fully attentive human visual system is very adept at interpreting video content, it is typically limited to reviewing one camera view at a time at speeds near real time, depending on the level of activity in the scene. As a result, searching over large amounts of surveillance video can be a slow and cumbersome process. In addition, human attention does not persist at full capacity over long time periods. These limitations motivate automated video search technology, which can direct the attention of security personnel or investigators to potentially useful video content.
Research into video-based automated searching for persons often focuses on biometric recognition using signatures that can be captured at a distance, such as gait or, especially, the face. There have been several recent attempts to incorporate appearance descriptions of persons into video search techniques, including so-called “soft biometrics.” Some techniques localize persons and frontal faces in video using standard Haar feature cascade classifiers, and focus primarily on the characterization of facial attributes for images captured at relatively close range to the camera (e.g., detection of facial hair and glasses, or classification of gender and ethnicity based on facial features).
However, face recognition systems can only be employed when surveillance cameras capture face images at sufficient resolution and illumination. In fact, experimental studies indicate that face recognition performance begins to degrade at (compressed) image resolutions as high as 90 pixels between the eyes, which is much greater resolution than typically provided by surveillance video, unless cameras are set up with narrow fields of view (e.g., monitoring specific doorways at close range). For video cameras covering large areas and at a distance from subjects (on the order of tens of meters), analysis of faces may be unreliable. Further, biometric recognition often requires a database of prior enrollment records against which to perform identification, which would not necessarily be available when security personnel or investigators are working with an eyewitness description of a person of interest. In addition, traditional tracking systems attempt to estimate a person's position frame by frame, making these systems susceptible to failure in crowded environments due to the many sources of occlusion and clutter (e.g., nearby people in a crowd).
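The resolution constraint described above can be illustrated with a rough calculation. The following is a minimal sketch only, assuming an idealized pinhole camera, a frontal face centered in view, and an adult inter-eye spacing of about 63 mm. The function name `pixels_between_eyes`, the 1920-pixel image width, and the two camera configurations are hypothetical values chosen for illustration and are not taken from the studies mentioned above.

```python
import math


def pixels_between_eyes(h_res_px, h_fov_deg, distance_m, eye_spacing_m=0.063):
    """Approximate the number of pixels spanning the inter-eye distance.

    Assumes an ideal pinhole camera with no lens distortion and a subject
    facing the camera near the image center. eye_spacing_m is an assumed
    typical adult value, not a measured one.
    """
    # Width of the scene covered by the image at the subject's distance.
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(h_fov_deg) / 2.0)
    # Horizontal pixel density at that distance, in pixels per meter.
    px_per_m = h_res_px / scene_width_m
    return px_per_m * eye_spacing_m


# Hypothetical wide-area camera: 90 degree field of view, subject 20 m away.
wide = pixels_between_eyes(1920, 90.0, 20.0)    # about 3 pixels

# Hypothetical doorway camera: 10 degree field of view, subject 3 m away.
narrow = pixels_between_eyes(1920, 10.0, 3.0)   # about 230 pixels

print(f"wide-area view: {wide:.1f} px between the eyes")
print(f"doorway view:   {narrow:.1f} px between the eyes")
```

Under these assumptions, the wide-area configuration falls far below the 90-pixel level while the narrow doorway configuration exceeds it, consistent with the observation that face analysis is practical mainly for cameras set up with narrow fields of view at close range.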