Due to advances in video processing technology as well as the general increase in processing power available for a given cost and size, software is now available that is intended to examine live or recorded video and automatically recognize physical features in the video and determine the nature of objects appearing in the video, e.g., a car, an animal, a building, a human, etc. One well-publicized use of such technology is for automated recognition of individuals in video surveillance cameras by facial or other features. This technology, for instance, could be useful for automatically recognizing known terrorists or detecting abnormal or unusual activities and behaviors of people, vehicles and other objects of interest in airports and other public venues.
Another application of this technology is automatic target acquisition and surveillance in military operations.
The latest generation of automated video surveillance software has extended the technology beyond simply recognizing physical features to also interpreting temporal qualities associated with those physical features (i.e., from frame to frame of the video) in order to recognize patterns of behaviors, events, and activities as well.
Techniques for classifying an object in a video sequence rely on information that can readily be gathered from an image or a sequence of images (i.e., a sequence of frames of a digital video) such as color, color continuity, size (e.g., number of pixels), motion, direction of motion, speed of motion, shape, etc. Naturally, information as to the distance between the camera and the object, i.e., range, would be extremely useful in algorithms for classifying detected objects because it would help in determining certain parameters such as speed and size that would be much more difficult to determine without range information.
For instance, a particular object might be identified by its contrast in hue relative to the background. Such an object may consume 25 pixels of the image and therefore have a size of 25 pixels. However, the sensed feature of the number of pixels occupied by the object provides essentially no information as to the actual physical size of the object unless the range to the object is known. For instance, an object that occupies 25 pixels within an image may correspond to the size of a car if the car is 100 meters from the camera. However, 25 pixels might also correspond to the size of a cat if the cat is 15 meters from the camera. A similar problem exists with respect to estimating the speed of an object. Obviously, a distant object moves more slowly through an image than an object moving at the same speed but closer to the camera.
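The dependence of physical size and speed on range can be illustrated with a minimal sketch under a simple pinhole camera model. The function names and the focal length expressed in pixels (`focal_length_px`) are hypothetical and introduced only for illustration, and the relations are assumed to hold only for extents and motion roughly perpendicular to the camera's line of sight.

```python
def physical_extent_m(extent_px: float, range_m: float, focal_length_px: float) -> float:
    """Convert an extent measured in pixels to meters (pinhole model, assumed).

    An object spanning extent_px pixels at a distance of range_m meters has a
    physical extent of extent_px * range_m / focal_length_px meters.
    """
    if focal_length_px <= 0:
        raise ValueError("focal_length_px must be positive")
    return extent_px * range_m / focal_length_px


def physical_speed_mps(pixels_per_frame: float, frames_per_second: float,
                       range_m: float, focal_length_px: float) -> float:
    """Convert an image-plane speed in pixels/frame to meters/second.

    Assumes the object moves sideways (perpendicular to the line of sight).
    """
    pixels_per_second = pixels_per_frame * frames_per_second
    return physical_extent_m(pixels_per_second, range_m, focal_length_px)


# The same 5-pixel-wide blob maps to very different physical sizes
# depending on range (focal length of 100 pixels is an illustrative assumption).
width_at_100_m = physical_extent_m(5, 100.0, 100.0)
width_at_15_m = physical_extent_m(5, 15.0, 100.0)
```

With the assumed focal length of 100 pixels, the 5-pixel-wide object corresponds to 5 meters at a range of 100 meters but only 0.75 meters at a range of 15 meters, which is the ambiguity described above.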
Accordingly, some techniques have been developed for helping to determine or estimate the range of objects under surveillance. Such techniques include laser ranging, in which a laser range-finder is mounted very close to the camera to bounce a light beam off of objects in the surveillance area and measure the round trip delay in order to physically measure the distance between the camera and the objects. Another known technique for determining range is stereo imaging. In stereo imaging, two (or more) cameras observe the same surveillance volume from slightly different perspectives. The two (or more) simultaneous images of the same volume obtained from the cameras can be compared to each other and the range to the various objects in the images can be determined by triangulation.
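The two range computations described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the laser calculation assumes a simple time-of-flight measurement, and the stereo calculation assumes two rectified, parallel cameras with a known baseline and a focal length expressed in pixels.

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0  # speed of light in vacuum, meters/second


def laser_range_m(round_trip_delay_s: float) -> float:
    """Range from a laser round-trip delay: the beam travels out and back,
    so the one-way distance is half of (speed of light * delay)."""
    if round_trip_delay_s < 0:
        raise ValueError("round_trip_delay_s must be non-negative")
    return SPEED_OF_LIGHT_MPS * round_trip_delay_s / 2.0


def stereo_range_m(focal_length_px: float, baseline_m: float,
                   disparity_px: float) -> float:
    """Range by triangulation for rectified, parallel cameras (assumed):
    range = focal length * baseline / disparity, where disparity is the
    horizontal pixel offset of the same point between the two images."""
    if disparity_px <= 0:
        raise ValueError("disparity_px must be positive")
    return focal_length_px * baseline_m / disparity_px
```

For example, under these assumptions a disparity of 5 pixels between cameras 0.5 meters apart, each with a focal length of 1000 pixels, corresponds to a range of 100 meters. Finding which pixels correspond between the two images is a separate step that this sketch does not address.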
A third technique, called passive ranging, can be used in connection with objects that are moving through the area under surveillance. Particularly, if the speed of a moving object is known or estimated, then its range can be estimated from the number of pixels by which it moves over a sequence of frames. For instance, if an object is moving perfectly sideways through the surveillance area and its speed is known, then the range can be calculated from the number of pixels it moves over a known time period (i.e., a known number of frames). Likewise, if an object is moving straight towards or straight away from the camera over a sequence of frames, its speed can be estimated by its change in size, particularly if the size of the object is known or estimated. Through more complex algorithms, it may be possible to accurately estimate range even with respect to objects whose size is not known and/or that have oblique motion through the surveillance area. However, generally, such factors as oblique motion and lack of knowledge as to size make the estimate much less reliable.
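The simplest case described above, an object moving perfectly sideways at a known speed, can be sketched as follows. The function name and parameters are hypothetical, and a pinhole camera with a focal length expressed in pixels is assumed; oblique motion or an uncertain speed would make the estimate correspondingly less reliable.

```python
def passive_range_m(speed_mps: float, displacement_px: float,
                    num_frames: int, frames_per_second: float,
                    focal_length_px: float) -> float:
    """Estimate range from known lateral speed and observed pixel displacement.

    Over num_frames frames the object travels speed_mps * elapsed_time meters.
    Under the assumed pinhole model that distance projects to
    displacement_px = focal_length_px * distance_m / range_m pixels,
    so range_m = focal_length_px * distance_m / displacement_px.
    """
    if displacement_px <= 0:
        raise ValueError("displacement_px must be positive")
    if frames_per_second <= 0 or num_frames <= 0:
        raise ValueError("num_frames and frames_per_second must be positive")
    elapsed_s = num_frames / frames_per_second
    distance_m = speed_mps * elapsed_s
    return focal_length_px * distance_m / displacement_px
```

For instance, with an assumed focal length of 1000 pixels, an object known to move at 10 meters/second that shifts 100 pixels over 30 frames of 30 frames/second video would be estimated to be 100 meters away.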
Accordingly, passive ranging generally is much less reliable and less accurate than laser ranging and triangulation in stereo imaging. Further, it can only be applied to moving objects (and then not particularly accurately if the speed, size, and/or direction of motion is not well known or predicted). A variation of this method applies when the camera itself is moving at a known speed, such as when it is mounted on a vehicle (car, aircraft, boat, etc.). The optical flow can then be calculated and the range to any point in the scene can be estimated by passive ranging.
Laser ranging and triangulation in stereo imaging, while relatively accurate and reliable, are expensive. In laser ranging, a laser range finder must be supplied for every camera. In stereo imaging, there must be twice as many cameras. In addition, establishing stereo correspondence, for example, through dynamic programming, requires intensive computation. Furthermore, retrofitting pre-existing video surveillance systems for laser ranging or stereo imaging is extremely labor-intensive.
Techniques for classifying objects in video can generally be characterized as falling into one of two types, namely: (1) sensed-feature-based classification and (2) physical-feature-based classification. Sensed-feature-based classification is based strictly on information that uses the pixel as the measurement unit, e.g., pixels/frame, in an image or series of images, whereas physical-feature-based classification is based on information measured in physical standard or metric units, such as known or estimated speed, size or range in, for example, feet/sec, square meters, or meters, respectively. Thus, for instance, a sensed-feature-based classification algorithm might indicate that the size of an object in the image is 25 pixels, whereas a physical-feature-based classification algorithm would indicate that the object is 2 feet tall.
It is an object of the present invention to provide a new and improved technique for estimating ranges to pixels and identifying and/or classifying objects of interest in video surveillance.