Traffic counts, speeds and vehicle classification are fundamental parameters for a variety of transportation projects, ranging from transportation planning to modern intelligent transportation systems. Most intelligent transportation systems are built from readily available sensing and communication technology, such as the inductive loop detector. Other sensing technologies include radar, infrared, lasers, ultrasonic sensors and magnetometers.
Among the many technologies, vision-based systems are emerging as an attractive alternative due to their ease of installation, inexpensive maintenance, and ability to capture a rich description of the scene. In principle, video provides not only aggregate information such as average speed, vehicle counts, and queue lengths, but also individual parameters such as trajectories, individual speeds, and classification.
Existing vision systems typically place cameras high above the ground, anywhere from 15 m to 100 m, to provide a bird's eye view of the road. At such a high vantage point, the appearance of a vehicle does not change significantly over time and occlusion between vehicles is considerably reduced, both of which simplify the problem. However, placing cameras at such heights is not always possible. In non-urban areas the required infrastructure is cost-prohibitive, and for transient traffic studies, expensive mounting equipment and strategic camera placement are precluded by a lack of long-term commitment.
The accuracy of vision systems is compromised if the cameras are mounted too low or have poor perspective views of traffic. When the camera is high above the ground and near the center of the road, a homography can be defined to map the road surface to the image plane, and the height of vehicles can be safely ignored because their appearance does not change significantly over time. In contrast, when the camera is at a low angle and/or off-center from the road, vehicle height causes significant occlusion. A single homography (under the flat-world assumption) may not suffice, because feature points high on a vehicle, when projected onto the road plane, may spill over into neighboring lanes.
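The road-plane homography can be illustrated with a minimal sketch. It assumes four hand-picked correspondences between image pixels and road coordinates in meters (the point values and the function names `estimate_homography` and `apply_homography` are hypothetical) and uses the standard direct linear transform (DLT) to solve for the 3x3 matrix. The mapping is valid only for points that actually lie on the road plane; a feature on a vehicle's roof would be projected to the wrong road location, which is the spillover effect described in the paragraph above.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform (DLT): find H with dst ~ H @ src from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map 2D points through H using homogeneous coordinates, then dehomogenize."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical example: corners of a lane region in the image (pixels)
# and the same corners on the road (meters: lane width x distance along road).
image_pts = [(100, 400), (540, 400), (380, 200), (260, 200)]
road_pts = [(0, 0), (3.6, 0), (3.6, 30), (0, 30)]
H = estimate_homography(image_pts, road_pts)
print(apply_homography(H, image_pts))
```

A production implementation would typically normalize the coordinates first (Hartley normalization) or call an existing routine such as OpenCV's `cv2.findHomography`.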
Various approaches to vision-based tracking systems have been proposed. These approaches include, for example, blob tracking, active contour tracking, 3D model-based tracking, color and pattern-based tracking, and tracking using point features (feature tracking). A feature tracking approach is described in Beymer et al. (D. Beymer, P. McLauchlan, B. Coifman, and J. Malik, “A real time computer vision system for measuring traffic parameters,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 495-501). Beymer et al. describe a system that tracks features throughout the video sequence, then groups the features according to motion cues in order to segment the vehicles. Because the camera is high above the ground, a single homography is sufficient to map the image coordinates of the features to the road plane, where the distances between pairs of features and their velocities are compared. For proper grouping, the features need to be tracked over the entire detection zone, which is often not possible when the camera is not looking top-down, due to significant scale changes and occlusions. In another approach, Saunier et al. (N. Saunier and T. Syed, “A feature-based tracking algorithm for vehicles in intersections,” in Proceedings of the 3rd Canadian Conference on Computer and Robot Vision, 2006) use feature points to track vehicles through short-term occlusions, such as poles or trees. The above approaches have difficulty initializing and tracking partially occluded vehicles. Moreover, these approaches apply to cameras that are mounted relatively high above the ground. At such heights, the problems of occlusion and vehicle overlap are mitigated, making feature tracking easier.
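The motion-based grouping step can be sketched as follows. This is a simplified illustration, not the exact procedure of Beymer et al.: it assumes the features have already been mapped to the road plane (positions in meters, velocities in meters per second), links two features when they are both close together and moving at nearly the same velocity, and labels the resulting connected components with a union-find structure. The function name `group_features` and the thresholds `max_dist` and `max_dvel` are hypothetical.

```python
import numpy as np

def group_features(positions, velocities, max_dist=3.0, max_dvel=0.5):
    """Assign a vehicle label to each road-plane feature (hypothetical thresholds).

    Two features are linked if they are closer than max_dist meters AND their
    velocity difference is below max_dvel m/s; groups are connected components.
    """
    pos = np.asarray(positions, dtype=float)
    vel = np.asarray(velocities, dtype=float)
    n = len(pos)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(pos[i] - pos[j]) < max_dist
            same_motion = np.linalg.norm(vel[i] - vel[j]) < max_dvel
            if close and same_motion:
                parent[find(i)] = find(j)  # merge the two components

    # Relabel component roots as consecutive integers 0, 1, 2, ...
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

# Hypothetical example: two pairs of features on two vehicles moving at different speeds.
print(group_features([(0, 0), (1, 0), (20, 0), (21, 0.5)],
                     [(0, 15), (0, 15.1), (0, 10), (0, 10)]))
```

With a low camera, features high on a vehicle are mapped to wrong road-plane positions, so a distance test like this one can merge or split vehicles incorrectly.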
A method for segmenting and tracking vehicles in low-angle frontal sequences has been proposed by Kamijo et al. (S. Kamijo, Y. Matsushita, K. Ikeuchi and M. Sakauchi, “Traffic monitoring and accident detection at intersections,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, pp. 108-118, June 2000). Under this approach, the image is divided into pixel blocks, and a spatiotemporal Markov random field is used to update an object map using the current and previous images. One drawback of this approach is that it does not yield 3D information about vehicle trajectories in the world coordinate system. In addition, to achieve accurate results, the images in the sequence are processed in reverse order so that vehicles recede from the camera. Accuracy decreases by a factor of two when the sequence is not processed in reverse, making the approach unsuitable for online processing when time-critical results are required.
In two previous publications, Kanhere I (N. K. Kanhere, S. J. Pundlik, and S. T. Birchfield, “Vehicle segmentation and tracking from a low-angle off-axis camera,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2005, pp. 1152-1157) and Kanhere II (N. K. Kanhere, S. T. Birchfield, and W. A. Sarasua, “Vehicle segmentation and tracking in the presence of occlusions,” in TRB Annual Meeting Compendium of Papers, Transportation Research Board Annual Meeting, January 2006), the applicants presented a method for visually monitoring traffic when the camera is relatively low to the ground and on the side of the road, where occlusion and perspective effects due to vehicle heights cannot be ignored. Under this previous approach, stable features were detected and tracked throughout an image sequence and then grouped together using a multilevel homography, which extends the standard homography to the low-angle situation. Using a concept known as the relative height constraint, the 3D heights of feature points on vehicles, in the world coordinate system, were estimated from a single camera. The method discussed in these two publications required computationally intensive batch processing of image frames and could not process images in real time. Moreover, the processes discussed in these previous publications could not perform vehicle classification (e.g., car, truck).
Thus, a need exists for a system and method for vision-based tracking of vehicles using cameras mounted at low heights that overcomes the limitations of the above methods and systems and that can process images incrementally in real time without being affected by spillover, occlusion, and shadows. The system and method should work in dense traffic and under varied lighting and weather conditions.
Another problem in vision-based tracking systems involves calibrating the cameras used to obtain visual data. Camera calibration is an essential step in vision-based vehicle tracking, both to measure speeds and to improve the accuracy of the tracking techniques used to obtain vehicle counts. Because some systems depend on the camera height and position, calibration procedures must be performed before vehicles can be detected whenever a camera is set up or moved. This may preclude full use of movable cameras, such as pan-tilt-zoom (PTZ) cameras, since the system must be recalibrated each time the camera view changes.
Automatic calibration would not only reduce the tedium of installing fixed cameras but would also enable the use of PTZ cameras without recalibration whenever the camera moves. Dailey et al. (D. Dailey, F. W. Cathy, and S. Pumrin, “An algorithm to estimate mean traffic using uncalibrated cameras,” in Proceedings of the IEEE Conference on Intelligent Transportation Systems, pages 98-107, 2000) relate pixel displacement to real-world units by fitting a linear function to scaling factors obtained from a known distribution of typical vehicle lengths. Sequential image frames are subtracted, and vehicles are tracked by matching the centroids of the resulting blobs. At low camera heights, however, spillover and occlusion cause the blobs to merge. Schoepflin and Dailey (Todd N. Schoepflin and Daniel J. Dailey, “Dynamic calibration of Pan-Tilt-Zoom Cameras for Traffic Monitoring,” in IEEE Transactions on Intelligent Transportation Systems, Vol. 4(2), pages 90-98, June 2003) dynamically calibrate PTZ cameras using lane activity maps computed by frame differencing. Under this approach, spillover is a serious problem for moderate to large pan angles, and the error only increases at lower camera heights. When the camera is placed 20 feet above the ground, estimating lanes from activity maps is impossible at pan angles as small as 10°, because of the large amount of spillover and occlusion that occurs.
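The scale-fitting idea attributed to Dailey et al. can be sketched as a least-squares line fit. The sketch assumes that each observation pairs an image row with the observed pixel length of a vehicle at that row, that the scaling factor (meters per pixel) varies linearly with image row, and that a single mean vehicle length (5.0 m here) stands in for the known length distribution. These assumptions and the names `fit_scale_function` and `displacement_to_meters` are hypothetical and are not taken from the cited paper.

```python
import numpy as np

def fit_scale_function(rows, pixel_lengths, mean_vehicle_length_m=5.0):
    """Fit scale(row) = a * row + b, in meters per pixel (hypothetical model).

    Each observed vehicle yields one scaling factor: the assumed mean
    vehicle length divided by its measured length in pixels at that row.
    """
    scales = mean_vehicle_length_m / np.asarray(pixel_lengths, dtype=float)
    a, b = np.polyfit(np.asarray(rows, dtype=float), scales, deg=1)
    return a, b

def displacement_to_meters(a, b, row, pixel_displacement):
    """Convert a pixel displacement observed near a given image row into meters."""
    return (a * row + b) * pixel_displacement

# Hypothetical observations: vehicles appear shorter (in pixels) higher in the image.
rows = [100, 200, 300, 400]
pixel_lengths = [45.5, 23.8, 16.1, 12.2]
a, b = fit_scale_function(rows, pixel_lengths)
print(displacement_to_meters(a, b, 250, 12.0))
```

Dividing such a displacement by the frame interval would give a speed estimate; at low camera heights, merged blobs corrupt the measured pixel lengths and therefore the fitted scale.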
In an alternative approach, Song et al. (Kai-Tai Song and Jen-Chao Tai, “Dynamic calibration of roadside traffic management cameras for vehicle speed estimation,” in IEEE Transactions on Systems, Man, and Cybernetics, Vol. 36(5), October 2006) use edge detection to find the lane markings in a static background image, from which the vanishing point is estimated; the method also assumes that the camera height and lane width are known in advance. This method requires the lane markings to be visible, which may not be the case under poor lighting or weather conditions. In addition, estimating the static background is not always possible when traffic is dense, since acquiring a good background image takes time. Moreover, background subtraction does not work well at low camera heights due to occlusion and spillover, as noted above.
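Estimating a vanishing point from detected lane markings can be sketched with homogeneous coordinates: each marking segment defines a line as the cross product of its two endpoints, and the vanishing point is taken as the point that best satisfies all line equations in the algebraic least-squares sense, found via SVD. The segment coordinates and the function names `line_through` and `vanishing_point` are hypothetical, and the edge detection, background estimation, and use of camera height and lane width are not shown.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line (a, b, c) through image points p and q: a*x + b*y + c = 0."""
    return np.cross([*p, 1.0], [*q, 1.0])

def vanishing_point(segments):
    """Least-squares intersection of the lines through the given segments.

    Each line is scaled so that a^2 + b^2 = 1; the vanishing point is the
    right singular vector with the smallest singular value. If the markings
    are parallel in the image, the third coordinate is ~0 (point at infinity).
    """
    lines = np.array([line_through(p, q) for p, q in segments], dtype=float)
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(lines)
    vp = vt[-1]
    return vp[:2] / vp[2]

# Hypothetical lane-marking segments (pixel endpoints) converging toward the horizon.
segments = [((100, 400), (210, 250)),
            ((540, 400), (430, 250)),
            ((320, 400), (320, 300))]
print(vanishing_point(segments))
```

With noisy real detections the lines will not meet exactly, which is why a least-squares solution over several markings is used rather than intersecting a single pair.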
Thus, a further need exists for a system and method for automatically calibrating a camera mounted at a low angle to the road that overcomes the limitations of the above methods: one that requires neither pavement markings nor prior knowledge of the camera height or lane width, is unaffected by spillover, occlusion, and shadows, and works in dense traffic and under varied lighting and weather conditions.