The present disclosure relates to a system and method for creating a spatiotemporal image representation with information content that enables accurate and rapid processing for detection of objects, events or activities. The disclosure finds particular application in connection with spatiotemporal (ST) image representation of a sequence of video images. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
In the computer vision community, spatiotemporal processing techniques have recently gained more attention. Different methods have been proposed and studied by researchers for various applications.
Videos are usually viewed and processed by humans sequentially, with each 2-D video frame (two spatial dimensions) played in time order. Traditional spatiotemporal representations of videos are usually obtained by stacking 1-D (spatial) signals, one obtained from each 2-D video frame, in time order. The resulting 2-D ST images are therefore composed of one spatial dimension and one time dimension. One traditional way to obtain the 1-D spatial signals is to extract a row or column from the same location in each video frame, where the axis of the row or column becomes the spatial dimension. Successive frames yield successive extractions, which are stacked in time order to form a 2-D ST representation. Another known method for producing an ST representation from a video sequence projects (sums or averages) the whole or part of each 2-D video frame along one spatial dimension, reducing the frame from two spatial dimensions to a signal with one spatial dimension. The sequence of 1-D signals resulting from these projections is then stacked to form the 2-D ST image.
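The two traditional constructions described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists indexed as video[t][row][col]; the function names and the toy clip are hypothetical and are introduced only for illustration.

```python
def st_image_by_row(video, row_index):
    """Extract the same row from every frame and stack the rows in time order.

    Returns a 2-D ST image indexed as st[t][col].
    """
    return [list(frame[row_index]) for frame in video]


def st_image_by_projection(video):
    """Average each frame along its vertical axis and stack the results in time order.

    Each frame collapses to one value per column, so the result is again st[t][col].
    """
    st = []
    for frame in video:
        n_rows = len(frame)
        n_cols = len(frame[0])
        st.append([sum(frame[r][c] for r in range(n_rows)) / n_rows
                   for c in range(n_cols)])
    return st


def make_toy_video(n_frames=4, height=3, width=6):
    """Hypothetical clip: a bright vertical bar that moves right one pixel per frame."""
    return [[[255 if c == t else 0 for c in range(width)]
             for _ in range(height)]
            for t in range(n_frames)]


video = make_toy_video()
st_rows = st_image_by_row(video, row_index=1)   # 4 time steps x 6 columns
st_proj = st_image_by_projection(video)         # same shape, averaged values
```

In both ST images of this toy clip, the moving bar appears as a diagonal streak whose slope reflects its speed.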
Many existing applications using spatiotemporal representations of videos are focused on characterizing camera motion. One approach is directed at extracting the camera motion velocity, which corresponds to an orientation in spatiotemporal space, using a set of quadrature pairs of linear filters. The examples considered consist of rigid patterns moving constantly in one direction with no background clutter. Other approaches have relied on different algorithms to estimate camera motion in the spatiotemporal domain. In one approach, video tomography is used to extract lens zoom, camera pan and camera tilt information from a video sequence, using the Hough transform to compute a linear camera model in the spatiotemporal images. A similar method analyzes a video sequence to characterize camera motion and involves determining a 2-D spatiotemporal representation in which trace lines are determined by quantizing the 2-D spatiotemporal representation and finding boundaries between the quantized regions. Camera motion is then inferred by analyzing the pattern of the trace lines using Hough transforms.
Some other attempts have been directed at trying to detect moving objects in a constrained situation, where objects usually move at a constant velocity in front of a static camera. These algorithms often involve detecting straight lines or planes by using the Hough transform. In one example, the gait patterns generated by walking humans are analyzed using XT (spatiotemporal) slices. The Hough transform is then used to locate the straight line patterns in XT slices. In another example, a perceptual organization-based method is used to describe the motion in terms of compositions of planar patches in the 3-D spatiotemporal domain.
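The link between constant velocity and straight lines can be sketched as follows. This is a simplified illustration, not the method of any of the cited approaches: it assumes that edge points (x, t) have already been extracted from an XT slice, and the function names, the synthetic track and the bin sizes are all hypothetical.

```python
import math


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes for lines x*cos(theta) + t*sin(theta) = rho.

    theta is sampled at n_theta evenly spaced angles in [0, pi);
    rho is quantized into bins of width rho_step.
    """
    votes = {}
    for x, t in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + t * math.sin(theta)
            key = (k, round(rho / rho_step))
            votes[key] = votes.get(key, 0) + 1
    return votes


def velocity_from_theta(theta):
    """Slope dx/dt of the line with normal angle theta (pixels per frame)."""
    c = math.cos(theta)
    if abs(c) < 1e-9:
        return math.inf  # line parallel to the x axis of the slice
    return -math.sin(theta) / c


# Synthetic track: an object moving at 2 pixels per frame, x = 2*t + 3.
points = [(2 * t + 3, t) for t in range(10)]
votes = hough_lines(points)
(best_k, best_rho_bin), best_count = max(votes.items(), key=lambda kv: kv[1])
velocity = velocity_from_theta(math.pi * best_k / 180)
```

With the coarse one-degree angular sampling assumed here, the recovered velocity is only approximately 2 pixels per frame. Finer bins trade computation for accuracy.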
Yet another approach is directed at analyzing object motion in less constrained videos using spatiotemporal slices. This method involves using structure tensor analysis to first estimate the local orientation in each spatiotemporal slice. A 7-bin 2-D tensor histogram is then formed and the detected dominant motion is used as the background motion to reconstruct the background image in the spatial domain. Background subtraction is then used to roughly detect the foreground objects, and the results are further refined using color information.
More recently, researchers have started to use spatiotemporal features obtained directly from the 3-D video volume to assist action, activity and behavior detection and recognition.
Other than applications in characterizing camera motion, detecting/tracking a moving object and representing local volume features, spatiotemporal related methods are also used in video stabilization, visual attention extraction, block matching, parked car detection and human counting, for example.
One application in which computer vision algorithms are becoming important is video-based parking occupancy detection. Previous methods of parking occupancy detection have employed machine-learning based approaches that typically involve an offline training phase, which can be time consuming because vehicle and background samples must be acquired. In addition, the online video processing phase of these methods can be slow when computational power is limited.
Many approaches for parking occupancy detection are directed to parking lot occupancy detection. Compared to other sensors that have been adopted in different occupancy detection applications (for example, ultrasonic sensors, laser scanners, radar/lidar/ground radar sensors, magnetic field sensors, passive infrared sensors, microwave sensors, piezoelectric axle sensors, pneumatic road tubes, inductive loops, weigh-in-motion systems and Vehicular Ad Hoc Network (VANET) based systems), camera sensing has both advantages and disadvantages when used for occupancy detection. In general, cameras provide more abundant information about the parking lots or spaces than other sensors, which makes it possible to integrate other tasks, such as license plate detection and recognition, vehicle type classification, vandalism and loitering detection, and law enforcement, in one system. Using cameras is also likely to reduce the cost of a system due to their wide sensing range. Ease of installation is another advantage of camera-based methods. However, camera-based systems usually require more computational resources than other sensor-based methods and hence consume more energy. These requirements have generally prohibited large scale deployment of camera sensors for parking occupancy detection.
Different video/image based parking occupancy detection methods have been proposed. However, most of these methods are based on 2-D spatial processing, which in general must accommodate more noise and variation than processing in a reduced-dimension domain such as spatiotemporal processing. In the prior systems, the computer algorithms have not been fast and light enough to be embedded inside cameras. As such, video data have to be transferred to a data processing center. This transfer delays data processing and often involves VPN rental, which adds to the operating cost. One day's video from one camera is on average about 10 GB, so transferring every day's video data to the data processing center consumes substantial network and storage resources. Even after the data have been transferred to a server, the processing speed of the above-mentioned video-based occupancy detection methods is too slow for large scale deployment. In one application, six computers are used for data storage and one server is used to process the data from three cameras. Data transmission delay is on average about 2 minutes. One server can barely process the data streams from three cameras in real time (2.5 fps×3). This kind of centralized system is not cost efficient.
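The figures quoted above can be put in perspective with a short calculation. The sketch assumes decimal gigabytes (10^9 bytes) and video generated evenly over 24 hours; neither assumption is stated in the source, and the helper name is hypothetical.

```python
BYTES_PER_GB = 10**9               # assumption: decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mbit_per_s(gb_per_day):
    """Average link rate needed to move gb_per_day of video within one day."""
    return gb_per_day * BYTES_PER_GB * 8 / SECONDS_PER_DAY / 1e6


per_camera_mbit = sustained_mbit_per_s(10)   # roughly 0.93 Mbit/s per camera
server_fps = 2.5 * 3                         # 7.5 frames/s handled by one server
```

Under these assumptions each camera needs a sustained uplink of a little under 1 Mbit/s, and one server handles about 7.5 frames per second in total, which illustrates why the centralized arrangement scales poorly with the number of cameras.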