Event detection is critical to any video analytics surveillance system. Events are often location-dependent, and knowing where an event occurs is as important as knowing when it occurs. For example, during checkout at a grocery store, the cashier repeatedly picks up items from the lead-in belt (pickup), passes them over a scanner to register the purchase (scan), and places them onto the take-away belt area (drop). The pickup-scan-drop sequence is repetitive, but the locations of the pickup and drop operations can vary each time. This un-oriented interaction between the cashier's hand(s) and the belt area poses a problem for learning event models, because features need to be extracted from a known location.
Many event models are built to detect events within a pre-specified region of interest (ROI). In some scenarios, however, defining an appropriate ROI for the model is itself a problem. In the retail example above, the cashier may pick up (or place) products anywhere in the transaction area. An overly large ROI would include many irrelevant features from bagging activity and customer interventions, while an overly small ROI would miss many products that are presented outside of it. In such cases, one could use a sliding window to exhaustively test every possible location, but this approach is extremely inefficient and normally requires a non-trivial post-processing step to merge nearby detections that correspond to the same event.
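To make the cost of the sliding-window alternative concrete, the following is a minimal sketch of it, not a description of any particular system. The frame size, window size, stride, thresholds, and the `score_fn` callback are all hypothetical placeholders for a real event classifier, and the merge step shown is plain greedy overlap suppression, which is only one of several possible post-processing choices.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(frame_w: int, frame_h: int,
                    win_w: int, win_h: int, stride: int) -> List[Box]:
    """Enumerate every window position that fits inside the frame."""
    return [(x, y, win_w, win_h)
            for y in range(0, frame_h - win_h + 1, stride)
            for x in range(0, frame_w - win_w + 1, stride)]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def detect(windows: List[Box], score_fn: Callable[[Box], float],
           threshold: float, iou_limit: float) -> List[Tuple[float, Box]]:
    """Score every window, then greedily merge overlapping detections.

    score_fn is a hypothetical stand-in for an event model evaluated
    on the features inside one window.
    """
    scored = [(score_fn(w), w) for w in windows]  # one model call per window
    hits = sorted((s for s in scored if s[0] >= threshold), reverse=True)
    kept: List[Tuple[float, Box]] = []
    for score, box in hits:
        # Keep a detection only if it does not overlap a stronger one.
        if all(iou(box, k) < iou_limit for _, k in kept):
            kept.append((score, box))
    return kept
```

Under these assumed settings, a 640x480 frame scanned with a 64x64 window at a stride of 16 pixels already yields 999 candidate locations per frame at a single scale, each requiring a model evaluation, and a single true event typically fires in several neighboring windows that must then be merged.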