1. Field of the Invention
The present invention is a method and system for selecting and storing videos by applying semantically meaningful selection criteria to the track sequences of the trips made by people in an area covered by multiple overlapping cameras. Video segments are selected based on semantically meaningful and domain-specific selection criteria for statistical behavior analysis, and the size of the stored video data is efficiently reduced.
2. Background of the Invention
In the prior art, for example in the field of surveillance systems, there have been attempts to manage stored data in a more intelligent way.
U.S. Pat. No. 5,689,442 of Swanson, et al. (hereinafter Swanson) disclosed a surveillance system that comprises an event sensor, a control processor, and an environment sensor. The event sensor captures images and sounds concerning events for storage, which is managed by a data management functionality. The environment sensor detects event conditions, such as temperature, speed, or motion, in the environment. Using the detected conditions, the control processor determines whether the images and sounds acquired by the event sensor comprise an event of interest. A mode control functionality in the control processor processes the sensed conditions to identify the occurrence of events of interest. A data management functionality in the control processor manages the stored information by selectively accessing the information and deleting unwanted information to make room in the storage for subsequently captured information.
Although Swanson mentioned motion as a criterion for identifying events, Swanson is foreign to the idea of using spatiotemporal constraints applied to the track sequences of people for selecting video segments in video streams that facilitate statistical behavior analysis. For example, Swanson mentioned examples of data management schemes in which older frames that are away from the events of interest are deleted from storage. In this approach, the term “time” is used to indicate how old the frames are in the storage, and it is entirely different from the temporal constraints that are applied to the track sequences in the real-world coordinate domain as noted in the present invention.
U.S. Pat. No. 5,969,755 of Courtney (hereinafter Courtney) disclosed a method to provide automatic content-based video indexing from object motion. Moving objects are detected in a video sequence using a motion segmentor. Then, Courtney extracted tracking information from the segmented video. A motion analyzer analyzes the object tracking information and annotates the motion graph with index marks describing eight events of interest, such as appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of objects. Later, when a user submits a query in the form of a spatiotemporal event or object-based event, Courtney retrieved the appropriate video clip from the video database.
Courtney is directed to a content-based video data retrieval system. Courtney noted that clips of the video identified by spatiotemporal queries, along with event and object-based queries, are recalled to view the desired video. However, Courtney is foreign to the idea of storing video data based on selection criteria, including spatiotemporal constraints, which are applied to the tracking information of people in the video data, and of removing the video segments that do not contain tracking information from the video data, thus reducing the size of the stored video data. The eight events of interest defined for video indexing in Courtney are clearly different from spatiotemporal constraints and statistical behavior analysis.
U.S. Pat. No. 6,421,080 of Lambert (hereinafter Lambert) disclosed a method for use with a multi-camera surveillance system that provides pre-event recording. A recording system that records images only when particular events occur reduces the amount of video data by only storing images that have a high probability of being of interest. However, such an event-triggered system does not begin recording until after the occurrence of one of the events. This ensures that the period of time just preceding the event will never be recorded. Therefore, Lambert stored acquired video data in temporary storage that has a capacity large enough to hold the video acquired during a predefined period of time. When the triggering event was detected, Lambert tried to associate the temporarily stored video data with the triggered event video data. Lambert noted transactions at the point-of-sale (POS) terminal or automated teller machine (ATM), output signals from motion sensors or security alarms, and a control signal sent by a remote computer system as examples of triggering events.
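The pre-event recording idea described above can be illustrated with a bounded buffer. The following is only a minimal sketch under stated assumptions, not Lambert's implementation: the class name `PreEventRecorder`, its methods, and the use of integer frame identifiers in place of real video frames are all hypothetical.

```python
from collections import deque


class PreEventRecorder:
    """Hypothetical sketch: buffer recent frames, keep them when an event fires."""

    def __init__(self, pre_event_frames):
        # Temporary storage sized to hold the predefined pre-event period.
        self.buffer = deque(maxlen=pre_event_frames)
        self.saved = []          # stands in for permanent storage
        self.recording = False

    def on_frame(self, frame):
        if self.recording:
            self.saved.append(frame)
        else:
            # When the buffer is full, the oldest frame is dropped automatically.
            self.buffer.append(frame)

    def on_trigger(self):
        # Associate the temporarily stored frames with the event recording.
        if not self.recording:
            self.saved.extend(self.buffer)
            self.buffer.clear()
            self.recording = True

    def on_event_end(self):
        self.recording = False


recorder = PreEventRecorder(pre_event_frames=3)
for frame_id in range(10):
    if frame_id == 6:
        recorder.on_trigger()    # e.g. a signal from a POS terminal
    recorder.on_frame(frame_id)
print(recorder.saved)            # [3, 4, 5, 6, 7, 8, 9]
```

Frames 0 through 2 are discarded because they fall outside the three-frame pre-event window, while frames 3 through 5 are retained even though they precede the trigger.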
Lambert disclosed a novel method for storing event-based video data in real-time. Lambert also disclosed a method for managing storage based on the triggering events. However, Lambert is foreign to the idea of analyzing and indexing the objects in the video data, whereas in the present invention, a person's trip information is indexed to link to video data, where the trip information is selected based on the application of the spatiotemporal constraints. In other words, Lambert is foreign to the idea of selecting video segments with spatiotemporal trajectory analysis.
U.S. Patent Application Publication No. US 2005/0180603 of Zoghlami, et al. (hereinafter Zoghlami) disclosed a method and system for efficiently searching for events in a video surveillance sequence. Zoghlami tried to detect an object's appearance or disappearance in the presence of illumination changes or occlusion. In order to detect an event, Zoghlami extracted a series of snapshots of the video sequence at regular intervals as a sampling of the video sequence. Then, Zoghlami defined one or more windows-of-interest (WOI) in each snapshot, and measured the similarity in each WOI in each snapshot. If the similarity exceeded the threshold, the snapshot was regarded as an event, after verifying that the snapshot did not include an occlusion.
Zoghlami is primarily directed to detecting appearance or disappearance events based on the similarity measure, and Zoghlami is foreign to the idea of making decisions about events using the trajectories of objects in the video data. Furthermore, Zoghlami is foreign to the idea of storing video data with sampling based on the semantics and application of spatiotemporal criteria on the track sequences detected in the video data. Instead, Zoghlami stores the whole video data for searching for the event.
U.S. Patent Application Publication No. 2007/0282665 of Buehler, et al. (hereinafter Buehler) disclosed a method for video surveillance systems that receives data from multiple unrelated locations, performs various aggregation, normalization, and/or obfuscation processes on the data, and provides the summarized data to entities that have an interest in the activities at the store. Interesting activities can be determined by a set of rules that contain site-specific components and site-independent components. A site-specific component can specify locations about the sites, floor plan data, and sensor ID data. A site-independent component can specify actions occurring at the sites, objects placed about the sites, or people interacting with objects about the site. Buehler further extended the idea to alerting on the occurrence of events, transmitting rules to local sites via a networking environment, and storing data, such as video surveillance data, the rules, and the results of analyses.
Buehler disclosed an event-based video surveillance system in a retail space. Buehler is foreign to the idea of storing selected video data based on the semantic sampling and application of spatiotemporal criteria on the track sequences of people in the video data. Buehler only suggested a video storage module that stores video surveillance data and a rules/metadata storage module that stores the rules and metadata captured from the video surveillance system, without teaching any reduction of the video data size based on an application of spatiotemporal constraints.
El-Alfy, H., Jacobs, D., and Davis, L. 2007. Multi-scale video cropping. In Proceedings of the 15th international Conference on Multimedia (Augsburg, Germany, Sep. 25-29, 2007). MULTIMEDIA '07. ACM, New York, N.Y., 97-106, (hereinafter El-Alfy) disclosed a method that crops videos to retain the regions of greatest interest, while also cutting from one region of the video to another, to provide coverage of all activities of interest. El-Alfy chose a trajectory that a small sub-window can take through the video, selecting the most important parts of the video for display on a smaller monitor. The most important parts can be extracted by finding maximum motion energy. Then, a cropping window captured a salient object's movement continuously. This cropping window is later collaged with original video data so that the operator can watch not only the object's activities but also the whole video.
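As a rough illustration of selecting a sub-window by maximum motion energy, the sketch below exhaustively searches a single frame's motion-energy grid for the window with the largest total. It is an assumption-laden simplification: El-Alfy optimizes a window trajectory through the whole video, whereas this hypothetical `best_crop` function handles only one frame, and the grid values are made up for the example.

```python
from typing import List, Tuple


def best_crop(energy: List[List[float]], win_h: int, win_w: int) -> Tuple[Tuple[int, int], float]:
    """Return ((row, col), total) of the win_h x win_w window with maximum summed energy."""
    rows, cols = len(energy), len(energy[0])
    best_pos, best_total = (0, 0), float("-inf")
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            total = sum(energy[r + dr][c + dc]
                        for dr in range(win_h)
                        for dc in range(win_w))
            if total > best_total:
                best_pos, best_total = (r, c), total
    return best_pos, best_total


# Hypothetical per-cell motion energy for one frame.
energy = [
    [0, 0, 0, 0],
    [0, 1, 3, 0],
    [0, 2, 4, 0],
    [0, 0, 0, 1],
]
print(best_crop(energy, 2, 2))   # ((1, 1), 10)
```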
El-Alfy is also directed to helping an operator detect a salient object in the videos and is not related to a video storage system. El-Alfy used a kind of spatial constraint for obtaining a salient object's movement. However, El-Alfy stored not only the cropping window but also the whole video data in the storage. In other words, El-Alfy is foreign to the idea of storing only the selected video data that satisfy predefined selection criteria, thereby reducing the amount of stored video in the database.
Girgensohn, A., Kimber, D., Vaughan, J., Yang, T., Shipman, F., Turner, T., Rieffel, E., Wilcox, L., Chen, F., and Dunnigan, T. 2007. DOTS: support for effective video surveillance. In Proceedings of the 15th international Conference on Multimedia (Augsburg, Germany, Sep. 25-29, 2007). MULTIMEDIA '07. ACM, New York, N.Y., 423-432, (hereinafter Girgensohn) disclosed a multi-camera surveillance system called Dynamic Object Tracking System (DOTS) for tracking people of interest. Girgensohn extracted people's trajectories and mapped them onto a 2D or 3D model of the background. Then, Girgensohn detected events and indexed them in the database. These events could later be reviewed by the operator. Girgensohn also used face recognition for distinguishing tracked objects so that DOTS can later provide each person's tracking and event information. Girgensohn indexed and marked the events for review in a single or multi-camera surveillance system.
Although Girgensohn mentioned a technique that reduces the frame rate or quality of recorded video during less interesting periods, Girgensohn is foreign to the idea of completely removing the video segments that do not contain track sequences and the video segments in which the track sequences do not satisfy the predefined selection criteria, including spatiotemporal constraints, in a specific domain. The present invention proposes a novel approach of storage reduction based on an application of selection criteria, such as spatiotemporal constraints, to the track sequences of people in the real-world physical domain, which is different from compression by reducing the frame rate or quality of video.
Some prior art for video retrieval systems shows approaches for efficiently finding target contents in a database. However, this prior art lacks the idea of reducing the size of the stored video data in the database based on the application of spatiotemporal constraints to the track sequences of people that appear in the video data.
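To make the contrast with frame-rate or quality reduction concrete, the sketch below keeps only the time intervals covered by track sequences that satisfy a selection criterion and implicitly drops everything else. It is a minimal illustration under assumptions, not the claimed method: the rectangular region, the minimum dwell time, the `(time, x, y)` track format, and the function names are all hypothetical choices.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]           # (time_s, x_m, y_m) in real-world coordinates
Region = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def dwell_time(track: List[Point], region: Region) -> float:
    """Time spent inside the region, summed over consecutive samples."""
    x0, y0, x1, y1 = region
    total = 0.0
    for (t_a, x_a, y_a), (t_b, _, _) in zip(track, track[1:]):
        if x0 <= x_a <= x1 and y0 <= y_a <= y1:
            total += t_b - t_a
    return total


def select_segments(tracks: List[List[Point]], region: Region,
                    min_dwell_s: float) -> List[Tuple[float, float]]:
    """Merged (start, end) intervals of tracks meeting the assumed constraint."""
    spans = sorted((trk[0][0], trk[-1][0]) for trk in tracks
                   if trk and dwell_time(trk, region) >= min_dwell_s)
    merged: List[Tuple[float, float]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


tracks = [
    [(0.0, 1.0, 1.0), (5.0, 2.0, 2.0), (10.0, 2.0, 2.0), (15.0, 9.0, 9.0)],  # dwells 15 s
    [(12.0, 8.0, 8.0), (20.0, 9.0, 9.0)],                                    # never in region
    [(40.0, 3.0, 3.0), (56.0, 4.0, 4.0), (57.0, 8.0, 8.0)],                  # dwells 17 s
    [(100.0, 1.0, 1.0), (102.0, 1.0, 1.0)],                                  # dwells only 2 s
]
keep = select_segments(tracks, region=(0.0, 0.0, 5.0, 5.0), min_dwell_s=10.0)
print(keep)   # [(0.0, 15.0), (40.0, 57.0)]
```

In this example only 32 seconds of the 102-second span would be retained; video outside the returned intervals, including periods with no tracks at all, would not be stored.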
Pingali, G., Opalach, A., and Carlbom, I. 2001. Multimedia retrieval through spatio-temporal activity maps. In Proceedings of the Ninth ACM international Conference on Multimedia (Ottawa, Canada, Sep. 30-Oct. 5, 2001). MULTIMEDIA '01, vol. 9. ACM, New York, N.Y., 129-136, (hereinafter Pingali) disclosed a method for interactive media retrieval by combining a spatiotemporal activity map with domain-specific event information. The activity map is a visual representation of spatiotemporal activity functions that can be computed from trajectories of an object's motion in an environment. Using the activity map, Pingali expected that a user would interact more intuitively with the multimedia database to retrieve the media streams, trajectories, and statistics of trajectories.
Pingali is directed to providing an efficient method for retrieving the target content using the activity maps. Although Pingali is an exemplary approach that tries to provide a solution for identifying and retrieving the relevant content in very large quantities of multimedia data, such as videos captured by multiple video cameras, Pingali is foreign to a storage scheme that produces compact video streams by removing video segments that do not contain track sequences of people or moving objects from multiple video streams.
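One simple activity function that can be computed from trajectories is an occupancy count per floor cell. The sketch below is an assumed, minimal example of such a map, not Pingali's formulation: the grid resolution, the `(time, x, y)` sample format, and the function name `activity_map` are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]   # (time_s, x_m, y_m)


def activity_map(trajectories: List[List[Sample]], width_m: float,
                 height_m: float, cell_m: float) -> List[List[int]]:
    """Count trajectory samples falling in each grid cell (rows index y, columns index x)."""
    cols = int(width_m // cell_m)
    rows = int(height_m // cell_m)
    grid = [[0] * cols for _ in range(rows)]
    for traj in trajectories:
        for _, x, y in traj:
            c, r = int(x // cell_m), int(y // cell_m)
            if 0 <= r < rows and 0 <= c < cols:   # ignore samples outside the area
                grid[r][c] += 1
    return grid


trajectories = [
    [(0.0, 0.5, 0.5), (1.0, 1.5, 0.5), (2.0, 1.5, 1.5)],
    [(0.0, 1.2, 0.2), (1.0, 1.8, 1.9)],
]
grid = activity_map(trajectories, width_m=2.0, height_m=2.0, cell_m=1.0)
print(grid)   # [[1, 2], [0, 2]]
```

Cells with higher counts correspond to locations visited more often, which is the kind of summary a user could click on to retrieve the associated trajectories or media.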
Chen, X. and Zhang, C. 2007. Interactive mining and semantic retrieval of videos. In Proceedings of the 8th international Workshop on Multimedia Data Mining: (Associated with the ACM SIGKDD 2007) (San Jose, Calif., Aug. 12-12, 2007). MDM '07. ACM, New York, N.Y., 1-9, (hereinafter Chen) disclosed an interactive framework for semantic surveillance mining and retrieval. Chen extracted content features and the moving trajectories of objects in the video. After that, general user-interested semantic events are modeled. Based on the modeling of events, Chen tried to retrieve semantic events on the video data by analyzing the spatiotemporal trajectory sequences. Retrieval of events is dynamically learned by interacting with the user.
Chen is also foreign to the idea of storing the selected event-based videos after applying spatiotemporal constraints to the track sequences of people in the videos, and Chen is not related to a multi-camera system.