1. Field of the Invention
The present invention is a method and system for storing videos by track sequences and for selecting video segments by indexing, which enables playback of each individual visitor's entire trip through an area covered by overlapping cameras. The final storage format of the videos is a trip-centered format, which sequences videos from across multiple cameras in a manner that facilitates multiple applications dealing with detailed behavior analysis. The format is efficient and compact, yet it does not lose any video segments that contain the track sequences of the people.
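For illustration only, the trip-centered organization described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the record type TrackSegment, its fields, and the function build_trip_index are hypothetical names introduced here, and the sketch assumes that tracking has already assigned a consistent person identifier across cameras.

```python
from collections import defaultdict
from typing import NamedTuple


class TrackSegment(NamedTuple):
    """Hypothetical record: one person's track within one camera's video."""
    person_id: str
    camera_id: str
    start: float  # seconds from a shared clock (assumed synchronized)
    end: float


def build_trip_index(segments):
    """Group track segments by person and order each group by start time.

    The result maps each person to the sequence of video segments, drawn
    from any camera, that together cover that person's trip.
    """
    trips = defaultdict(list)
    for seg in segments:
        trips[seg.person_id].append(seg)
    for person_id in trips:
        # Overlapping cameras may see the same person at once; ties on
        # start time are broken by camera id so the order is deterministic.
        trips[person_id].sort(key=lambda s: (s.start, s.camera_id))
    return dict(trips)


segments = [
    TrackSegment("p1", "cam2", 12.0, 20.0),
    TrackSegment("p2", "cam1", 5.0, 9.0),
    TrackSegment("p1", "cam1", 3.0, 13.0),
]
trip_index = build_trip_index(segments)
trip_cameras = [s.camera_id for s in trip_index["p1"]]
```

In this sketch, looking up a person in the index yields the ordered list of segments to play back for that person's trip, regardless of which camera recorded each segment.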
2. Background of the Invention
In the video analysis areas, especially in the video-based behavior analysis application areas, the need for a novel method of efficient video storage and access has been shown.
For example, Mongy, S., Bouali, F., and Djeraba, C. 2005. Analyzing user's behavior on a video database, in the Proceedings of the 6th International Workshop on Multimedia Data Mining: Mining integrated Media and Complex Data (Chicago, Ill., Aug. 21-21, 2005). MDM '05. ACM, New York, N.Y., 95-100, (hereinafter Mongy) addresses the need for an intelligent software system that fully utilizes information in users' behavior in interacting with large video databases. Mongy disclosed a framework that combines intra-video and inter-video usage mining to generate a user profile on a video search engine for video usage mining.
However, it needs to be clearly discerned that the topics in Mongy are not for mining the video content itself, but rather for mining the usage of videos by users, i.e., video usage behavior analysis. Therefore, Mongy is entirely foreign to the idea of how to create an efficient video storage, especially based on track sequences of people or moving objects in multiple video streams, in order to facilitate the behavior analysis of the people and moving objects in the video streams.
U.S. Pat. No. 7,383,508 of Toyama, et al. (hereinafter Toyama) disclosed a video abstraction system for displaying short segments of video, called cliplets, each of which represents a single theme or event. A cliplet is a semantically meaningful portion of video containing what a viewer might consider a single short event. A cliplet commonly consists of 3 to 10 seconds of the source video. A boring or uninteresting section of the source may be excluded altogether. Toyama attaches to each cliplet the properties that can be obtained by voice or speaker recognition, face detection, and zoom or pan detection. The properties then help a user to find a specific cliplet. The computer user interface displays the cliplets and the information and properties of each cliplet.
Toyama is primarily directed to providing a more efficient method of editing the source video data. The cliplet allows fast review of the source video data, which helps the user to edit the video data easily. In Toyama, the source video is not modified. Toyama also did not explicitly teach how to connect multiple cliplets in a sequence based on the tracking of people or moving objects. Toyama is foreign to the idea of tracking an individual object in video data.
Toyama is further foreign to the idea of distinguishing the video segments that show track sequences of people or moving objects from the other video segments that do not contain such track sequences. Toyama also explicitly teaches the usage of a constraint application module that limits the minimum and maximum duration of a cliplet, whereas there is no such time constraint for the trip video segments in the present invention. Toyama is also entirely foreign to the idea of creating a collage of video segments from multiple video streams based on a tracking of people or moving objects in the video streams, where each video segment is just a part of a single or multiple events.
U.S. Patent Application Publication No. 2005/0271251 of Russell, et al. (hereinafter Russell) disclosed a method of managing video data storage in a video surveillance system. Russell tried to determine the importance of video data based on decision criteria, such as rules, configuration data, and preferences, to support intelligent automatic reduction of stored surveillance data. Russell deleted, compressed, or archived the images and video data that were determined to be less important. The decision criteria include specific object recognition or event detection. Russell mentioned human detection as a case of event detection. Additionally, Russell stored the event metadata in an event database. This metadata may include event information such as time, location, type of event, and potentially an identification of a person or object.
However, Russell did not use event metadata for making tracking sequences of an object in order to facilitate behavioral pattern analysis of each object like the present invention. Furthermore, Russell is foreign to the idea of compacting video segments based on the trip information of the people or moving objects and creating a compact video stream out of multiple video streams in the way the present invention teaches.
Therefore, it is one of the objectives of the present invention to provide a novel video format that creates a collage of video segments from multiple video streams, based on a tracking of people or moving objects in the video streams, in order to facilitate video annotation or editing for behavior analysis.
Yi, H., Rajan, D., and Chia, L. 2004. A motion based scene tree for browsing and retrieval of compressed videos. In Proceedings of the 2nd ACM international Workshop on Multimedia Databases (Washington, D.C., USA, Nov. 13-13, 2004). MMDB '04. ACM, New York, N.Y., 10-18, (hereinafter Yi) disclosed a method for browsing and retrieving compressed video by sequentially comparing the similarity of the key frames, based on the steps of shot boundary detection, usage of a browsing hierarchy, and video indexing. Yi is primarily concerned with browsing and retrieval of compressed video, and Yi is entirely foreign to the idea of compacting video segments in multiple video streams and creating a collage of the video segments collected from the multiple video streams, based on the tracking of people or moving objects in the fields of view of multiple cameras.
In particular, Yi teaches the idea of using a scene tree hierarchy as a structure to facilitate browsing, where the main idea of the scene tree hierarchy is to sequentially compare the similarity of the key frames between the current shot and previous shots. Yi does not compare the similarity of key frames that belong to different video streams. In the present invention, one of the objectives is to create a collage of video segments that contains trip information of the people or moving objects from multiple video streams, so that the collage further facilitates a behavior analysis process by providing a serialized and logically organized collection of video segments.
Chen, X. and Zhang, C. 2007. Interactive mining and semantic retrieval of videos. In Proceedings of the 8th international Workshop on Multimedia Data Mining: (Associated with the ACM SIGKDD 2007) (San Jose, Calif., Aug. 12-12, 2007). MDM '07. ACM, New York, N.Y., 1-9, (hereinafter Chen) disclosed a method for detecting and retrieving semantic events from surveillance videos. Chen noted that sequentially browsing a large amount of video is time consuming and tedious for users. The goal of Chen is to retrieve semantic events by analyzing the spatiotemporal trajectory sequences. Chen pointed out that, since individual users may have their own subjective query targets, the analysis criteria may be ambiguous when trying to satisfy each individual. Therefore, Chen used an interaction process with users, called Relevance Feedback, to learn the user's interest dynamically. Chen showed a method for semantic video abstraction and retrieval that allowed a user to skip an uninteresting part of the video and to detect a semantically meaningful part by tracking objects and modeling the trajectories of semantic objects in videos.
However, Chen is foreign to the novel data storage format of compact video streams by removing video segments that do not contain track sequences of people or moving objects from multiple video streams.
Jiang, H. and Elmagarmid, A. 1998. Spatial and temporal content-based access to hypervideo databases. The VLDB Journal 7, 4 (December 1998), 226-238, (hereinafter Jiang) disclosed a content-based video retrieval system based on the spatial and temporal relationships between objects in a video database. Using a logical hypervideo data model, i.e., multilevel video abstractions, and the semantic associations of the multilevel video abstractions with other logical video abstractions, Jiang tried to retrieve semantically described video data with predefined spatiotemporal query languages. The multilevel video abstractions in the logical hypervideo data model include video entities that users are interested in, defined as hot objects. Semantic association in Jiang included not only an object's spatial and temporal relationships obtained automatically by object tracking, but also manual annotation of each object. For more efficient browsing and retrieval than sequential browsing, Jiang used a video hyperlink, which is a connection among logical hypervideo data based on semantic association.
However, similar to Chen, Jiang is foreign to the novel data storage format of compact video streams by removing video segments that do not contain track sequences of people or moving objects from multiple video streams for the purpose of facilitating behavior analysis.
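As a hedged illustration of what removing video segments that do not contain track sequences could involve, the following minimal sketch merges the time intervals during which at least one track is present in each camera's stream and keeps only those intervals. The function name kept_intervals and the tuple layout (camera_id, start, end) are assumptions made for this example; they are not taken from the present disclosure or from any cited reference.

```python
from collections import defaultdict


def kept_intervals(track_intervals):
    """Return, per camera, the merged time spans covered by any track.

    track_intervals: iterable of (camera_id, start, end) tuples, in seconds.
    Video outside the returned spans contains no track sequence and is the
    part a compact format could omit.
    """
    by_camera = defaultdict(list)
    for camera_id, start, end in track_intervals:
        by_camera[camera_id].append((start, end))

    merged = {}
    for camera_id, spans in by_camera.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                # Overlaps or touches the previous span: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[camera_id] = [tuple(span) for span in out]
    return merged


tracks = [
    ("cam1", 3.0, 13.0),
    ("cam1", 10.0, 15.0),
    ("cam1", 40.0, 45.0),
    ("cam2", 12.0, 20.0),
]
compact = kept_intervals(tracks)
```

Because overlapping tracks are merged rather than trimmed, every frame that belongs to some track sequence is retained, which is consistent with the stated goal of compaction without losing tracked segments.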
The benefits of the present invention can be utilized in many application areas. For example, U.S. Pat. No. 6,496,856 of Kenner, et al. (hereinafter Kenner) disclosed a method and architecture of a video clip storage and retrieval system that is capable of handling a large number of users at a low cost. Kenner also suggested a new interface between users and a video retrieval system. The typical video on demand (VOD) architecture uses a centralized networking computer system. This architecture, however, is limited or partially “distributed,” which links multiple personal computers together in order to fashion a much larger monolithic functional unit. When a user requests a video clip, the Primary Index Manager sends the request to the other local search unit. If a desired video clip is found, Kenner creates a Data Sequencing Interface with the user's local search unit that is connected to the user's terminal, and sends the found video clip.
Kenner is simply a VOD architecture and explains a method for avoiding high congestion at the VOD server. Kenner is foreign to the idea of storing reduced-size video streams based on track sequences of people or moving objects for the application of behavior analysis. A VOD system like Kenner can utilize the benefits of the present invention in order to reduce the video storage size and deliver the video clips that are truly demanded by the users, if the user's goal for the retrieval is to find video clips that are used for a behavior analysis of the people or moving objects.
In a preferred embodiment, the present invention can primarily benefit various types of applications for analyzing human and moving object behaviors. For example, the present invention can facilitate the behavior analysis of people or moving objects in a retail store or surveillance application.