1. Field of the Invention
The present invention is a method and system for automatically analyzing the trip of people in a physical space, such as a retail space, by capturing a plurality of input images of the people by a plurality of means for capturing images, tracking the people in each field of view of the plurality of means for capturing images, mapping the trip onto the coordinates of the physical space, joining the plurality of tracks across the multiple fields of view of the plurality of means for capturing images, and finding information for the trip of the people based on the processed results from the plurality of tracks.
2. Background of the Invention
Shoppers' Behavior Analysis
There have been earlier attempts to understand customers' shopping behaviors captured in a video in a targeted environment, such as a retail store, using cameras.
U.S. Pat. Appl. Pub. No. 2006/0010028 of Sorensen (hereinafter Sorensen 2006/0010028) disclosed a method for tracking shopper movements and behavior in a shopping environment using a video. In Sorensen 2006/0010028, a user indicated a series of screen locations in a display at which the shopper appeared in the video, and the series of screen locations were translated to store map coordinates. The step of receiving the user input via input devices, such as a pointing device or keyboard, makes Sorensen 2006/0010028 inefficient for handling a large amount of video data in a large shopping environment with a relatively complicated store layout, especially over a long period of time. Manual input by a human operator/user cannot efficiently track all of the shoppers in such cases, and it is also prone to human errors caused by tiredness and boredom. In addition, the manual input approach does not scale with the number of shopping environments to be handled.
Although U.S. Pat. Appl. Pub. No. 2002/0178085 of Sorensen (hereinafter Sorensen 2002/0178085) disclosed the usage of a tracking device and store sensors in a plurality of tracking systems primarily based on wireless technology, such as RFID, Sorensen 2002/0178085 is clearly foreign to the concept of applying computer vision-based tracking algorithms to the field of understanding customers' shopping behaviors and movements.
In Sorensen 2002/0178085, each transmitter was typically attached to a handheld or push-type cart. Therefore, Sorensen 2002/0178085 cannot distinguish the behaviors of multiple shoppers using one cart from the behavior of a single shopper also using one cart. Although Sorensen 2002/0178085 disclosed that the transmitter may be attached directly to a shopper via a clip or other form of customer surrogate in order to help in the case where the customer is shopping without a cart, this is not practical because it imposes an additional cumbersome step on the shopper and because managing a transmitter for each individual shopper is inefficient.
U.S. Pat. No. 6,741,973 of Dove, et al. (hereinafter Dove) disclosed a model of generating customer behavior in a transaction environment. Although Dove disclosed video cameras in a real bank branch as a way to observe the human behavior, Dove is clearly foreign to the concept of automatic and real-time analysis of the customers' behaviors based on visual information of the customers in a retail environment, such as shopping path tracking and analysis.
U.S. Pat. Appl. Pub. No. 2003/0053659 of Pavlidis, et al. (hereinafter Pavlidis) disclosed a method for moving object assessment, including an object path of one or more moving objects in a search area, using a plurality of imaging devices and segmentation by background subtraction. In Pavlidis, the object included customers. Pavlidis was primarily related to monitoring a search area for surveillance.
U.S. Pat. Appl. Pub. No. 2004/0120581 of Ozer, et al. (hereinafter Ozer) disclosed a method for identifying the activity of customers for marketing purposes or the activity of objects in a surveillance area, by comparing the detected objects with the graphs from a database. Ozer tracked the movement of different object parts and combined them into high-level activity semantics, using several Hidden Markov Models (HMMs) and a distance classifier. U.S. Pat. Appl. Pub. No. 2004/0131254 of Liang, et al. (hereinafter Liang) also disclosed Hidden Markov Models (HMMs) as a way, along with rule-based label analysis and a token parsing procedure, to characterize behavior. Liang disclosed a method for monitoring and classifying actions of various objects in a video, using background subtraction for object detection and tracking. Liang is particularly related to animal behavior in a lab for testing drugs. Neither Ozer nor Liang disclosed a method or system for tracking people in a physical space using multiple cameras.
Activity Analysis in Various Other Areas, Such as Surveillance Application
There have been earlier attempts at activity analysis in areas other than understanding customers' shopping behavior, such as surveillance and security applications.
The following prior arts are not restricted to the application area of understanding customers' shopping behaviors in a targeted environment, but they disclosed general methods for modeling and analyzing the activity of a human body using a video.
U.S. Pat. Appl. Pub. No. 2002/0085092 of Choi, et al. (hereinafter Choi) disclosed a method for modeling an activity of a human body using the optical flow vector from a video and probability distribution of the feature vectors from the optical flow vector. Choi modeled a plurality of states using the probability distribution of the feature vectors and expressed the activity based on the state transition.
U.S. Pat. Appl. Pub. No. 2004/0113933 of Guler disclosed a method for automatic detection of split and merge events from video streams in a surveillance environment. Guler considered split and merge behaviors to be key common simple behavior components for analyzing high-level activities of interest in surveillance applications; these components are also used to understand the relationships among multiple objects, not just individual behavior. Guler used adaptive background subtraction to detect the objects in a video scene, and the objects were tracked to identify the split and merge behaviors. To understand the split and merge behavior-based, high-level events, Guler used a Hidden Markov Model (HMM).
The prior arts lack the features for automatically analyzing the trips of people in a physical space, by capturing multiple input images of the people by multiple means for capturing images and tracking the people in each field of view of the means for capturing images, while joining the track segments across the multiple fields of view and mapping the trips onto the coordinates of the physical space. Essentially, the prior arts lack the features for finding the information of the trips of the people based on the automatically processed results from the plurality of tracks using computer vision algorithms. Therefore, a novel usage of computer vision technologies for understanding the shoppers' trips in a more efficient manner in a physical space, such as a retail environment, is needed.
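The two operations named above, mapping a track onto the coordinates of the physical space and joining track segments across fields of view, can be illustrated with a minimal sketch. It assumes that each camera views a flat floor and has been calibrated with a 3x3 planar homography, and that two segments belong to the same trip when one ends close in time and floor position to where the other begins. The homography model, the function names, the matrices `H_a` and `H_b`, and the thresholds `max_gap_s` and `max_dist_m` are all hypothetical choices made for illustration; the text above does not specify how the mapping or joining is performed.

```python
import math


def to_floor(H, u, v):
    """Map pixel (u, v) to floor coordinates with a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)


def map_track(H, track):
    """Convert a track of (t, u, v) samples into (t, x, y) floor samples."""
    return [(t,) + to_floor(H, u, v) for t, u, v in track]


def can_join(first, second, max_gap_s=2.0, max_dist_m=1.0):
    """Assumed joining rule: `second` starts soon after, and near, the end of `first`."""
    t1, x1, y1 = first[-1]
    t2, x2, y2 = second[0]
    close_in_time = 0 <= t2 - t1 <= max_gap_s
    close_in_space = math.hypot(x2 - x1, y2 - y1) <= max_dist_m
    return close_in_time and close_in_space


# Hypothetical calibrations: 1 pixel = 1 cm, camera B offset 5 m along x.
H_a = [[0.01, 0, 0], [0, 0.01, 0], [0, 0, 1]]
H_b = [[0.01, 0, 5.0], [0, 0.01, 0], [0, 0, 1]]

# Hypothetical per-camera tracks as (time in seconds, pixel u, pixel v).
track_a = map_track(H_a, [(0.0, 100, 200), (1.0, 480, 200)])
track_b = map_track(H_b, [(1.5, 20, 210), (2.5, 300, 220)])

# Join the two segments into one trip if the assumed rule allows it.
trip = track_a + track_b if can_join(track_a, track_b) else track_a
```

In this sketch the segment from the first camera ends at about (4.8, 2.0) on the floor and the segment from the second camera begins half a second later at about (5.2, 2.1), so the two are treated as one trip.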
Tracking Using Multiple Cameras
There have been earlier attempts to detect and track a human body part in a physical space. Background subtraction is one of the exemplary methods to detect the tracked object in the video. There have also been earlier attempts for multiple people tracking in a video.
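As a rough illustration of the background subtraction mentioned above, the following minimal sketch flags pixels whose grayscale intensity differs from a slowly updated background model by more than a threshold. The function names, the running-average update, and the values of `alpha` and `threshold` are assumptions made for illustration and are not taken from any of the cited references.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background model (running average)."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark a pixel as foreground (1) when it departs from the background."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Hypothetical 3x4 grayscale frames: an empty scene, then one bright object.
background = [[10.0] * 4 for _ in range(3)]
frame = [[10, 10, 10, 10], [10, 200, 210, 10], [10, 10, 10, 10]]

mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

The connected foreground pixels in `mask` would then be grouped into candidate objects for a tracker; that grouping step is omitted here.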
However, the prior art video tracking systems heretofore known lack many of the functional performance and robustness capabilities that are needed for understanding the shoppers' trip information in a retail environment, as will be discussed later. Therefore, there is still a need for a novel usage of computer vision technologies for understanding the shoppers' trips in a more efficient manner in a physical space.
U.S. Pat. No. 6,061,088 of Khosravi, et al. (hereinafter Khosravi) disclosed a multi-resolution adaptation system. The primary difference between Khosravi and the present invention is that the present invention sub-samples the image initially, whereas Khosravi breaks down a full frame image into smaller regions, and makes a decision for each region. For the present invention, a region is simply 1 pixel.
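The distinction can be made concrete with a small sketch: sub-sampling keeps one pixel out of every `step` pixels in each direction, so each retained region is a single pixel rather than a block that needs its own decision. The function name and the `step` parameter are hypothetical; the sampling factor and the exact procedure are not specified in the text above.

```python
def subsample(image, step=2):
    """Keep every `step`-th pixel in each row and every `step`-th row."""
    return [row[::step] for row in image[::step]]


# Hypothetical 4x6 grayscale image whose pixel value encodes (row, column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]

small = subsample(image, step=2)  # 2x3 image built from single pixels
```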
U.S. Pat. No. 6,263,088 of Crabtree, et al. (hereinafter Crabtree U.S. Pat. No. 6,263,088) is based on a single camera and designed to track people in a space seen from above. Although the camera calibration data section in Crabtree U.S. Pat. No. 6,263,088 is similar to that of the present invention, the primary difference is that Crabtree U.S. Pat. No. 6,263,088 accepted, as input, information regarding the statistical range of the persons' widths and heights.
U.S. Pat. No. 6,272,250 of Sun, et al. (hereinafter Sun) disclosed a method for clustering pixel data into groups of data of similar color. Sun is based on a single camera or video and requires an elaborate color clustering approach, making the method computationally expensive and not suitable for tracking general targets in 3D. Sun also disclosed the clustering of the intensity values, and the use of a “vigilance value” to control the effective cluster size. Importantly, however, the present invention avoids the computational cost of Sun by using grayscale instead of RGB or YUV.
The method of U.S. Pat. No. 6,394,557 of Bradski is based on using color information to track the head or hand of a person in the view of a single camera. It is well known that the use of only color information in general is insufficient to track small, fast-moving objects in a cluttered environment. The method in Bradski is hence much less general and only workable in certain specialized environments.
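To illustrate why grayscale reduces the cost, the sketch below collapses each three-channel RGB pixel into a single intensity value, so later per-pixel steps handle one number instead of three. The weighted sum uses the commonly used ITU-R BT.601 luma coefficients, which is an assumption; the text above does not state how the grayscale image is obtained, and the function name is hypothetical.

```python
def to_grayscale(rgb_image):
    """Collapse (R, G, B) pixels into one intensity using BT.601 luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# Hypothetical 2x2 RGB image: white, black, pure red, pure green.
rgb = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 255, 0)]]

gray = to_grayscale(rgb)
```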
The method of U.S. Pat. No. 6,404,900 of Qian, et al. (hereinafter Qian) is designed to track human faces in the presence of multiple people. The method is highly specialized to head tracking, making it unsuitable for alternative application domains and targets.
U.S. Pat. Appl. Pub. No. 2005/0265582 of Buehler, et al. (hereinafter Buehler) disclosed a video surveillance system that is capable of tracking multiple objects and monitoring the behavior of the objects. Buehler constructed a track graph, which represented the movement of blobs, based on the observations of the blobs from multiple image sensors for a long period of time, and then Buehler solved the track graph to correspond the blobs to specific objects in the monitored environment.
Although Buehler very briefly suggested “background subtraction” as one of the techniques to separate a foreground object, such as the blob, from the static background, Buehler lacks the details of how to apply the “background subtraction” algorithms from computer vision research to foreground detection.
The prior arts above are not intended for understanding the customers' trip information and their shopping behaviors by tracking and analyzing the movement information of the customers in a targeted environment, such as in a retail store. Therefore, the present invention discloses a novel approach of using computer vision technologies for understanding customers' shopping behaviors by tracking and analyzing the movement information of the customers' trips in a targeted environment.