1. Field of the Invention
The present invention is a method and system to provide an automatic measurement of retail customers' responses to retail elements, based on their facial expressions and behaviors.
2. Background of the Invention
The current consumer and market-oriented economy places a great deal of importance on people's opinions or responses to consumer products or, more specifically, various aspects of the products—product display, packaging, labels, and price. A shopper's interest and attitude toward these elements changes dynamically during engagement and interaction with products, and the end response—such as purchase, satisfaction, etc.—is a final summary of such intermediate changes. Most consumer exposure to such visual cues occurs in retail spaces at an immeasurably high number and frequency. The ability to capture such occurrences and effectively measure consumer responses would provide very valuable information to retailers, marketers, and consumer product manufacturers. Though it is nearly impossible to accurately determine a person's mental response without directly asking about it, a person usually reveals some indications of emotional response through information channels such as facial expressions and bodily gestures. It is usually the expression on the face that has high correlation with the emotional response.
There is also a consensus within the market research community that today's consumers make most of their purchase decisions in stores. Therefore, it is extremely important to understand the decision-making process that goes on within a shopper's mind and, at a deeper level, to understand the kind of emotional changes that lead to a shopper's ultimate decision. These consumer responses can also be analyzed within the context of demographics, which can be automatically measured based on facial images.
In a typical shopping scenario, a shopper browses through retail aisles with an intention to buy certain products. Then she/he notices a product or a product category that catches her/his attention (regardless of whether it was intended or not), approaches the shelf, interacts with products, and makes a decision as to which one to buy or not to buy at all. Different stages in this shopping process involve different kinds of visual elements and corresponding mental or physical responses. In the “gross level interest” stage, the shopper takes notice of visual elements that catch her/his attention from a distance, such as the product category, products in promotion, or promotion signs. If the shopper becomes interested, she/he “engages” with the product or category by approaching and stopping at the shelf. Then she/he directly “interacts” with the intended product or further looks for different options within the category or other categories. The interaction involves checking the price, reading the labels, placing the item in the shopping cart, or returning the item to the shelf. The “fine level interest” of the shopper will reveal which product is currently being considered; typically, picking up the product and/or gaze shows the target of the attention. While these physical cues, such as facing a certain direction or looking at certain products, carry much information about the shopper's target of interest and the level of interest, the facial expression of the shopper often reveals a deeper mental response (favor, reservation, or disfavor) to the visual elements at each stage, especially during interaction with the products. The response expressed on the face is a very important channel for revealing the internal state. Such information has direct relevance to the success of consumer products or product promotions.
On the other hand, the availability of demographic information of each shopper would greatly enrich the analysis, as the shopper response characteristics typically vary with different demographic groups and can provide valuable information for targeted marketing or merchandizing.
The present invention is a method and system to measure the level of shoppers' interest and their mental responses. It utilizes at least one overhead camera to track a shopper's movement and recognize her/his gross-level interest. It also utilizes at least one camera to capture the shopper's face so that the system can measure the gaze and facial expressions.
Recent developments in computer vision and artificial intelligence technology make it possible to detect and track people's behavior from video sequences to further analyze their mental processes, such as intentions, interests, attractions, and opinions. The development in visual tracking technology makes it possible to track shoppers throughout the retail space, and to recognize their engagement and interaction with products. Facial image analysis has especially matured, so that faces can be detected and tracked from video images, and the motion of the head and facial features can also be estimated. In particular, the head orientation and eye gaze can be measured to estimate the fine-level interest of the shopper. The facial appearance changes due to facial expression can also be measured to estimate the internal emotional state of the person. The estimated facial feature locations help to normalize the facial images, so that machine learning-based demographic classifications can provide accurate demographic information (gender, age, and ethnicity). The proposed invention aims to solve these problems under realistic scenarios where people show natural responses toward visual elements belonging to consumer products, such as product display, product information, and packaging. While each instance of such measurement can be erroneous, an accumulated measurement over time will provide reliable information to assess the collective response to a given visual element.
The invention adopts a series of both well-established and novel approaches for facial image processing and analysis to solve these tasks. Body detection and tracking locates shoppers and estimates their movements, so that the system can estimate each shopper's interest in or engagement with products, based on the track of movements. The direction toward which the shopper is facing can also be measured for the same purpose. Face detection and tracking handle the problem of locating faces and establishing correspondences among detected faces that belong to the same person. To be able to accurately locate the facial features, both the two-dimensional (position, size, and orientation) and three-dimensional (yaw and pitch) pose of the face should be estimated. Based on the estimated facial pose, the system normalizes the facial geometry so that facial features (eyes, irises, eyebrows, nose, and mouth) are aligned to standard positions. The estimated positions of irises relative to eyes, along with the estimated head orientation, reveal the shopper's direction of attention. The invention also introduces a novel approach to extract facial appearance changes due to facial expressions: a collection of image gradient filters is designed to match the shapes of facial features or transient features. A filter that spans the whole size of the feature shape does a more robust job of extracting shapes than do local edge detectors, and will especially help to detect weak and fine contours of the wrinkles (transient features) that may otherwise be missed using traditional methods. The set of filters is applied to the aligned facial images, and the emotion-sensitive features are extracted. These features are used to train a learning machine to find the mapping from the appearance changes to facial muscle actions. In an exemplary embodiment, the 32 Action Units from the well-known Facial Action Coding System (FACS, by Ekman & Friesen) are employed.
The recognized facial actions can be translated into six emotion categories: Happiness, Sadness, Surprise, Anger, Disgust, and Fear. These categories are known to reflect more fundamental affective states of the mind: Arousal, Valence, and Stance. The invention assumes that these affective states, if estimated, provide information more directly relevant to the recognition of people's attitudes toward a retail element than do the six emotion categories. For example, the degree of valence directly reveals the positive or negative attitude toward the element. The changes in affective state will then render a trajectory in the three-dimensional affect space. Another novel feature of the invention is to find a mapping from the sequence of affective states to the end response. The central motivation behind this approach is that, while the changes in affective state already contain very useful information regarding the response of the person to the visual stimulus, there can still be another level of mental process to make a final judgment, such as purchase, opinion, or rating. This is the kind of consumer feedback ultimately of interest to marketers or retailers, and we refer to such a process as the “end response.” The sequence of affective states, along with the shopper's changing level and duration of interest, can also be interpreted in the context of the dynamics of the shopper behavior, because the emotional change at each stage of the shopping process conveys meaningful information about the shopper's response to a product. One of the additional novel features of this invention is to model the dynamics of a shopper's attitude toward a product, using a graphical Bayesian framework such as the Hidden Markov Model (HMM) to account for the uncertainties between the state transitions and the correlation between the internal states and the measured shopper responses.
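The translation from emotion categories to a trajectory in the three-dimensional affect space can be illustrated with a minimal sketch. It assumes that each frame yields a non-negative score per emotion category and that each category has a prototype point in (arousal, valence, stance) coordinates; the prototype values and the function names `affect_point` and `affect_trajectory` are hypothetical and are not taken from the specification.

```python
# Hypothetical prototype coordinates (arousal, valence, stance), each in [-1, 1].
# These values are illustrative assumptions only.
PROTOTYPES = {
    "happiness": (0.5, 0.8, 0.5),
    "sadness": (-0.5, -0.6, -0.3),
    "surprise": (0.8, 0.1, 0.0),
    "anger": (0.7, -0.7, 0.6),
    "disgust": (0.3, -0.8, -0.6),
    "fear": (0.8, -0.6, -0.8),
}


def affect_point(scores):
    """Blend the prototypes, weighted by normalized emotion scores,
    into one (arousal, valence, stance) point."""
    total = sum(scores.values())
    if total <= 0:
        return (0.0, 0.0, 0.0)  # no evidence: fall back to the neutral origin
    point = [0.0, 0.0, 0.0]
    for name, score in scores.items():
        proto = PROTOTYPES[name]
        for i in range(3):
            point[i] += (score / total) * proto[i]
    return tuple(point)


def affect_trajectory(frames):
    """Map a sequence of per-frame emotion scores to a trajectory in affect space."""
    return [affect_point(scores) for scores in frames]


if __name__ == "__main__":
    frames = [
        {"surprise": 0.7, "happiness": 0.3},
        {"happiness": 0.9, "surprise": 0.1},
    ]
    for arousal, valence, stance in affect_trajectory(frames):
        print(f"arousal={arousal:+.2f} valence={valence:+.2f} stance={stance:+.2f}")
```

In this sketch the valence coordinate of each point carries the positive or negative attitude, and the list of points is the trajectory that a later stage would consume.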
The mapping from the emotional changes to the end response can be estimated by training an HMM using many sample sequences of affective state and level of interest, along with the ground truth end response data. The HMM not only predicts the shopper's end response to the product, but can also decode the observed emotional changes to estimate the likely sequence of the shopper's attitude changes toward the product, called intermediate responses.
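The decoding of intermediate responses described above can be sketched with the standard Viterbi algorithm for a discrete HMM. This is a minimal illustration under stated assumptions: the three attitude states, the quantization of valence into three symbols, and all probability values are hypothetical and hand-set here, whereas the specification describes learning the model from labeled sample sequences. Prediction of the end response is not covered by this sketch.

```python
import math

# Hypothetical hidden attitude states and quantized valence observations.
STATES = ["neutral", "favorable", "unfavorable"]
SYMBOLS = {"neg": 0, "neu": 1, "pos": 2}

# Hand-set illustrative parameters; a real system would learn them from data.
START = [0.8, 0.1, 0.1]
TRANS = [
    [0.6, 0.2, 0.2],  # from neutral
    [0.1, 0.8, 0.1],  # from favorable
    [0.1, 0.1, 0.8],  # from unfavorable
]
EMIT = [
    [0.20, 0.60, 0.20],  # neutral emits neg, neu, pos
    [0.05, 0.25, 0.70],  # favorable
    [0.70, 0.25, 0.05],  # unfavorable
]


def viterbi(observations, start=START, trans=TRANS, emit=EMIT):
    """Return the most likely hidden state index sequence and its log probability."""
    if not observations:
        return [], 0.0
    n = len(start)
    # Log probability of the best path ending in each state after the first symbol.
    score = [math.log(start[s]) + math.log(emit[s][observations[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = score
        pointers = []
        score = []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(emit[s][obs]))
        back.append(pointers)
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def decode_attitudes(valence_labels):
    """Decode a sequence of 'neg'/'neu'/'pos' labels into attitude state names."""
    obs = [SYMBOLS[label] for label in valence_labels]
    path, log_prob = viterbi(obs)
    return [STATES[i] for i in path], log_prob


if __name__ == "__main__":
    states, log_prob = decode_attitudes(["neu", "neu", "pos", "pos", "pos"])
    print(states, round(log_prob, 3))
```

With these assumed parameters, a run of neutral observations followed by positive ones decodes to a shift from the neutral state to the favorable state, which is the kind of attitude sequence the text calls intermediate responses.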
The present invention also provides the demographic categories of the shopper based on the localized facial images from the face camera of the system. The demographic classifications can be carried out using learning machines trained from a large number of samples. The demographic categories—such as gender, age, and ethnicity—of the shopper provide valuable information so that the estimated shopper response can be analyzed in the context of demographic groups.
There have been prior attempts at automatically estimating the gaze direction or target of a human observer.
In U.S. Pat. No. 5,797,046 of Nagano, et al., the gaze direction is estimated based on the optical signal of the light reflected by the iris, and on the stored personal signature of the reflection. In U.S. Pat. No. 5,818,954 of Tomono, et al., the measured position of the iris relative to the measured facial coordinate is used to estimate the gaze. In U.S. Pat. No. 6,154,559 of Beardsley, the gaze target is recognized based on the measurement of the head pose and the correlation between known visual target and the head pose. In U.S. Pat. No. 6,246,779 of Fukui, et al., the gaze is estimated by comparing the measured facial image feature pattern against the stored facial image feature patterns. In U.S. Pat. No. 7,043,056 of Edwards, et al., the eye gaze direction is estimated by first determining the head pose angle and then locating the iris position relative to the eye region. The present invention employs basic ideas similar to the mentioned inventions: first estimate the head pose, and then locate the eye positions. The position of the irises against the localized eyes provides the data to estimate the gaze direction. However, we adopt a series of machine learning-based approaches to accurately and robustly estimate the gaze under realistic imaging conditions: a two-dimensional facial pose estimation followed by a three-dimensional head pose estimation (using the estimated two-dimensional pose), where both estimations utilize multiple learning machines. The facial features are also accurately localized based on the estimated global facial geometry, again using combinations of multiple learning machines, each of which takes part in localizing a specific facial feature. Each of these machine learning-based estimations of poses or locations utilizes a set of filters specifically designed to extract image features that are relevant to the given estimation.
Finally, the estimates of the iris location relative to the eye location, combined with the head pose estimate, are used to estimate the gaze direction.
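The combination of head pose and relative iris position can be illustrated with a deliberately simplified geometric sketch. It assumes a linear, additive model in which the iris offset from the eye center (normalized to the range -1 to 1) scales to an eye-in-head rotation and is added to the head yaw and pitch. The function names, the coordinate convention, and the `max_eye_rotation_deg` parameter are hypothetical; the specification describes learned estimators rather than this closed-form rule.

```python
import math


def _clamp(value, low=-1.0, high=1.0):
    """Limit a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def gaze_direction(head_yaw_deg, head_pitch_deg, iris_dx, iris_dy,
                   max_eye_rotation_deg=30.0):
    """Combine head pose with eye-in-head rotation into gaze (yaw, pitch) in degrees.

    iris_dx, iris_dy: iris-center offset from the eye center, normalized so that
    -1 and +1 are the extremes of the eye region (assumed convention).
    max_eye_rotation_deg: assumed eye rotation at a normalized offset of 1.
    """
    eye_yaw = max_eye_rotation_deg * _clamp(iris_dx)
    eye_pitch = max_eye_rotation_deg * _clamp(iris_dy)
    return head_yaw_deg + eye_yaw, head_pitch_deg + eye_pitch


def gaze_vector(yaw_deg, pitch_deg):
    """Convert gaze angles to a unit vector (x right, y up, z forward; assumed axes)."""
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    return (
        math.sin(yaw) * math.cos(pitch),
        math.sin(pitch),
        math.cos(yaw) * math.cos(pitch),
    )


if __name__ == "__main__":
    yaw, pitch = gaze_direction(10.0, -5.0, 0.5, 0.0)
    print(yaw, pitch, gaze_vector(yaw, pitch))
```

A unit vector of this kind could then be intersected with a known shelf plane to identify the product being looked at, although that step is outside this sketch.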
There have been prior attempts at automatically recognizing the visual target and the level of interest of a human observer.
U.S. Pat. No. 7,120,880 of Dryer, et al. proposes a system utilizing a host of measurement modalities, such as facial expression, head gesture, or speech, to assess the level of interest in media content; it proposes an overall system, without introducing a specific novel technical means to achieve the recognition of the response or affective information. The present invention introduces novel technology to automatically extract relevant information from the raw image data and recognize the internal (mental/emotional) state of the human. The present invention also uses learning machines such as neural networks, but the learning machines are trained to process feature vectors that are extracted from video images following novel and specific procedures.
There have been prior attempts at automatically recognizing the shopping behavior of retail customers.
In U.S. Pat. No. 6,659,344 of Otto, et al. (hereinafter Otto), the purchase behavior of retail customers and the purchased items (which have RFID tags) are recognized utilizing an RFID scanner attached to a shopping container. In U.S. Pat. No. 7,006,982 of Sorensen (hereinafter Sorensen), a wireless transmitter attached to the shopping cart or carried by the shopper is used to track the shopper's motion throughout the store. In U.S. Pat. No. 7,168,618 of Schwartz (hereinafter Schwartz), an image capture device is used to identify and track the items on the store shelf and in the shopping containers. In the present invention, as in Schwartz, at least one image capture device is strategically placed to capture the shopper's movement and the items on the shelf and in the shopping containers, unlike Otto and Sorensen, where wireless transmitters are attached either to the products or the shopper/shopping cart to track the shopping behavior. While Schwartz only introduces an overall method to recognize and track shopping items, the present invention adopts strategic camera positioning and specific image analysis algorithms to track not only the purchased items, but also the shoppers, to provide comprehensive shopping behavior data. In U.S. Prov. Pat. Appl. No. 60/877,953 of Sharma, et al. (hereinafter Sharma), a collection of computer vision-based technology is employed to recognize a customer's behavior and engagement with certain product categories in the retail environment. The present invention adopts approaches similar to Sharma to recognize a shopper's interaction with products and identify the group of products with which the shopper engages. In the present invention, a specific technical means is employed to recognize each of the incidents of engagement, interaction, and purchase.
Furthermore, these shopper interactions are measured for the purpose of analyzing the affective state and interest changes of the shopper in the context of these identified behavior segments.
There have been prior attempts at automatically recognizing the facial expression of a person using video images.
In U.S. Pat. No. 5,774,591 of Black, et al., the motions of the facial features due to expression are estimated by computing an explicit parametric model of optical flow. The facial feature motions are translated into mid-level predicates, which in turn are used to determine the expression categories. The proposed invention utilizes emotion-sensitive features that extract feature shape changes implicitly, and these features are then fed to a learning machine to estimate the facial muscle action. In U.S. Pat. No. 6,072,496 of Guenter, et al., the facial actions are estimated in terms of a very involved three-dimensional mesh model by tracking a set of dedicated marker points. The present invention strives to estimate the shape change of the facial features just enough to determine the facial muscle action, without using any artificial markers. U.S. Pat. No. 6,879,709 of Tian, et al. (hereinafter Tian-1) only aims to detect emotionless faces, while the present invention tries to estimate the change of expressions in a space representing the whole range of human emotions. In U.S. Pat. Appl. Pub. No. 20070265507 of de Lemos, mostly eye tracking estimates are used to assess the degree of attention and the location of attention within the visual stimulus. The present invention shares a similar goal of estimating human response in relation to a given visual stimulus, but introduces a different focus on the measurement of whole facial feature shapes to determine the emotional changes to a visual stimulus, with specific technical methods to estimate the facial actions, emotional changes, and finally the response. “Measuring facial expressions by computer image analysis,” Psychophysiology, vol. 36, issue 2, by Barlett, et al. (hereinafter Barlett) aims to estimate upper facial Action Units, utilizing the holistic, feature-based, and motion (flow)-based image representation and a neural network-based learning of the representation.
“Recognizing Action Units for Facial Expression Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, by Tian, et al. (hereinafter Tian-2) also estimates parametric models of facial feature shapes, and employs neural networks to learn the mapping to the facial Action Units. The present invention also estimates the facial Action Units in an exemplary embodiment of facial muscle actions, and utilizes a learning machine to find a mapping from the image representation to the muscle actions. However, the present invention utilizes an emotion-sensitive feature extraction scheme, which is different from Barlett or Tian-2. The present invention also utilizes a novel scheme to localize a face and its facial features, while in Barlett the faces are assumed to be aligned. In “Active and dynamic information fusion for facial expression understanding from image sequences,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, issue 5, by Zhang, et al., the dynamic change of facial expressions is recognized by a series of methods starting from IR-based eye detection, and facial feature detection based on the eye detection. The facial Action Unit recognition is based on deterministic correspondence. U.S. patent application Ser. No. 12/154,002 of Moon, et al. (hereinafter Moon) employs a novel combination of face detection, localization, and facial feature localization. The mapping from the facial feature shapes to the facial muscle actions is learned by training on a large number of samples, and the recognized facial muscle actions are translated to affective state. The emotional response is determined from analysis on the constructed sequence of affective state. The present invention adopts similar approaches for facial image processing and emotion recognition.
However, one of the novel features of the present invention is to utilize the shopper's target of interest and the shopper behavior measured from a body image sequence, so that the changes in affective state are segmented and analyzed in the context of shopper interaction. Unlike in Moon, the shopper's intermediate responses (the changes in attitude toward the product) are estimated using a graphical Bayesian framework, in addition to the end response to the product.
In summary, the present invention provides fully automatic face localization and facial feature localization approaches, for accurately extracting facial and transient features to estimate facial muscle actions due to emotion changes. For gaze estimation, we adopt a series of machine learning-based approaches to accurately and robustly estimate the gaze under realistic imaging conditions, without using specialized imaging devices and without requiring close-range images. The shopper's interaction with retail elements is identified based on the shopper's trajectory and body orientation, both measured automatically from an image sequence, without using special tracking hardware. The present invention shares with other rating approaches the goal of estimating a shopper's response to a given retail product, but it adopts a unique method to determine the end response and intermediate responses using a graphical Bayesian framework.