The present invention relates to data analysis techniques usable for identifying, in a population of communicating entities, a group of entities that can form a suitable target in view of their expected ability to influence other entities.
This kind of technique usually makes use of a social network, which is a data structure representing existing or past communication relationships between the entities of the population. An appropriate analysis of the social network can help detect influencers in the population, either to better understand the propagation of certain phenomena or to decide on certain actions, such as marketing campaigns, for which word-of-mouth propagation is desirable.
The literature on influencers has been growing very fast in the last ten years, with interest coming from many domains (sociology, marketing, political science, and social media for example). There is no real consensus yet on the definition of an influencer: from “an individual who exerts influence” to “a person who has a greater than average reach or impact through word of mouth in a relevant marketplace” (B. Fay, et al., “WOMMA Influencer Handbook—The Who, What, When, Where, How, and Why of Influencer Marketing”, Word of Mouth Marketing Association, 2010, http://womma.org/influencerhandbook/), definitions range from utter circularity to operational meaning. They usually include a reference to a social structure through which influence is propagated.
The two main issues described in the literature are about identifying influencers and then acting on influencers (for example, by orienting marketing activities to them rather than to the entire market).
Influencers were first defined by specific attributes discovered through standard market research techniques and organized in typical categories (for example, “media elite” or “socially connected”). Then, various methods were developed to rank-order entities so as to distinguish key influencers from those with less influence. These methods are mostly based upon centrality measures, which can be used to measure how influential an entity is. For example, C. Kiss et al. define structural measures of influence (degree centrality, closeness centrality, betweenness centrality, etc.) and link topological ranking measures (HITS, PageRank, SenderRank) in “Identification of Influencers—Measuring Influence in Customer Networks”, Decision Support Systems, Vol. 46, No. 1, Pages 233-253, December 2008. Other authors have used node position (for example k-shell in “Identifying influential spreaders in complex networks”, M. Kitsak, et al., Nature Physics, Vol. 6, No. 11, pp. 888-893, 2010) to identify influencers.
To evaluate the performance of these measures for ranking entities, most work has focused on analyzing the propagation of the information flow through the social network. Using ideas stemming from infection diffusion theory in epidemiology, one hypothesizes a propagation model which describes how one node infects its neighbors. The model is then used to measure how many people were “infected” by a given entity: it identifies the cascades of entities infected by the original one (J. Leskovec, et al., “The Dynamics of Viral Marketing”, ACM Transactions on the Web, Vol. 1, No. 1, Article 5, May 2007). Authors then proceed to estimate the parameters of the diffusion model, as for example in K. Saito et al., “Learning Diffusion Probability based on Node Attributes in Social Networks”, ISMIS 2011, pp. 153-162, 2011. The objective of selecting the best influencers is indeed to reach the largest possible number of entities, as illustrated in FIG. 1. Results have shown that the number of neighbors is not necessarily a good measure of influence (M. Cha, et al., “Measuring User Influence in Twitter: The Million Follower Fallacy”, Artificial Intelligence, 2010, pp. 10-17), and that the choice of the propagation model parameters changes the ranking of the various centrality measures. However, most authors claim that centrality measures do have predictive power, making it possible to rank-order entities and select influencers (D. M. Romero, et al., “Influence and Passivity in Social Media”, WWW 2011, Hyderabad, India, Mar. 28-Apr. 1, 2011).
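As a rough illustration of the kind of propagation model discussed above, the following sketch simulates an independent-cascade process, which is one common choice and is not prescribed by the cited works. The function names `simulate_cascade` and `expected_reach`, the single uniform infection probability `p`, and the toy graph are all assumptions made for the example.

```python
import random
from collections import deque

def simulate_cascade(out_neighbors, seed_node, p, rng):
    """Return the set of nodes infected by a cascade started at seed_node.

    Assumed model: each newly infected node gets one chance to infect each
    of its out-neighbors, independently, with the same probability p.
    """
    infected = {seed_node}
    frontier = deque([seed_node])
    while frontier:
        node = frontier.popleft()
        for neighbor in out_neighbors.get(node, ()):
            if neighbor not in infected and rng.random() < p:
                infected.add(neighbor)
                frontier.append(neighbor)
    return infected

def expected_reach(out_neighbors, seed_node, p, runs=1000, seed=0):
    """Estimate the average cascade size of seed_node over repeated runs."""
    rng = random.Random(seed)
    total = sum(
        len(simulate_cascade(out_neighbors, seed_node, p, rng))
        for _ in range(runs)
    )
    return total / runs

# Hypothetical toy network: node -> nodes it communicates to.
toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
```

Ranking the nodes by `expected_reach` for a given `p` is one way to compare an assumed propagation model with a centrality ranking; changing `p` can change that ranking.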
However, some authors consider that it is unrealistic to hope to identify influencers and that the “epidemics” analogy is very misleading. See “Viral marketing for the real world”, D. J. Watts, et al., Harvard Business Review, Issue May 2007, or “The Accidental Influentials”, D. J. Watts, Harvard Business Review, February, 2007, pp. 22-23.
The approach in the present proposal is based on the consideration that, instead of positing an a priori propagation model to identify the influencers and then estimating its parameters, it is more efficient, and more realistic, to build predictive models from the available data to predict the most probable influencers.
To introduce some notation, we consider a social network in the form of a graph $G(N, E)$ having nodes $N$ indexed by integers $i$ and edges or links $E$ between the nodes. An adjacency matrix or transition matrix of graph $G$ is defined as $A = (a_{ij})$, where $a_{ij}$ is the weight of the link from node $i$ to node $j$ ($a_{ij} = 0$ if there is no link from node $i$ to node $j$ in $G$). An unweighted transition matrix corresponds to the case where $a_{ij} = 1$ if there is a link from node $i$ to node $j$ and $a_{ij} = 0$ otherwise. Weighted transition matrices can, for example, be defined for a graph whose nodes represent communicating entities, where the $a_{ij}$ have amplitudes depending on factors such as the duration of communication from $i$ to $j$, the number of calls from $i$ to $j$, etc. The neighbors of a node $i$ in the graph can be grouped in different subsets, as illustrated in FIG. 2:
the “out-circle” $OC_i$ of node $i$ is the set of nodes of $G$ linking out of $i$, that is $OC_i = \{j : a_{ij} \neq 0\}$;
the “in-circle” $IC_i$ of node $i$ is the set of nodes of $G$ linking into $i$: $IC_i = \{j : a_{ji} \neq 0\}$;
the “circle” $C_i$ of node $i$ is the set of all the nodes of $G$ linked to $i$: $C_i = OC_i \cup IC_i = \{j : a_{ij} \neq 0 \text{ or } a_{ji} \neq 0\}$.
If the links in the graph are not directed, i.e. if they represent communication between nodes regardless of the direction of communication, the in-circle and out-circle cannot be distinguished. In this case $a_{ij} = a_{ji}$ and the circle of a node $i$ can be defined as $C_i = \{j : a_{ij} \neq 0\}$.
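The circle definitions above can be sketched directly from a weighted transition matrix. In this minimal example the matrix is a list of lists indexed from 0, and the function names and the toy weights (imagined as numbers of calls) are assumptions introduced for illustration.

```python
def out_circle(A, i):
    """Nodes j with a link from i to j, i.e. a_ij != 0."""
    return {j for j in range(len(A)) if A[i][j] != 0}

def in_circle(A, i):
    """Nodes j with a link from j to i, i.e. a_ji != 0."""
    return {j for j in range(len(A)) if A[j][i] != 0}

def circle(A, i):
    """All nodes linked to i in either direction."""
    return out_circle(A, i) | in_circle(A, i)

# Hypothetical weighted transition matrix: A[i][j] = weight of link i -> j.
A = [
    [0, 2, 0, 1],
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
```

For node 0 of this toy matrix, the out-circle is {1, 3}, the in-circle is {2}, and the circle is their union {1, 2, 3}.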
Examples of conventional structural centrality measures include:
degree centrality, $\mathrm{Degree}(i)$, that is the number of nodes in the circle: $\mathrm{Degree}(i) = \mathrm{Card}(C_i)$;
weighted degree centrality:
$$\mathrm{w\_Degree}(i) = \sum_{j \neq i} (a_{ij} + a_{ji});$$
in-degree centrality, $\mathrm{InDegree}(i)$, that is the number of nodes in the in-circle: $\mathrm{InDegree}(i) = \mathrm{Card}(IC_i)$;
weighted in-degree centrality:
$$\mathrm{w\_InDegree}(i) = \sum_{j \neq i} a_{ji};$$
out-degree centrality, $\mathrm{OutDegree}(i)$, that is the number of nodes in the out-circle: $\mathrm{OutDegree}(i) = \mathrm{Card}(OC_i)$;
weighted out-degree centrality:
$$\mathrm{w\_OutDegree}(i) = \sum_{j \neq i} a_{ij};$$
clustering coefficient, $CC(i)$, which measures how much more likely two neighbors of $i$ are to be connected than two random nodes. It is computed as
$$CC(i) = \frac{2 \times \mathrm{Nb\_Tr}(i)}{\mathrm{Degree}(i) \times (\mathrm{Degree}(i) - 1)}$$
from the degree centrality $\mathrm{Degree}(i)$ and the number $\mathrm{Nb\_Tr}(i)$ of triangles in the graph having node $i$ as a vertex: $\mathrm{Nb\_Tr}(i) = \mathrm{Card}(\{(j, l) \in C_i \times C_i;\ j \neq l \mid a_{jl} \neq 0\})$;
betweenness centrality, $CB(i)$, which measures the extent to which a node lies between many other nodes:
$$CB(i) = \sum_{\substack{j \neq i \\ l \neq j,\, i}} \frac{g_{jl}(i)}{g_{jl}},$$
where the length of a path between two nodes is the number of edges in the path, $g_{jl}$ is the number of shortest paths (geodesics) from node $j$ to node $l$, and $g_{jl}(i)$ is the number of those shortest paths going through node $i$.
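A minimal sketch of the degree-based measures and the clustering coefficient listed above, computed from a list-of-lists transition matrix. Several choices here are assumptions rather than part of the text: the function names, the toy matrices, taking the weighted in-degree as the sum of the weights of links into the node (matching the in-circle definition), counting each triangle once per unordered pair of connected neighbors, and returning 0.0 for nodes with fewer than two neighbors.

```python
def circle(A, i):
    """Nodes linked to i in either direction."""
    return {j for j in range(len(A)) if A[i][j] != 0 or A[j][i] != 0}

def degree(A, i):
    return len(circle(A, i))

def in_degree(A, i):
    return sum(1 for j in range(len(A)) if A[j][i] != 0)

def out_degree(A, i):
    return sum(1 for j in range(len(A)) if A[i][j] != 0)

def w_degree(A, i):
    return sum(A[i][j] + A[j][i] for j in range(len(A)) if j != i)

def w_in_degree(A, i):
    # Assumption: weights of links INTO i, i.e. the a_ji terms.
    return sum(A[j][i] for j in range(len(A)) if j != i)

def w_out_degree(A, i):
    return sum(A[i][j] for j in range(len(A)) if j != i)

def clustering_coefficient(A, i):
    neighbors = sorted(circle(A, i))
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention when no pair of neighbors exists
    # Count each connected pair of neighbors once (one triangle per pair).
    triangles = sum(
        1
        for pos, j in enumerate(neighbors)
        for l in neighbors[pos + 1:]
        if A[j][l] != 0 or A[l][j] != 0
    )
    return 2 * triangles / (k * (k - 1))

# Hypothetical directed, weighted matrix (e.g. numbers of calls).
CALLS = [
    [0, 2, 0, 1],
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Hypothetical undirected, unweighted matrix (symmetric).
FRIENDS = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
```

In `FRIENDS`, node 0 has three neighbors of which only one pair (1, 2) is connected, so its clustering coefficient is 2 × 1 / (3 × 2) = 1/3, while node 1, whose two neighbors are connected, gets 1.0.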
While degree centralities are easy to compute, more sophisticated measures can hardly be computed on large networks. For example, betweenness centrality scales as $n^2$ ($n$ being the number of nodes in the graph), which makes it impractical for large networks. Many more measures exist with the same problem of non-scalability.
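To make the cost of betweenness centrality concrete, the sketch below computes it for one node by brute force: a breadth-first search from every node counts shortest paths, then every ordered pair of other nodes is examined. This is an illustrative, unoptimized approach on an unweighted graph, not a method taken from the text; the names `bfs_counts` and `betweenness` and the adjacency-list representation are assumptions.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): shortest-path lengths from source and the
    number of distinct shortest paths to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, i):
    """Sum, over ordered pairs (j, l) of other nodes, of the fraction of
    shortest j -> l paths that pass through i."""
    info = {s: bfs_counts(adj, s) for s in adj}  # one BFS per node
    dist_i, sigma_i = info[i]
    total = 0.0
    for j in adj:
        if j == i:
            continue
        dist_j, sigma_j = info[j]
        if i not in dist_j:
            continue
        for l in adj:
            if l == i or l == j or l not in dist_j or l not in dist_i:
                continue
            # i lies on a shortest j -> l path iff the distances add up.
            if dist_j[i] + dist_i[l] == dist_j[l]:
                total += sigma_j[i] * sigma_i[l] / sigma_j[l]
    return total

# Hypothetical undirected toy graphs as adjacency lists.
path = {0: [1], 1: [0, 2], 2: [1]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

Because the sum runs over ordered pairs, each unordered pair of an undirected graph is counted twice. The double loop over pairs, on top of one search per node, is what makes this kind of measure costly on large networks.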
Structural centrality measures do not take into account the specific behavior for which influence is being analyzed. With structural centrality measures, if a node is an influencer for a behavior A, it is also an influencer for another behavior B.
On the Web, influence is referred to as popularity. Some web pages are very popular. An algorithm used by search engines to identify popular pages is known under the trademark PageRank. It is based on the consideration that a page is popular if the pages linking into it (i.e. the pages in its in-circle) are popular. PageRank centrality $CPR(i)$ is computed iteratively as
$$CPR(i) = (1 - d) + d \times \sum_{j \in IC_i} \frac{CPR(j)}{\mathrm{OutDegree}(j)},$$
$1 - d$ being the probability that, at each page, a user requests a random page ($d = 0.85$ usually). PageRank only takes into account incoming links. Approximation by the in-degree centrality is generally accurate.
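The PageRank recursion above can be sketched as a simple fixed-point iteration. This is a minimal illustration, not a production implementation: the function name `pagerank`, the fixed iteration count, the initial value of 1.0, and the toy graphs are assumptions, and every link target is assumed to appear as a key of the input dictionary.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate CPR(i) = (1 - d) + d * sum_{j in IC_i} CPR(j) / OutDegree(j).

    out_links maps each node to the list of distinct nodes it links to.
    """
    nodes = list(out_links)
    # Build the in-circle of each node from the outgoing links.
    in_circle = {i: [] for i in nodes}
    for j, targets in out_links.items():
        for i in targets:
            in_circle[i].append(j)
    cpr = {i: 1.0 for i in nodes}  # assumed starting value
    for _ in range(iterations):
        cpr = {
            i: (1 - d) + d * sum(cpr[j] / len(out_links[j]) for j in in_circle[i])
            for i in nodes
        }
    return cpr

# Hypothetical toy graphs.
cycle = {0: [1], 1: [2], 2: [0]}          # every page equally popular
hub = {0: [], 1: [0], 2: [0], 3: [0]}     # three pages all link into page 0
```

On `hub`, the three linking pages each keep the baseline value 1 - d = 0.15, while page 0 accumulates 0.15 + 0.85 × 3 × 0.15 = 0.5325, which agrees with the ranking its in-degree alone would give.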
Symmetrically, SenderRank centrality, $CSR(i)$, can be defined as the equivalent of PageRank centrality for outgoing links. The influence of a node $i$ then depends on the influence of the nodes it links into, i.e. of its out-circle:
$$CSR(i) = (1 - d') + d' \times \sum_{j \in OC_i} \frac{CSR(j)}{\mathrm{OutDegree}(j)},$$
$1 - d'$ being the probability that a node will transfer to a random node. Computation of $CSR(i)$ is iterative and converges in a few iterations (as for PageRank). It can be approximated by the out-degree centrality.
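The SenderRank recursion can be sketched in the same way, transcribing the formula as stated, including its OutDegree(j) denominator. The function name `senderrank`, the default value of `d_prime`, the fixed iteration count, and the toy graphs are assumptions. Terms for out-circle nodes with no outgoing links are skipped, which is a convention chosen here only to avoid division by zero; the text does not say how such nodes are handled, and convergence on arbitrary graphs is not checked.

```python
def senderrank(out_links, d_prime=0.85, iterations=50):
    """Iterate CSR(i) = (1 - d') + d' * sum_{j in OC_i} CSR(j) / OutDegree(j).

    out_links maps each node to the list of distinct nodes it links to.
    """
    nodes = list(out_links)
    csr = {i: 1.0 for i in nodes}  # assumed starting value
    for _ in range(iterations):
        csr = {
            i: (1 - d_prime) + d_prime * sum(
                csr[j] / len(out_links[j])
                for j in out_links[i]
                if out_links[j]  # assumption: skip nodes with no out-links
            )
            for i in nodes
        }
    return csr

# Hypothetical toy graphs.
cycle = {0: [1], 1: [2], 2: [0]}   # every node sends to exactly one node
chain = {0: [1], 1: [2], 2: []}    # node 2 never sends anything
```

On `chain`, nodes 1 and 2 stay at the baseline 0.15 and node 0 reaches 0.15 + 0.85 × 0.15 = 0.2775 under the skipping convention above.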
PageRank and SenderRank are based on the link topology of the network. In this regard, they are still structural measures which cannot take into account a specific behavior.
In certain cases, attributes of the nodes (e.g. demographics, customer care history, account history, etc.) can be taken into consideration in the identification of influencers in combination with a social network representation (see, e.g. US 2009/0062354 A1).
There is a need for an efficient method of analyzing social network data and past behavioral data in view of determining a target of communicating entities that is designed with respect to a specific behavior.