The mainstream adoption of the Internet and the Web has changed the physics of information diffusion. Until a few years ago, the major barrier for someone who wanted a piece of information to spread through a community was the cost of the technical infrastructure required to reach a large number of people. Today, with widespread access to the Internet, this bottleneck has largely been removed. In this context, personal publishing modalities such as weblogs have become prevalent. Weblogs, or “blogs,” are personal online diaries managed by easy-to-use software packages that allow single-click publishing of daily entries. The contents are observations and discussions ranging from the mainstream to the startlingly personal. There are several million weblogs in existence today. The weblogs and the linkages between them are referred to as “blogspace”.
Unlike earlier mechanisms for spreading information at the grassroots level, weblogs are open to frequent widespread observation, and thus offer an inexpensive opportunity to capture large volumes of information flows at the individual level. Furthermore, weblogs can be analyzed in the context of current affairs due to recent electronic publication standards that allow gathering of dated news articles from sources such as Reuters and the AP Newswire. Such sources have enormous influence on the content of weblogs.
Weblogs typically manifest significant interlinking, both within entries and in boilerplate matter used to situate the weblog in a neighborhood of other weblogs that participate in the same distributed conversation. One conventional approach to analyzing information flow in blogspace analyzes the “burstiness” of blogs, capturing bursts of activity within blog communities based on an analysis of the evolving link structure. Reference is made to R. Kumar, et al., “On the bursty evolution of blogspace”, In Proc. WWW, 2003.
Much previous research investigating the flow of information through networks has been based upon the analogy between the spread of disease and the spread of information in networks. This analogy brings centuries of study of epidemiology to bear on questions of information diffusion. Reference is made to N. Bailey, “The Mathematical Theory of Infectious Diseases and its Applications”, Griffin, London, 2nd edition, 1975. Classical disease-propagation models in epidemiology are based upon the cycle of disease in a host. A person is first susceptible (S) to the disease. If then exposed to the disease by an infectious contact, the person becomes infected (I) (and infectious) with some probability. The disease then runs its course in that host, who is subsequently recovered (R) (or removed, depending on the virulence of the disease).
A recovered individual is immune to the disease for some period of time, but the immunity may eventually wear off. The SIR model describes diseases in which recovered hosts are never again susceptible, as with a disease conferring lifetime immunity, such as chicken pox. The SIR model also describes a highly virulent disease from which the host does not recover. The SIRS model describes a situation in which a recovered host eventually becomes susceptible again, as with influenza.
In blogspace, the SIRS model can be applied as follows: a blogger who has not yet written about a topic is exposed to the topic by reading the blog of a friend. She decides to write about the topic, becoming infected. The topic may then spread to readers of her blog. Later, she may revisit the topic from a different perspective, and write about it again.
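By way of illustration only, the SIR and SIRS cycles described above can be sketched as a discrete-time simulation over a network of bloggers. The function name `simulate_sirs`, the parameters `p_infect` and `p_lose_immunity`, and the one-step infectious period are assumptions made for this sketch and are not taken from the cited references. Setting `p_lose_immunity` to zero yields SIR behavior.

```python
import random

def simulate_sirs(neighbors, seeds, p_infect, p_lose_immunity, steps, rng=None):
    """Discrete-time SIRS sketch. neighbors maps each node to the nodes it reads.

    States: "S" susceptible, "I" infected (writing about the topic), "R" recovered.
    Assumption: a node stays infected for exactly one step.
    """
    rng = rng or random.Random(0)
    state = {node: "S" for node in neighbors}
    for s in seeds:
        state[s] = "I"
    history = [dict(state)]
    for _ in range(steps):
        nxt = dict(state)
        for node, st in state.items():
            if st == "S":
                # Each infectious contact has an independent chance to infect.
                for nb in neighbors[node]:
                    if state[nb] == "I" and rng.random() < p_infect:
                        nxt[node] = "I"
                        break
            elif st == "I":
                nxt[node] = "R"
            elif rng.random() < p_lose_immunity:
                # Immunity wears off (SIRS); never happens when the rate is 0 (SIR).
                nxt[node] = "S"
        state = nxt
        history.append(dict(state))
    return history

# Usage: a chain of three blogs, with the topic starting at blog "a".
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
sir_history = simulate_sirs(chain, ["a"], p_infect=1.0, p_lose_immunity=0.0, steps=3)
```

With certain transmission and no loss of immunity, the topic moves down the chain one blog per step and every blog ends in the recovered state.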
One conventional approach to propagation of infectious diseases studies an SIR model with mutation, in which a node u is immune to any strain of the disease that is sufficiently close to a strain with which u was previously infected. Reference is made to M. Girvan, et al., “A simple model of epidemics with pathogen mutation”, Phys. Rev. E, 65(031915), 2002. This approach observes that for certain parameters it is possible to generate periodic outbreaks in which the disease oscillates between periods of epidemic outbreak and periods of calm while it mutates into a new form. In blogspace, an analogous mutation can be imagined when, for example, discussion of a movie star turns into discussion of the same person as a political figure.
Early studies of propagation took place on “fully mixed” or “homogeneous” networks in which contacts of a node are chosen randomly from the entire network. Recent work, however, focuses on more realistic models based on social networks. In a model of small-world networks, one conventional approach to propagation of infectious diseases calculates the minimum transmission probability for which a disease can spread from one seed node to infect a constant fraction of the entire network (known as the epidemic threshold). Reference is made to C. Moore, et al., “Epidemics and percolation in small-world networks”, Phys. Rev. E, 61:5678-5682, 2000, cond-mat/9911492; and D. Watts, et al., “Collective dynamics of ‘small-world’ networks”, Nature, 393:440-442, 1998.
One conventional approach models epidemic spreading on networks whose degree distribution follows a power law, in which the probability that the degree of a node is k is proportional to $k^{-\alpha}$, for a constant $\alpha$ typically between 2 and 3. Many real-world networks have the power law property [reference is made to M. Mitzenmacher, “A brief history of lognormal and power law distributions”, In Allerton Comm. Control Comput., 2001], including a social network defined by blog-to-blog links [reference is made to R. Kumar, et al., “On the bursty evolution of blogspace”, In Proc. WWW, 2003]. Another conventional approach analyzes an SIS model of computer virus propagation (in which an infected host returns directly to the susceptible state) in power-law networks, showing that (in stark contrast to random or regular networks) the epidemic threshold is zero, so an epidemic always occurs. Reference is made to R. Pastor-Satorras, et al., “Epidemic spreading in scale-free networks”, Phys. Rev. Letters, 86(14): 3200-3203, April 2001.
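As a minimal sketch of the power-law degree distribution described above, the following computes a normalized probability for each degree k. The function name `power_law_pmf` and the truncation of the distribution at a maximum degree `k_max` are assumptions made so that the example is finite and runnable.

```python
def power_law_pmf(alpha, k_max):
    """Return {k: P(degree = k)} with P(k) proportional to k**(-alpha), k = 1..k_max."""
    weights = {k: k ** (-alpha) for k in range(1, k_max + 1)}
    total = sum(weights.values())  # normalizing constant
    return {k: w / total for k, w in weights.items()}

# Usage: an exponent within the typical range of 2 to 3.
pmf = power_law_pmf(alpha=2.5, k_max=1000)
```

Under this distribution, a node of degree 1 is more likely than a node of degree 2 by a factor of 2 raised to the power alpha, which reflects the heavy concentration of probability on low-degree nodes.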
These results of analyses of propagation in power-law networks can be interpreted in terms of the robustness of the network to random edge failure. Suppose that each edge in the network is deleted independently with probability $(1-\varepsilon)$. The network is considered “robust” if most of the nodes are still connected. Nodes that remain in the same component as some initiator $v_0$ after the edge deletion process are exactly the same nodes that $v_0$ infects according to the disease transmission model above. The use of viral propagation through power law networks has been considered from the perspective of error tolerance of networks such as the Internet to determine the behavior of the network if a random $(1-\varepsilon)$ fraction of the links in the Internet fail. Many researchers have observed that power-law networks exhibit extremely high error tolerance. Reference is made to R. Albert, et al., “Error and attack tolerance of complex networks”, Nature, 406, July 2000; and B. Bollobás, et al., “Robustness and vulnerability of scale-free random graphs”, Internet Mathematics, 1(1), 2003.
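The edge deletion process described above can be sketched as follows, purely for illustration. Each undirected edge is kept with probability `eps` (that is, deleted with probability 1 minus `eps`), and the nodes still reachable from the initiator are returned. The function name `surviving_component` and the edge-list input format are assumptions of this sketch.

```python
import random
from collections import deque

def surviving_component(edges, v0, eps, rng=None):
    """Delete each undirected edge independently with probability 1 - eps,
    then return the set of nodes still connected to the initiator v0."""
    rng = rng or random.Random(0)
    kept = [e for e in edges if rng.random() < eps]
    adj = {}
    for u, v in kept:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search over the surviving edges.
    seen = {v0}
    queue = deque([v0])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Usage: with eps = 1.0 no edge is deleted, so node 0 reaches its whole component.
reached = surviving_component([(0, 1), (1, 2), (3, 4)], v0=0, eps=1.0)
```

The size of the returned set, taken as a fraction of all nodes, gives one simple measure of how robust the network is for a given value of `eps`.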
In blogspace, however, many topics propagate without becoming epidemics, so such a model would be inappropriate. One refinement uses a more accurate model of power-law networks, demonstrating a non-zero epidemic threshold under the SIS model in power-law networks produced by a certain generative model that takes into account the high “clustering coefficient” found in real social networks. Reference is made to V. Eguíluz, et al., “Epidemic threshold in structured scale-free networks”, Physical Review Letters, 89, 2002. cond-mat/0205439 and D. Watts, et al., “Collective dynamics of “small-world” networks”, Nature, 393:440-442, 1998. The clustering coefficient is the probability that two neighbors of a node are themselves neighbors.
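For illustration, the clustering coefficient defined above can be computed per node as the fraction of pairs of its neighbors that are themselves linked. The function names and the adjacency-set representation are assumptions of this sketch, as is the convention that a node with fewer than two neighbors has a coefficient of zero.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves neighbors.
    adj maps each node to the set of its neighbors (undirected, symmetric)."""
    nbrs = list(adj[node])
    if len(nbrs) < 2:
        return 0.0  # assumed convention: no neighbor pairs to compare
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Usage: a triangle (0, 1, 2) with one extra node 3 attached to node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
coefficient = local_clustering(graph, 0)
```

In this example, node 0 has three pairs of neighbors, of which only the pair (1, 2) is linked.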
Another refinement modifies the transmission model by considering the flow of information through real and synthetic email networks under a model in which the probability of infection decays as the distance to the initiator $v_0$ increases. Reference is made to F. Wu, et al., “Information flow in social groups”, Manuscript, 2003. Meme outbreaks under this model are typically limited in scope, following the behavior of real data. A meme is an idea or a topic that spreads much like a virus through a population. The simulated spread of email viruses has been studied empirically on the network defined by the email address books of a user community. Reference is made to M. Newman, et al., “Email networks and the spread of computer viruses”, Phys. Rev. E, 66(035101), 2002. A further refinement calculates the properties of disease outbreaks, including the distribution of outbreak sizes and the epidemic threshold, for an SIR model of disease propagation. Reference is made to M. Newman, “The spread of epidemic disease on networks”, Phys. Rev. E, 66(016128), 2002.
The spread of a piece of information through a social network can also be viewed as the propagation of an innovation through the social network. For example, the URL of a website that provides a new, valuable service is such a piece of information. In the field of sociology, there has been extensive study of the “diffusion of innovation” in social networks, examining the role of “word of mouth” in spreading innovations. At a particular point in time, some nodes in the network have adopted the innovation, and others have not.
Two fundamental models for the process by which nodes adopt new ideas have been considered in the literature: threshold models and cascade models. In a threshold model, each node u in the network chooses a threshold $t_u \in [0, 1]$, typically drawn from some probability distribution. Reference is made to M. Granovetter, “Threshold models of collective behavior”, American Journal of Sociology, 83(6): 1420-1443, 1978. Every neighboring node v of node u has a nonnegative connection weight $w_{u,v}$ so that

$$\sum_{v \in \Gamma(u)} w_{u,v} \le 1$$

and node u adopts if and only if

$$t_u \le \sum_{\text{adopters } v \in \Gamma(u)} w_{u,v}$$

where $\Gamma(u)$ denotes the set of neighbors of node u.
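The adoption rule of the threshold model can be sketched as follows, by way of example only. The function name `run_threshold_model` and the nested-dictionary layout, in which `weights[u][v]` holds the weight of neighbor v on node u, are assumptions of this sketch. Thresholds are passed in explicitly rather than drawn from a distribution so that the result is reproducible.

```python
def run_threshold_model(weights, thresholds, initial_adopters):
    """Repeat the adoption rule until no further node adopts.

    weights[u][v] is the connection weight w_{u,v} of neighbor v on node u;
    the weights on each node u are assumed to sum to at most 1.
    Node u adopts once thresholds[u] <= total weight of its adopting neighbors.
    """
    adopted = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for u, incoming in weights.items():
            if u in adopted:
                continue
            influence = sum(w for v, w in incoming.items() if v in adopted)
            if thresholds[u] <= influence:
                adopted.add(u)
                changed = True
    # Adoption is monotone, so the final set does not depend on visiting order.
    return adopted

# Usage: "a" adopts first; "b" and then "c" follow, while "d" never reaches its threshold.
w = {"b": {"a": 0.5}, "c": {"a": 0.25, "b": 0.25}, "d": {"c": 0.25}}
t = {"b": 0.5, "c": 0.5, "d": 0.5}
final_adopters = run_threshold_model(w, t, {"a"})
```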
In a cascade model, whenever a node $v \in \Gamma(u)$ that is a social contact of a node u adopts, then node u adopts with some probability $p_{v,u}$. Reference is made to J. Goldenberg, et al., “Talk of the network: A complex systems look at the underlying process of word-of-mouth”, Marketing Letters, 12(3): 211-223, 2001. In other words, every time a node (person) close to a node u such as node v adopts, there is a chance that node u decides to “follow” node v and adopt as well.
One approach utilizes an “independent cascade model” with a given set of N nodes, some of which have already adopted. Reference is made to J. Goldenberg, et al., “Talk of the network: A complex systems look at the underlying process of word-of-mouth”, Marketing Letters, 12(3): 211-223, 2001. In the initial state, some non-empty set of nodes is “activated.” At each successive step, some (possibly empty) set of nodes becomes activated. The episode is considered over when no new activations occur. The nodes are connected in a directed graph with each edge (u, v) labeled with a probability $p_{u,v}$. When node u is activated in step t, each node v that has an arc (u, v) is activated with probability $p_{u,v}$. This influence is independent of the history of all other node activations. Further, if v is not activated in that time step, then u never activates v.
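A minimal sketch of one episode of the independent cascade model follows, for illustration only. The function name `independent_cascade` and the layout in which `prob[u][v]` holds the probability on arc (u, v) are assumptions of this sketch.

```python
import random

def independent_cascade(prob, seeds, rng=None):
    """Run one episode and return the list of node sets activated at each step.

    prob[u][v] is the probability on arc (u, v). A newly activated node u gets
    exactly one attempt on each inactive v; if it fails, u never activates v.
    """
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = set(seeds)  # nodes activated in the previous step
    steps = [set(seeds)]
    while frontier:
        newly = set()
        for u in frontier:
            for v, p in prob.get(u, {}).items():
                if v not in active and v not in newly and rng.random() < p:
                    newly.add(v)
        active |= newly
        if newly:
            steps.append(newly)
        frontier = newly  # the episode ends when no new activations occur
    return steps

# Usage: arc a->b always succeeds, arc a->c never does, and b then activates d.
arcs = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}}
episode = independent_cascade(arcs, {"a"})
```

Because only the nodes activated in the previous step are allowed to make attempts, each arc is tried at most once per episode, which matches the single-chance rule stated above.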
A “general cascade model” generalizes the independent cascade model and simultaneously generalizes the threshold models described above by discarding the independence assumption. Reference is made to D. Kempe, et al., “Maximizing the spread of influence through a social network”, In Proc. KDD, 2003. The general cascade model addresses a related problem on social networks with a marketing motivation: assuming that innovations propagate according to such a model, and given a number k, find the set $S_k^*$ of k “seed” nodes that maximizes the expected number of adopters of the innovation if the nodes in $S_k^*$ adopt initially. One can then give free samples of a product to the nodes in $S_k^*$, for example.
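The seed selection problem stated above can be illustrated with a brute-force sketch that tries every set of k nodes and estimates the expected number of adopters by repeated simulation. This exhaustive search is a stand-in chosen for clarity and is practical only for very small graphs; it is not presented as the method of the cited reference. The function names, the use of an independent cascade as the propagation model, and the `trials` parameter are all assumptions.

```python
import random
from itertools import combinations

def cascade_size(prob, seeds, rng):
    """Number of nodes activated in one independent cascade episode."""
    active, frontier = set(seeds), set(seeds)
    while frontier:
        newly = set()
        for u in frontier:
            for v, p in prob.get(u, {}).items():
                if v not in active and v not in newly and rng.random() < p:
                    newly.add(v)
        active |= newly
        frontier = newly
    return len(active)

def best_seed_set(prob, nodes, k, trials=200, seed=0):
    """Exhaustively search all k-node seed sets for the largest estimated spread."""
    rng = random.Random(seed)
    best, best_spread = None, -1.0
    for candidate in combinations(sorted(nodes), k):
        # Monte Carlo estimate of the expected number of adopters.
        spread = sum(cascade_size(prob, candidate, rng) for _ in range(trials)) / trials
        if spread > best_spread:
            best, best_spread = set(candidate), spread
    return best, best_spread

# Usage: two separate groups; seeding "a" and "d" reaches all five nodes.
arcs = {"a": {"b": 1.0, "c": 1.0}, "d": {"e": 1.0}}
seeds, spread = best_seed_set(arcs, ["a", "b", "c", "d", "e"], k=2)
```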
The propagation of information through a social network has also been studied from a game-theoretic perspective, in which one postulates an increase in utility for players who adopt the new innovation or learn the new information if enough of their friends have also adopted. For example, each player chooses whether to switch from videotape to DVDs; a person with friends who have made the same choice can benefit by borrowing movies. In blogspace, sharing discussion of a new and interesting topic with others in one's immediate social circle may bring pleasure or even increased status.
One game-theoretic approach considers a setting such as the following coordination game: in every time step, each node in a social network chooses a type from {0, 1}. Players of type 1 have adopted the meme. Each player i receives a positive payoff for each of its neighbors that has the same type as i, in addition to an intrinsic benefit that i derives from its type. Further, each player may have a distinct utility for adopting, depending on the player's inherent interest in the topic. Suppose that all but a small number of players initially have type 0. This game-theoretic approach explores the question of whether players of type 1 can “take over” the graph if every node i switches to a given type with a probability that increases with the number of neighbors of i having that type.
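The payoff structure of the coordination game can be sketched as follows, for illustration only. The sketch uses a deterministic best response in place of the probabilistic switching rule, which is a simplification. The function names, the single `match_payoff` value shared by all neighbors, and the rule that ties resolve to type 0 are assumptions.

```python
def payoff(player, chosen_type, types, neighbors, match_payoff, intrinsic):
    """Payoff to `player` for choosing `chosen_type`: a fixed positive amount per
    neighbor of the same type, plus the player's intrinsic benefit for that type."""
    matches = sum(1 for nb in neighbors[player] if types[nb] == chosen_type)
    return match_payoff * matches + intrinsic[player][chosen_type]

def best_response(player, types, neighbors, match_payoff, intrinsic):
    """Type (0 or 1) that maximizes the player's payoff; ties go to type 0."""
    p0 = payoff(player, 0, types, neighbors, match_payoff, intrinsic)
    p1 = payoff(player, 1, types, neighbors, match_payoff, intrinsic)
    return 1 if p1 > p0 else 0

# Usage: player "x" has two adopting neighbors and a mild interest in the topic.
nbrs = {"x": ["y", "z", "w"]}
current = {"x": 0, "y": 1, "z": 1, "w": 0}
interest = {"x": {0: 0.0, 1: 0.5}}
choice = best_response("x", current, nbrs, match_payoff=1.0, intrinsic=interest)
```

In this example, adopting pays 2.5 (two matching neighbors plus an intrinsic benefit of 0.5) against 1.0 for not adopting, so player "x" adopts.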
There has also been work in the economics community on models of the growth of social networks when an agent u can selfishly decide to form a link with another agent v, who may have information that agent u desires to learn. There is a cost borne by agent u to establish such a link, and a profit for the information that agent u learns through this link. This approach explores properties of the social network that forms under this scenario. Reference is made to V. Bala, et al., “A strategic analysis of network reliability”, Review of Economic Design, 5:205-228, 2000; and H. Haller, et al., “Nash networks with heterogeneous agents”, Working Paper Series E-2001-1, Virginia Tech, 2003.
Although the conventional technologies, analyses, and approaches to modeling transmission of information presented thus far have proven to be useful, it would be desirable to present additional improvements. Many models have been proposed to capture the methods by which the spread of infectious diseases and the spread of memes occur. Epidemiologists proceed in tracing the spread of a disease by interviewing individuals and finding reasons to believe that one person may have had contact with another.
A fundamental need for the determination of propagation of information through a network is the ability to discern topics within the information. The literature around detection and tracking of topics has focused on topics as monolithic structures that may migrate slowly from one focus to another. Study of dialogue, on the other hand, has focused on the structure of the dialogue rather than the evolution of the topics. However, discussions in weblogs have been shown to typically comprise both ongoing discussion of broad topics and “spikes”. The broad topics comprise low-level chatter on aspects of the topic of particular interest to the participants in a conversation. The spikes are peaks in discussion regarding particular subtopics that have recently emerged in the media such as, for example, in a product announcement or news story. There are no known solutions for automatically extracting this structure from large-scale textual databases.
What is therefore needed is a system, a service, a computer program product, and an associated method for analyzing communication between parties to identify topics and the patterns into which those topics fall. The need for such a solution has heretofore remained unsatisfied.