Very large-scale datasets and graphs are ubiquitous in today's world: the World Wide Web, online social networks, the huge search logs and query-click logs regularly collected and processed by search engines, and so forth. Because of the massive scale of these datasets, analyzing and computing on them is infeasible for individual machines. Therefore, there is a growing need for distributed ways of storing and processing these datasets. MapReduce, a simple model of computation, has recently emerged as a very attractive way of doing such analyses. Its effectiveness and simplicity have resulted in its implementation by different Internet companies and in its widespread adoption for a wide range of applications, including large-scale graph computations.
One of the best-known graph computation problems is computing personalized page ranks (PPR). Personalized page ranks (and other personalized random walk based measures) have proved to be very effective in a variety of applications, such as link prediction and friend recommendation in social networks, and there are many algorithms designed to approximate them in different computational models, including random access, external memory, and distributed models. Here, and throughout this specification, we assume a weighted directed graph $G=(V,E)$ with $n$ nodes and $m$ edges. We denote the weight on an edge $(u,v) \in E$ by $\alpha_{u,v}$ and, to simplify the presentation of some of the formulae, assume that the weights on the outgoing edges of each node sum to one.
Page rank is the stationary distribution of a random walk that, at each step, jumps to a random node with probability $\epsilon$ (usually called the teleport probability) and, with probability $1-\epsilon$, follows a random outgoing edge from the current node. Personalized page rank is the same as page rank, except that all the random jumps return to a single node, called the “source” or “seed” node, for which the page rank is being personalized. The personalized page rank of a node $v$ with respect to a source node $u$, denoted by $\pi_u(v)$, satisfies:
$$\pi_u(v) = \epsilon\,\delta_u(v) + (1-\epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w)\,\alpha_{w,v} \qquad (1)$$
where $\delta_u(v)=1$ if and only if $u=v$ (and zero otherwise). The fully personalized page rank computation problem is to compute the vectors $\vec{\pi}_u$ for all $u \in V$. Most applications, such as friend recommendation or query suggestion, only require the top $k$ values (and corresponding nodes) in each PPR vector (for some suitable value of $k$).
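As an illustration only, equation (1) can be read as an update rule: starting from all mass on the source node, repeatedly applying its right-hand side converges to the PPR vector. The sketch below is not taken from this specification; the function names `personalized_pagerank` and `top_k`, the nested-dictionary graph layout, and the default values of `eps` and `iterations` are assumptions made for the example. It also assumes every node has at least one outgoing edge, so that no probability mass is lost.

```python
import heapq
from collections import defaultdict


def personalized_pagerank(edges, source, eps=0.2, iterations=100):
    """Approximate the PPR vector of `source` by iterating equation (1).

    edges: dict mapping node w -> {v: alpha_wv}; the outgoing weights of
           each node are assumed to sum to one (hypothetical layout).
    eps:   teleport probability.
    """
    pi = {source: 1.0}  # start with all mass on the source node
    for _ in range(iterations):
        nxt = defaultdict(float)
        nxt[source] += eps  # the eps * delta_u(v) term
        for w, mass in pi.items():
            for v, alpha in edges.get(w, {}).items():
                # the (1 - eps) * pi_u(w) * alpha_{w,v} term
                nxt[v] += (1.0 - eps) * mass * alpha
        pi = dict(nxt)
    return pi


def top_k(pi, k):
    """Return the k (node, score) pairs with the largest scores."""
    return heapq.nlargest(k, pi.items(), key=lambda item: item[1])
```

For a two-node cycle such as `{"a": {"b": 1.0}, "b": {"a": 1.0}}`, `top_k(personalized_pagerank(edges, "a"), 1)` returns the source node `"a"` first, since the teleport term keeps returning mass to it.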
There are two broad approaches to computing personalized page rank. The first approach is to use linear algebraic techniques, such as Power Iteration. The other approach is Monte Carlo, where the basic idea is to approximate personalized page ranks by directly simulating the corresponding random walks and then estimating the stationary distributions with the empirical distributions of the performed walks. Monte Carlo methods generally depend on generating a large number of inputs that are truly random. Based on this idea, it has previously been proposed to perform, for each node $u \in V$, a number $R$ of random walks starting at $u$, called “fingerprints”, each having a length geometrically distributed as $\mathrm{Geom}(\epsilon)$. Each fingerprint simulates a continuous session by a random surfer who is doing the PPR random walk. The frequencies of visits to different nodes in these fingerprints then approximate the personalized page ranks.
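The fingerprint idea can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the method of this specification: the names `fingerprint` and `estimate_ppr`, the nested-dictionary graph layout, and the `seed` parameter are hypothetical, and Python's pseudo-random generator stands in for a source of true randomness. Each walk stops with probability `eps` at every step, so its length is geometrically distributed, and each visit is weighted by `eps / num_walks` so that the estimates sum to roughly one.

```python
import random
from collections import Counter


def fingerprint(edges, source, eps, rng):
    """Simulate one walk from `source` that stops with probability eps per step."""
    path = [source]
    node = source
    while rng.random() >= eps:  # continue with probability 1 - eps
        targets = edges.get(node)
        if not targets:  # dangling node: end the walk early (assumption)
            break
        nodes = list(targets)
        weights = list(targets.values())
        node = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(node)
    return path


def estimate_ppr(edges, source, eps, num_walks, seed=0):
    """Estimate the PPR vector of `source` from `num_walks` fingerprints."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        visits.update(fingerprint(edges, source, eps, rng))
    # Expected visits to v per walk equal pi_u(v) / eps, hence the scaling.
    return {v: eps * count / num_walks for v, count in visits.items()}
```

Increasing `num_walks` (the parameter R in the text) reduces the variance of the estimates at the cost of more simulated steps.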
MapReduce is a simple computation model for processing huge amounts of data in a massively parallel fashion, using a large number of commodity machines. By automatically handling the lower level issues, such as job distribution, data storage and flow, and fault tolerance, it provides a simple computational abstraction. In MapReduce, computations are done in three phases. The Map phase reads a collection of values or key/value pairs from an input source and, by invoking a user-defined Mapper function on each input element independently and in parallel, emits zero or more key/value pairs associated with that input element. The Shuffle phase groups together all the Mapper-emitted key/value pairs sharing the same key, and outputs each distinct group to the next phase. The Reduce phase invokes a user-defined Reducer function on each distinct group, independently and in parallel, and emits zero or more values to associate with the group's key. The emitted key/value pairs can then be written to disk or serve as the input of a Map phase in a following iteration.
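The three phases can be mimicked in memory to make the data flow concrete. The sketch below is a single-process toy, not a distributed implementation or any particular framework's API; `map_reduce`, `edge_mapper`, and `sum_reducer` are hypothetical names, and the example job (summing the weight of incoming edges per node) is chosen only for illustration.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run one Map / Shuffle / Reduce round sequentially in memory."""
    # Map phase: apply the mapper to each input element independently.
    emitted = []
    for item in inputs:
        emitted.extend(mapper(item))

    # Shuffle phase: group emitted values by key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)

    # Reduce phase: apply the reducer to each group independently.
    output = []
    for key, values in groups.items():
        for result in reducer(key, values):
            output.append((key, result))
    return output


def edge_mapper(edge):
    """Emit (destination, weight) for an edge given as (u, v, weight)."""
    _source, destination, weight = edge
    yield destination, weight


def sum_reducer(_key, values):
    """Emit the total of all values grouped under one key."""
    yield sum(values)
```

Calling `map_reduce(edge_list, edge_mapper, sum_reducer)` on a list of `(u, v, weight)` triples yields one `(node, total_incoming_weight)` pair per destination node; that output could in turn be fed to another round, mirroring the iteration described above.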
Unfortunately, existing Monte Carlo methods for determining random walks for personalized page ranks are slow and computationally intensive in terms of processing and input/output (I/O) resources. This limits the breadth and scope of problems that can be solved using these methods.