Analyzing or mining online social networks (OSNs) has become one of the most pressing problems in modern data mining. The need arises from the exponential growth of these networks as they become increasingly popular. Networks such as Facebook®, Twitter®, and LinkedIn® hold vast amounts of information, including information about the interconnections between members, that is useful for many purposes. Some non-limiting examples include commercial, management, political, research, and demographic purposes, as well as emergency preparedness, defense, and security applications.
Tracing through social graphs is desirable, as is identifying social trends. That said, a brute force approach is problematic due to the sheer volume and ever-changing nature of the data set. For example, the Facebook® network takes up hundreds of terabytes of storage relating to hundreds of millions of people. The volume of information expands daily as more people join the service or post information. Processing the information remains a daunting task. Methods that appear effective for processing the data include: specific record analysis, where a single record is selected and analyzed, for example to determine a suitable advertisement for a particular user; crawling, where data is crawled offline and the results are used to index or search the large data set responsively; and sampling, where a small sample is selected from the huge data set for use in responsive analysis. For sampling to work correctly, the sample data is preferably representative of the whole data set or, alternatively, the results of the sampling and analysis together are representative of the data set. Recent research has shown that sampling is achievable by crawling online social networks to find a relatively small representative sample suitable, for example, for studying properties and testing algorithms. The sample is extracted through the crawling and then used for more responsive data analysis.
Existing techniques for crawling include breadth-first search (BFS) and random walks (RW). It is known that these techniques usually yield a bias toward the most highly connected nodes. With social networks, this is highly problematic because some nodes have such high connectivity (imagine someone famous) that they skew the resulting samples. That said, crawling using the traditional Metropolis-Hastings algorithm (MH), which is a typical Markov Chain Monte Carlo (MCMC) technique, can create unbiased samples suitable for the problem of social network analysis and social network activity analysis.
The breadth-first search (BFS) method, which is regarded as a graph traversal technique, selects each next node to explore according to the traditional breadth-first search algorithm. It has been used in practice for sampling online social networks in past research. Recent research also shows that, because the search is incomplete, the method sometimes densely covers one specific region of a graph, but this bias is potentially correctable by deriving an unbiased estimator of the original node degree distribution. That said, even such an estimator may be difficult to derive.
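The incomplete-search behavior described above can be illustrated with a minimal sketch. The function name `bfs_sample`, the node budget, and the ten-node path graph are hypothetical choices made only for illustration; they do not come from any cited work.

```python
from collections import deque

def bfs_sample(adj, start, budget):
    """Crawl breadth-first from `start`, stopping after `budget` nodes.

    `adj` maps each node to a list of its neighbors. Returns the nodes
    in the order they were explored.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue and len(order) < budget:
        node = queue.popleft()
        order.append(node)
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Hypothetical toy graph: a path 0 - 1 - 2 - ... - 9.
n = 10
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# With a budget smaller than the graph, only nodes near the seed are reached.
sample = bfs_sample(path, start=0, budget=4)
```

Because the crawl stops once the budget is exhausted, every sampled node lies close to the seed, which is one way the regional bias noted above can arise.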
The random walk (RW) method chooses a next state W uniformly at random among the neighbors of a current node V. Because the probability of the RW being at a particular node V converges to a value proportional to the degree of V, the nodes sampled by the random walk are biased towards high-degree nodes. This bias may be corrected by an appropriate re-weighting of the measured values, such as with the Hansen-Hurwitz estimator. That said, the bias is problematic in most social networks because some nodes have an extremely high degree relative to others.
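The degree bias and its correction by re-weighting can be sketched as follows. This is a minimal illustration under stated assumptions: the wheel-shaped toy graph, the function names, and the step count are hypothetical, and the estimator shown is a ratio form of Hansen-Hurwitz-style re-weighting in which each visit is weighted by the inverse of the node degree.

```python
import random

def wheel_graph(n_rim):
    """Hypothetical toy graph: hub 0 joined to rim nodes 1..n_rim, rim nodes in a ring."""
    adj = {0: list(range(1, n_rim + 1))}
    for i in range(1, n_rim + 1):
        nxt = 1 + (i % n_rim)        # next rim node around the ring
        prv = 1 + ((i - 2) % n_rim)  # previous rim node around the ring
        adj[i] = [0, nxt, prv]
    return adj

def random_walk(adj, start, steps, seed=0):
    """Simple random walk: the next node is uniform among the current node's neighbors."""
    rng = random.Random(seed)
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited

def reweighted_mean(visited, adj, f):
    """Estimate the mean of f over all nodes, weighting each visit by 1/degree."""
    num = sum(f(v) / len(adj[v]) for v in visited)
    den = sum(1.0 / len(adj[v]) for v in visited)
    return num / den

adj = wheel_graph(9)  # hub has degree 9, each rim node has degree 3
visited = random_walk(adj, start=1, steps=50000)
naive = sum(len(adj[v]) for v in visited) / len(visited)
corrected = reweighted_mean(visited, adj, lambda v: len(adj[v]))
```

On this toy graph the true mean degree is 3.6. The uncorrected average over visited nodes tends toward 4.5 because the hub is visited in proportion to its degree, while the re-weighted estimate tends toward 3.6.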
The Metropolis-Hastings Random Walk (MHRW) method modifies the transition probabilities of the random walk so that, given sufficient time, the walk converges to a uniform distribution over the nodes. The Metropolis-Hastings process is a typical Markov Chain Monte Carlo (MCMC) technique for sampling from a probability distribution.
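One common way to modify the transition probabilities for a uniform target is to propose a uniformly chosen neighbor W of the current node V and accept the move with probability min(1, deg(V)/deg(W)), staying at V otherwise. The following is a minimal sketch under that assumption; the wheel-shaped toy graph and the function names are hypothetical and chosen only for illustration.

```python
import random

def wheel_graph(n_rim):
    """Hypothetical toy graph: hub 0 joined to rim nodes 1..n_rim, rim nodes in a ring."""
    adj = {0: list(range(1, n_rim + 1))}
    for i in range(1, n_rim + 1):
        adj[i] = [0, 1 + (i % n_rim), 1 + ((i - 2) % n_rim)]
    return adj

def mhrw(adj, start, steps, seed=0):
    """Metropolis-Hastings random walk targeting a uniform distribution over nodes.

    Returns the list of visited nodes (repeats included) and the number
    of rejected proposals.
    """
    rng = random.Random(seed)
    node = start
    samples = []
    rejections = 0
    for _ in range(steps):
        candidate = rng.choice(adj[node])
        # Accept with probability min(1, deg(node) / deg(candidate)).
        if rng.random() < len(adj[node]) / len(adj[candidate]):
            node = candidate
        else:
            rejections += 1  # the walk stays put and the current node is re-sampled
        samples.append(node)
    return samples, rejections

adj = wheel_graph(9)  # 10 nodes: hub of degree 9, rim nodes of degree 3
samples, rejections = mhrw(adj, start=1, steps=100000)
hub_fraction = samples.count(0) / len(samples)
```

With ten nodes, a uniform sample would place the hub in about one tenth of the visits, whereas a simple random walk on the same graph would visit it about one quarter of the time. The rejection counter makes visible the cost of that correction: proposals toward the high-degree hub are frequently declined.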
Techniques for crawling using random walks based on traditional Markov Chain Monte Carlo (MCMC) methods are known. Typically, the chain is started from an initial state and run for a burn-in time assumed to be long enough for the chain to have converged. Samples generated after the burn-in are assumed to be true samples from the stationary distribution. Although various diagnostics, such as the Geweke diagnostic and the Gelman-Rubin diagnostic, can be used for assessing convergence, none of them guarantees that the chain has exactly converged. As a result, the samples are usually only approximate. It has also been shown that the MHRW incurs a large number of rejections during the initial sampling process and that the method is subject to slow mixing.
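The burn-in and diagnostic steps can be sketched as below. This is a simplified illustration, not the full Geweke diagnostic: it compares the mean of an early segment of the chain with the mean of a late segment, but it uses plain sample variances where the full diagnostic uses spectral density estimates, so it understates the variance for strongly autocorrelated chains. The function names, the 10% and 50% segment fractions, and the demonstration chains are assumptions made for illustration.

```python
import math
import random
import statistics

def drop_burn_in(chain, burn_in):
    """Discard the first `burn_in` values of the chain."""
    return chain[burn_in:]

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke-style z-score comparing early and late segment means.

    A large absolute value suggests the chain has not settled; a small
    value does not prove that it has.
    """
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[int((1 - last) * n):]
    var_early = statistics.variance(early) / len(early)
    var_late = statistics.variance(late) / len(late)
    return (statistics.mean(early) - statistics.mean(late)) / math.sqrt(var_early + var_late)

# A chain that is still drifting: its early and late means differ sharply.
drifting = [i / 1000 for i in range(1000)]
z_drifting = geweke_z(drifting)

# A chain of independent draws from one fixed distribution.
rng = random.Random(0)
settled = [rng.gauss(0.0, 1.0) for _ in range(2000)]
z_settled = geweke_z(settled)
```

A small score is only consistent with convergence. It cannot establish that the retained values are exact draws from the stationary distribution, which is the limitation noted above.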
However, MCMC techniques such as the Metropolis-Hastings process come with significant challenges: long burn-in lengths and correlation with the choice of initial node are just two of the drawbacks, and both usually go together with slow mixing. For example, recent research has shown that the Metropolis-Hastings Random Walk (MHRW) process usually produces unbiased samples of Facebook® by randomly requesting 84 k samples for convergence after discarding a burn-in length of 6 k. On the other hand, convergence diagnostic methods such as the Geweke diagnostic cannot guarantee that the chain has converged to a sample value from the desired distribution. Therefore, the sample, e.g., of 78 k, obtained by such an MCMC algorithm is usually only approximate. Verification of the sample is a time-consuming task because the data set, which is very large, must be analyzed to determine whether the sample is representative.
It would be advantageous to provide a more deterministic approach to sample generation.