The Sybil attack in computer security is an attack on a system that comprises multiple distinct entities (e.g., a peer-to-peer system or an Online Social Network, OSN) wherein the attacker forges and controls multiple pseudonymous (fake) identities. The attacker uses the fake identities to gain a disproportionately large influence and subvert the system, for example by manipulating a vote result. In an electronic network environment, particularly in social networks, Sybil attacks are commonplace due to the open nature of these platforms. In a Sybil attack, a malicious user creates multiple fake identities or fake OSN accounts and pretends to be multiple distinct nodes in the system.
Fake (Sybil) OSN accounts can be used for various profitable malicious purposes, such as spamming, click-fraud, malware distribution, identity fraud and phishing, to name but a few. For instance, they enable spammers to abuse an OSN's messaging system to post spam, or they waste the resources of an OSN's advertising customers by making them pay for online ad clicks or impressions from or to fake profiles. Fake accounts can also be used to acquire users' private contact lists, to increase the visibility of niche content, to post on forums and to manipulate votes or view counts on pages. Sybils can use the “+1” button to manipulate Google search results or to pollute location crowdsourcing results. Furthermore, fake accounts can be used to access personal user information and perform large-scale crawls over social graphs. Some fake accounts are used to spread inappropriate or illegal content such as pornography and extreme violence. Fake accounts are also used in political campaigns. People also create fake profiles for social reasons. These include friendly pranks, stalking, cyber-bullying, and concealing a real identity to bypass real-life constraints. Many fake accounts are also created for the purpose of joining online social games.
Typically, fake accounts tend to connect to other fake accounts. For instance, porn accounts often have friends that are also porn accounts. Fake celebrities also have friends that are fake celebrities. There are also long chains of fake accounts created for spam purposes which appear empty (no photos, no wall) and whose friends are also fake and empty.
Due to the multitude of the reasons behind their creation, Sybils in real OSNs manifest numerous and diverse profile features and activity patterns. Thus, feature-based Sybil detection (e.g., Machine-Learning-based) rarely yields desirable accuracy. As a result, the detected accounts cannot be automatically suspended due to the high rate of false positives. Instead, OSNs are employing a time-consuming manual account verification process driven by user-reports on abusive accounts and automated classifiers. However, only a small fraction of the inspected accounts are indeed fake, which signifies an inefficient use of human labor.
If an OSN provider could detect Sybil nodes in its system effectively, the experience of its users and their perception of the service could be improved by stemming annoying spam messages and invitations. The OSN provider would also be able to increase the marketability of its user base and its social graph for advertising purposes, and to enable other online services or distributed systems to use a user's OSN identity as an authentic digital identity, a vision foreseen by recent efforts such as Facebook Connect.
On the other hand, although social-graph-based Sybil defences have been extensively discussed in the research community (they are briefly described below), there has been little evidence of wide industrial adoption due to their shortcomings in terms of effectiveness and efficiency.
Social-network-based Sybil detection mechanisms rely on the properties of a social network graph to separate Sybils from non-Sybil users. Yu et al. pioneered the field with two such mechanisms, SybilGuard and SybilLimit, both of which sample random route (a special type of random walk) traces to infer Sybils. Both mechanisms bound the number of accepted Sybils based on the assumption that the attack edges between Sybils and non-Sybils form a minimum quotient cut in the social network.
SybilGuard, disclosed in “SybilGuard: Defending Against Sybil Attacks via Social Networks”, SIGCOMM, 2006, by Yu et al., bounds the number of accepted Sybils per attack edge to O(√n log n), where n is the number of non-Sybil nodes in a social network and when the number of attack edges is on the order of o(√n log n).
SybilLimit, disclosed in “SybilLimit: A Near-Optimal Social Network Defense Against Sybil Attacks”, IEEE S&P, 2008, by Yu et al., improves over SybilGuard and limits the number of accepted Sybils per attack edge to O(log n), when the number of attack edges is o(n/log n).
SybilGuard and SybilLimit are designed to provide the above guarantees in a decentralized setting where each node acts as a verifier and independently decides whether to accept another node as a non-Sybil.
The solution proposed by Mohaisen et al. in “Keep your friends close: Incorporating trust into social network-based Sybil defenses”, INFOCOM, 2011, strengthens existing Sybil defenses, such as SybilLimit, by incorporating social trust into random walks. Mohaisen et al. observe that some social networks have a higher mixing time than assumed in previous work, and propose to use weighted random walks where the node transition probability to a neighbor is weighted by the pairwise trust between nodes. The mixing time of a social network is the maximum number of steps that a random walk needs to traverse so that the probability of landing at each node reaches the stationary distribution. The stationary distribution describes the probability of landing at each node after an infinite number of steps.
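The notions of weighted random walk, stationary distribution and mixing time can be illustrated with a minimal Python sketch. It is not the method of Mohaisen et al.; the function names, the three-node trust graph and the tolerance eps are assumptions introduced only for illustration. The sketch assumes an undirected, connected, non-bipartite graph with symmetric positive trust weights, in which case the stationary probability of a node is proportional to the sum of its trust weights, and it approximates the mixing time as the number of steps after which the walk is within a total variation distance eps of the stationary distribution from every starting node.

```python
def transition_probabilities(trust):
    """Turn symmetric pairwise trust weights into per-node transition probabilities."""
    probs = {}
    for node, neighbors in trust.items():
        total = sum(neighbors.values())
        probs[node] = {nbr: weight / total for nbr, weight in neighbors.items()}
    return probs


def step(dist, probs):
    """Advance the landing-probability distribution by one random-walk step."""
    nxt = {node: 0.0 for node in probs}
    for node, mass in dist.items():
        for nbr, p in probs[node].items():
            nxt[nbr] += mass * p
    return nxt


def stationary_distribution(trust):
    """For symmetric weights, stationary mass is proportional to a node's total trust."""
    strength = {node: sum(neighbors.values()) for node, neighbors in trust.items()}
    total = sum(strength.values())
    return {node: s / total for node, s in strength.items()}


def total_variation(p, q):
    """Total variation distance between two distributions over the same nodes."""
    return 0.5 * sum(abs(p[node] - q[node]) for node in p)


def mixing_time(trust, eps=1e-3, max_steps=1000):
    """Largest number of steps, over all start nodes, needed to get within eps."""
    probs = transition_probabilities(trust)
    target = stationary_distribution(trust)
    worst = 0
    for start in trust:
        dist = {node: 0.0 for node in trust}
        dist[start] = 1.0
        steps = 0
        while total_variation(dist, target) > eps and steps < max_steps:
            dist = step(dist, probs)
            steps += 1
        worst = max(worst, steps)
    return worst


# Hypothetical trust graph: a and b trust each other strongly, c is weakly trusted.
trust = {
    "a": {"b": 3.0, "c": 1.0},
    "b": {"a": 3.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
}
```

Calling stationary_distribution(trust) and mixing_time(trust) on this graph shows how raising or lowering a pairwise trust weight changes both where the walk tends to land and how quickly it forgets its starting node.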
Other existing social-graph-based Sybil defences include the following:
SybilInfer, described in “SybilInfer: Detecting Sybil Nodes using Social Networks”, NDSS, 2009, by Danezis et al., infers the probability of a suspect user being non-Sybil in centralized systems. It also assumes attack edges form a sparse cut in a social network. It uses a machine learning method, the Metropolis-Hastings (MH) algorithm, to sample N non-Sybil user sets based on random-walk traces. Counting the user occurrence in N samples, SybilInfer generates a marginal probability for each node being non-Sybil. The marginal probability distribution is close to uniform in the absence of a Sybil attack, whereas it skews towards non-Sybil users under an attack. However, unlike SybilLimit and SybilGuard, SybilInfer does not specify the expected number of Sybils accepted per attack edge (false negatives) or the likelihood that a non-Sybil is declared as a Sybil user (false positives). SybilInfer also incurs a higher computation cost: O(s n (log n)²), where s is the number of trusted (referred to as honest) seeds.
SumUp, described in “Sybil-Resilient Online Content Rating”, NSDI, 2009, by Tran et al., is an online vote collection system that exploits the social network graph to discard votes from Sybil users. It defines a voting region in the network by assigning decreasing numbers of tickets (used as link capacity) to links with increasing distances from a trusted vote collector, and approximates the multiple-source max-flow from the voters to a single trusted vote collector.
GateKeeper, described in “Optimal Sybil-resilient Node Admission Control”, INFOCOM, 2011, by Tran et al., builds on SumUp to admit non-Sybil users in a distributed system. It uses multiple ticket sources to reduce the probability that non-Sybil users do not receive tickets. In random expander social networks, GateKeeper accepts O(log g) Sybils per attack edge, where g is the number of attack edges and g is on the order of O(n/log n). It costs O(s n log n), because the computational cost of each max flow heuristic run is O(n log n).
Sirivianos et al. describe a maximum-flow-based method (called MaxTrust) for assigning scores to users that depend on their likelihood of being Sybils, as disclosed in “Assessing the Veracity of Identity Assertions via OSNs”, COMSNETS, 2012. The cost of MaxTrust is O(Tmax n log n). The integer value Tmax determines the granularity of the scoring.
BotGraph, disclosed in “BotGraph: Large Scale Spamming Botnet Detection”, NSDI, 2009, by Zhao et al., detects bot-users by constructing a large user-user graph based on shared email IP addresses and looking for tightly connected user groups. This technique is also applicable in social graphs to detect bot (fake) users.
Unlike the aforementioned Sybil detection mechanisms, which aim at explicitly distinguishing Sybils from non-Sybil users, another approach is to use a Sybil-resilient design that aims to make a particular application (e.g., a recommendation system or a DHT) resilient to Sybil attacks. For instance, Post et al. propose Bazaar, described in “Bazaar: Strengthening User Reputations in Online Marketplaces”, USENIX NSDI, 2011, which is an improved reputation system for online marketplaces. A key advantage of this design approach is that it can use application-specific knowledge to mitigate Sybils in the application, such as limiting the number of votes collected from Sybil users. However, a Sybil-resilient design optimized for an application may not be applicable to other systems, while a social-network-based Sybil detection mechanism identifies potential Sybils, and can be applied to all applications where users have social connections.
Some reputation systems use the power iteration method, which simulates a random surfer's walk across the web pages, beginning at a state and running the (random) walk for a large number of steps and keeping track of the visit frequencies for each of the states. A random walk is a mathematical formalization of a path that consists of a succession of random steps. PageRank reflects the stationary distribution of a random walk that at each step, with a probability ε, usually called the teleport probability, jumps to a random node and with probability 1−ε follows a random outgoing edge from the current node.
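The power iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name pagerank, the parameter names and the example graph are assumptions, and the sketch further assumes that every node has at least one outgoing edge and that every edge target is itself a key of the graph.

```python
def pagerank(out_edges, teleport=0.15, iterations=100):
    """Approximate the stationary distribution of the teleporting random walk.

    out_edges maps each node to the list of nodes it links to.
    Assumption: every node has at least one outgoing edge.
    """
    nodes = list(out_edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from the uniform distribution
    for _ in range(iterations):
        # With probability `teleport` the walk jumps to a uniformly random node.
        nxt = {node: teleport / n for node in nodes}
        # With probability 1 - teleport it follows a random outgoing edge.
        for node in nodes:
            share = (1.0 - teleport) * rank[node] / len(out_edges[node])
            for target in out_edges[node]:
                nxt[target] += share
        rank = nxt
    return rank


# Hypothetical three-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Each pass redistributes the current probability mass along outgoing edges, so the values in ranks always sum to one and, after enough iterations, approximate the visit frequencies of the random surfer.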
Personalized PageRank (“Topic-sensitive pagerank” by Haveliwala et al., WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 517-526, 2002), EigenTrust (“The EigenTrust algorithm for reputation management in P2P networks” by García-Molina et al., WWW '03 Proceedings of the 12th international conference on World Wide Web, pp. 640-651, 2003) and TrustRank (“Combating Web Spam with TrustRank” by García-Molina et al., VLDB, 2004) are trust inference mechanisms that use power iteration to compute the stationary distribution, which describes the probability of a random walk starting from one or multiple trust seeds to land at a node. Trust seeds are nodes that are a priori considered trustworthy and are used to initialize the trust computation. The power iteration includes a reset probability for the random walk to jump back to trust seeds. They use this distribution to assess how close and connected a node is to the trust seeds, a metric that reflects how trustworthy a node is. However, these methods cannot be used to accurately detect Sybils, because non-Sybils far from the seeds obtain low trust ranks and Sybils close to the seeds obtain high trust ranks.
The technical report “The PageRank citation ranking: Bringing order to the web”, Stanford InfoLab, 1999, by Page et al., describes the PageRank method, which uses power iteration in a fashion similar to EigenTrust, TrustRank and Personalized PageRank, but with a constant reset probability of jumping to random nodes in the directed graph. However, PageRank is not Sybil-mitigating because its random walks are reset to any random node, including Sybils.
Lin et al. disclose in “Power Iteration Clustering”, ICML, 2010, the use of (an early-terminated) power iteration to cluster a set of vectors based on a similarity matrix. However, the early-terminated power iteration does not compute the probability of a random walk landing at a node. Thus, it is unclear how to apply this method to effectively detect Sybils.
Viswanath et al. proposed community detection (CD) algorithms such as Mislove's algorithm, disclosed in “You are Who you Know: Inferring User Profiles in Online Social Networks”, ACM WSDM, 2010, for Sybil detection. However, CD algorithms do not provide provable guarantees and Mislove's CD algorithm costs O(n²).
In response to the inapplicability of automated account suspension, OSNs employ CAPTCHAs (standing for “Completely Automated Public Turing test to tell Computers and Humans Apart”) to rate-limit suspected users or manually inspect the features of accounts reported as abusive by other users. The inspection involves matching profile photos to the age or address, understanding natural language in posts, examining the friends of a user, etc. Confirmed fake accounts are always suspended or deleted. However, these tasks require human intelligence and intuition, which makes them hard to automate and to scale up.
Due to the high false-positive rate of existing binary Sybil/non-Sybil classifiers, manual inspection needs to be part of the decision process for suspending an account. Consequently, it is desirable to efficiently derive a quality ranking in which a substantial portion of Sybils ranks low, enabling the OSN provider to focus its manual inspection efforts towards the end of the list, where it is more likely to encounter Sybils.
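The limitation noted above can be reproduced with a small Python sketch of a seed-reset power iteration in the spirit of Personalized PageRank. It is not the exact EigenTrust or TrustRank algorithm; the function name seeded_trust, the reset value and the toy graph (one trust seed s, a chain of honest users h1 to h4, and a Sybil x attached directly to the seed) are assumptions chosen for illustration.

```python
def seeded_trust(neighbors, seeds, reset=0.15, iterations=100):
    """Power iteration whose random walk resets to the trust seeds.

    neighbors maps each node to the list of its friends (undirected graph);
    seeds is the collection of a priori trusted nodes.
    """
    seed_share = {node: 0.0 for node in neighbors}
    for seed in seeds:
        seed_share[seed] = 1.0 / len(seeds)
    trust = dict(seed_share)  # all initial trust sits on the seeds
    for _ in range(iterations):
        # With probability `reset` the walk jumps back to a trust seed.
        nxt = {node: reset * seed_share[node] for node in neighbors}
        # Otherwise it moves to a uniformly random friend.
        for node, friends in neighbors.items():
            share = (1.0 - reset) * trust[node] / len(friends)
            for friend in friends:
                nxt[friend] += share
        trust = nxt
    return trust


# Hypothetical graph: Sybil x is one hop from the seed, honest h4 is four hops away.
graph = {
    "s": ["h1", "x"],
    "x": ["s"],
    "h1": ["s", "h2"],
    "h2": ["h1", "h3"],
    "h3": ["h2", "h4"],
    "h4": ["h3"],
}
scores = seeded_trust(graph, seeds=["s"])
```

In this toy graph the Sybil x, being adjacent to the seed, ends up with a higher score than the honest but distant user h4, which is the failure mode that prevents such scores from being used directly as a Sybil detector.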
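How such a ranking could drive manual inspection is sketched below in Python. The sketch assumes that some upstream method has already assigned each account a trust score, with higher meaning more likely genuine; the function name inspection_queue, the budget parameter and the example scores are hypothetical.

```python
def inspection_queue(scores, budget):
    """Rank accounts by descending trust score and pick the tail for review.

    Returns the full ranking and the `budget` lowest-ranked accounts,
    ordered so that the least trusted account is inspected first.
    """
    ranked = sorted(scores, key=lambda account: scores[account], reverse=True)
    tail = ranked[-budget:] if budget > 0 else []
    return ranked, list(reversed(tail))


# Hypothetical trust scores produced by some ranking method.
example_scores = {"u1": 0.30, "u2": 0.02, "u3": 0.25, "u4": 0.01, "u5": 0.18}
ranking, to_inspect = inspection_queue(example_scores, budget=2)
```

The budget reflects how many accounts human verifiers can review; accounts outside the queue are left untouched rather than being suspended automatically.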
Therefore, there is a need in the state of the art for a method for unveiling fake OSN accounts that reliably allows human verifiers to focus on a small number of user accounts that are very likely to be fake (Sybil accounts), and to provide the OSN provider with a ranking list to assist with determining whether to challenge suspicious users, for instance by means of CAPTCHAs.