The power of social media is undeniable, whether in a marketing or political campaign, in sharing breaking news, or during catastrophic events. Unfortunately, social media has also become a major weapon for launching cyberattacks on an organization and its people. By hacking into the accounts of (popular) users, attackers can post false information, which can go viral, cause economic damage, and create havoc among people. Another major threat on social media is the spread of malware through posts that trick unsuspecting users into clicking malicious links [5]. For these reasons, organizations are developing policies for the use of social media and investing substantial money and resources to secure their infrastructure and prevent such attacks.
Ascertaining the veracity (or trustworthiness) of social media posts is becoming very important today. To do so, one must consider both the content of the posts and the behavior of the users. However, technical challenges arise in designing a method or system that can model and reason about the veracity of social media posts. The first challenge is to represent complex and diverse social media data in a principled manner. For example, a tweet is a message of up to 140 characters posted by a user on Twitter. Each tweet is represented using 100+ attributes, and attribute values can be missing or noisy. New attributes may appear in tweets, and some attributes may be absent from a given tweet. Hashtags, which begin with the # symbol, are used frequently in tweets to indicate specific topics or categories. Thousands of hashtags are in use today, and the popularity of a hashtag changes over time; some hashtags become trending (popular) during a particular time period. The second challenge is to construct a knowledge base (KB) on social media posts. The goal is to learn the entities, facts, and rules from a large number of posts. The third challenge is to reason about the veracity of the posts using a KB that contains a large number of entities and facts, so that suspicious content and activities can be flagged as soon as possible to discover emerging cyber threats.
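To make the first two challenges concrete, the following is a minimal Python sketch of how a sparse tweet record might be turned into KB facts while tolerating missing attributes. The field names (id, user_id, text) and the predicate names (tweeted, containsHashtag) are assumptions chosen for illustration; they are not the actual Twitter attribute names or the predicates of the described system.

```python
from __future__ import annotations

import re
from typing import Any

# A hashtag is the # symbol followed by one or more word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> list[str]:
    """Return the distinct hashtags in a tweet, lowercased, in order of first use."""
    return list(dict.fromkeys(tag.lower() for tag in HASHTAG_RE.findall(text)))


def tweet_to_facts(tweet: dict[str, Any]) -> list[tuple[str, ...]]:
    """Convert one tweet record into (predicate, arg1, arg2) facts.

    Attributes that are missing or None are skipped rather than raising,
    because real tweet records are sparse and noisy.
    """
    facts: list[tuple[str, ...]] = []
    tweet_id = tweet.get("id")  # hypothetical field name
    if tweet_id is None:
        return facts  # nothing can be asserted about an unidentified tweet
    tweet_id = str(tweet_id)

    user_id = tweet.get("user_id")  # hypothetical field name
    if user_id is not None:
        facts.append(("tweeted", str(user_id), tweet_id))

    text = tweet.get("text") or ""  # hypothetical field name; may be absent
    for tag in extract_hashtags(text):
        facts.append(("containsHashtag", tweet_id, tag))
    return facts
```

For instance, a record that lacks a user identifier still yields hashtag facts, and a record that lacks an identifier yields no facts at all.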
The invention described herein presents a system that addresses the above challenges in order to discover cyber threats on Twitter [3]. The system provides a unified framework for modeling and reasoning about the veracity of tweets to discover suspicious users and malicious content. The system builds on the concept of Markov logic networks (MLNs) for knowledge representation and reasoning under uncertainty [4]. It can be used to analyze both the behavior of users and the nature of their posts to ultimately discover potential cyberattacks on social media. Cyberattacks on social media are quite complex in nature: they can range from posting malicious URLs to spread malware, to posting misleading or false information to create chaos, to compromising innocent users' accounts. The system embodies a KB over tweets that captures both the behavior of users and the nature of their posts. The KB contains entities, their relationships, facts, and rules. Via probabilistic inference on the KB, the system can identify malicious content and suspicious users in a given collection of tweets.
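As an illustration of probabilistic inference over weighted rules, the following Python sketch computes marginal probabilities in a toy Markov logic network by enumerating all possible worlds, where the probability of a world is proportional to the exponential of the summed weights of the ground formulas it satisfies. The rule postsMaliciousUrl(u) => suspicious(u), its weight of 1.5, and the users alice and bob are assumptions made for this example and are not taken from the KB of the described system. Exhaustive enumeration is feasible only for tiny examples; practical MLN engines rely on approximate inference.

```python
import itertools
import math


def implies(antecedent, consequent):
    """Build a ground formula: antecedent => consequent."""
    return lambda world: (not world[antecedent]) or world[consequent]


def marginal(query, atoms, evidence, formulas):
    """P(query is true | evidence) by summing over all consistent worlds.

    formulas is a list of (weight, ground_formula) pairs; a world's
    unnormalized score is exp(sum of weights of satisfied formulas).
    """
    hidden = [a for a in atoms if a not in evidence]
    total = 0.0
    query_true = 0.0
    for values in itertools.product([False, True], repeat=len(hidden)):
        world = dict(evidence)
        world.update(zip(hidden, values))
        score = math.exp(sum(w for w, f in formulas if f(world)))
        total += score
        if world[query]:
            query_true += score
    return query_true / total


# Hypothetical users, rule, and weight, for illustration only.
users = ["alice", "bob"]
atoms = [f"postsMaliciousUrl({u})" for u in users] + [f"suspicious({u})" for u in users]
formulas = [
    (1.5, implies(f"postsMaliciousUrl({u})", f"suspicious({u})")) for u in users
]
evidence = {"postsMaliciousUrl(alice)": True, "postsMaliciousUrl(bob)": False}

p_alice = marginal("suspicious(alice)", atoms, evidence, formulas)
p_bob = marginal("suspicious(bob)", atoms, evidence, formulas)
```

Under these assumptions, the evidence against alice raises the probability that she is suspicious above one half, while bob, for whom the rule is trivially satisfied, stays at exactly one half. A larger weight would express stronger confidence in the rule.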
A few recent patented methods or systems detect attacks on social networks, including methods for preventing coalition attacks [US 20140059203], preventing an advanced persistent threat (APT) using social network honeypots [US 20150326608], detecting undesirable content in a social network [US 20130018823], and preventing the spread of malware in social networks [U.S. Pat. No. 9,124,617]. However, there is no published method or system that has (a) employed MLNs for modeling tweets and users' behavior as a KB and (b) applied probabilistic inference on the KB for discovering suspicious users and malicious content.