This invention relates generally to detecting website cloaking by extracting features of URL redirects and providing the features to a machine-learning-based model configured to predict the likelihood that the website is performing cloaking.
Online systems often enforce policies regarding what content can be posted to the online system and what content can be linked to content distributed by the online system. For example, an online social networking system may restrict users from posting and linking to certain types of content, such as adult content, violent content, threats, content related to criminal activity, or fraudulent content. To enforce these policies, the online system monitors content and blocks content that is determined to be in violation of a policy. To thwart the online system's ability to detect linked content that violates a policy, certain websites perform cloaking of the content they publish via the online system.
Websites perform cloaking by providing different content to different users. For example, a website may identify the user that is requesting content from the website, or identify information describing the requesting device, such as the device's IP address. The website then provides “good” content to devices that it determines to be within an online system that enforces a policy, for example, devices used for monitoring and maintaining a social networking system. The website provides “bad” content (e.g., content that violates a policy) to other devices, such as devices that are used by users of the online system and that are identified as being external to the online system. The good content shown to devices within the online system “cloaks” the content that is shown to external devices, making it difficult for the online system to determine the true nature of the content that the website is delivering to the external users of the online system. Conventional techniques fail to detect policy violations by websites that perform cloaking.
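By way of illustration only, the cloaking behavior described above can be sketched as follows. This is a minimal sketch and not an implementation taken from any actual website: the address range in `ONLINE_SYSTEM_NETWORKS`, the function names `is_internal` and `serve_content`, and the placeholder content strings are all hypothetical and are chosen only to show how a cloaking website might select content based on a requesting device's IP address.

```python
import ipaddress

# Hypothetical address range that the cloaking website attributes to the
# online system (a reserved documentation range is used here as a stand-in).
ONLINE_SYSTEM_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_internal(ip: str) -> bool:
    """Return True if the requesting IP falls inside an assumed online-system range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ONLINE_SYSTEM_NETWORKS)


def serve_content(ip: str) -> str:
    """Select content the way a cloaking website might: policy-compliant
    content for devices within the online system, violating content for
    all other devices."""
    return "good content" if is_internal(ip) else "bad content"
```

In this sketch, a request from a device within the assumed range receives the good content, while a request from any other address receives the bad content, so a monitoring device within the online system never observes the content delivered to external users.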