One major problem facing modern computing systems and communications systems is the prevalence of spam messages. Spam messages are a serious issue not only in e-mail systems, but also in Short Message Service (SMS), Instant Messaging (IM), and in virtually every other form of electronic communication.
One form of spam that has become more and more common is a spam message that includes a Uniform Resource Locator (URL) that, when activated, links, or redirects, to one or more websites that include unsolicited, malicious, unwanted, offensive, or nuisance content.
One method that could be used to determine if a message including a URL is potential spam, i.e., is “spammy”, is to analyze the included URL by one or more URL analysis methods such as, but not limited to: analyzing various portions of the URL; activating the URL link to the associated webpage; and/or analyzing the contents of the webpage linked to by the URL. However, the prevalence of URL shortening services, and other types of redirects, has significantly complicated traditional URL analysis and, in particular, has made accessing a webpage, and the content of a webpage, associated with a URL far more difficult.
URL shortening services typically provide users, including spammers, the ability to shorten the size, or number of characters, associated with a given URL by providing shortened URLs that map, or redirect, to the longer actual URL. URL shortening services are legitimately used to allow the URL to be included in text size limited communications, such as Twitter™. On the other hand, spammers can use URL shortening services to mask an actual spam URL, and associated webpage content, by having multiple shortened URLs created that redirect to the same actual URL and/or each other.
Spammers have recently begun to regularly use URL redirects, including URL shortening service related URL redirects. In fact, many spammers now routinely employ a deeply nested series of URL redirects of various types, to frustrate, and/or avoid, URL analysis and the retrieval of associated webpage content.
Currently, redirects, and particularly nested redirects, make it difficult, if not impossible, to identify and block the spam because, using URL redirects, the spammy URL content can be hidden by way of a redirect shell game that prevents currently available link-following and/or security systems from automatically accessing the actual URL in a reasonable amount of time. Therefore, simply attempting to retrieve the content at an included URL will no longer reliably yield webpage content for analysis.
To further complicate the situation, redirects used by a spammer can be one or more of many different types of redirects, such as, but not limited to: Hypertext Transfer Protocol (HTTP) redirects; Hypertext Markup Language (HTML) Meta redirects; and JavaScript redirects, and can include other issues such as tracking bugs, Document Object Model (DOM) manipulation, and incorrect HTTP response codes. In addition, the number of redirects that can be employed by spammers is effectively unlimited. Therefore, some spammers use multiple types of redirects, and/or a high number of redirects, to frustrate analysis. Consequently, it is not sufficient to simply have lists of sites/URLs for which redirects should be handled because there are too many sites, and too many different methods for redirection available to the spammer.
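As an illustration of why the type of redirect must be determined before it can be followed, the three redirect types named above can be distinguished roughly as follows. This is a minimal sketch and not a method described in this document: the function name `classify_redirect`, the two regular expressions, and the example URLs are hypothetical, and the pattern matching only recognizes plain, unobfuscated redirects.

```python
import re
from typing import Mapping, Optional, Tuple

# Hypothetical heuristics, for illustration only. They match simple,
# unobfuscated redirects and will miss anything built dynamically.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=["\']?([^"\'>\s]+)',
    re.IGNORECASE,
)
JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

# Status codes that conventionally carry a Location header.
HTTP_REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_redirect(
    status: int, headers: Mapping[str, str], body: str
) -> Tuple[str, Optional[str]]:
    """Return (redirect_type, target) for one already-fetched response."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status in HTTP_REDIRECT_CODES and "location" in lowered:
        return ("http", lowered["location"])
    match = META_REFRESH.search(body)
    if match:
        return ("html-meta", match.group(1))
    match = JS_LOCATION.search(body)
    if match:
        return ("javascript", match.group(1))
    return ("none", None)


if __name__ == "__main__":
    page = '<meta http-equiv="refresh" content="0; url=http://example.com/next">'
    print(classify_redirect(200, {}, page))
```

A static check of this kind does not cover obfuscated JavaScript, content written into the page after it loads, or incorrect HTTP response codes, which is the limitation this background describes.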
In order to effectively, and efficiently, perform URL analysis, redirects must be recognized and a determination must be made as to which type of redirect is in use, so that the URL related content can be obtained by traversing as many redirects as possible. However, pitfalls such as, but not limited to, redirect loops, extremely long chains of redirects such as are used in some denial of service attacks, and tar-pitting, i.e., very slow redirects, must also be avoided. Currently available link-following, and/or security, systems typically fail to meet these criteria.
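The pitfalls named above (redirect loops, extremely long chains, and very slow redirects) can be guarded against with a visited set, a hop limit, and an overall time budget. The following is a minimal sketch under stated assumptions and is not the system described in this document: `follow_redirects`, `resolve`, `max_hops`, `time_budget`, and `clock` are hypothetical names, and `resolve` stands in for whatever component fetches a URL and returns the next redirect target, or `None` when there is no further redirect.

```python
import time
from typing import Callable, List, Optional, Tuple


def follow_redirects(
    start_url: str,
    resolve: Callable[[str], Optional[str]],
    max_hops: int = 10,
    time_budget: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, List[str]]:
    """Follow a redirect chain; return (outcome, chain of URLs visited).

    outcome is one of "resolved", "loop", "too_many_hops", or "timeout".
    """
    deadline = clock() + time_budget
    chain = [start_url]
    seen = {start_url}
    current = start_url
    while True:
        # Overall budget, checked between hops, bounds slow (tar-pit) chains.
        if clock() > deadline:
            return ("timeout", chain)
        next_url = resolve(current)
        if next_url is None:
            # No further redirect: the last URL in the chain is the final one.
            return ("resolved", chain)
        if next_url in seen:
            # The target was already visited, so the chain is a loop.
            return ("loop", chain + [next_url])
        if len(chain) - 1 >= max_hops:
            # Following this redirect would exceed the hop limit.
            return ("too_many_hops", chain)
        chain.append(next_url)
        seen.add(next_url)
        current = next_url


if __name__ == "__main__":
    # A dictionary stands in for the network in this example.
    table = {"http://s.example/1": "http://s.example/2",
             "http://s.example/2": "http://example.com/page"}
    print(follow_redirects("http://s.example/1", table.get))
```

Because the time budget is only checked between hops, a real `resolve` would also need its own per-request timeout so that a single slow response cannot stall the traversal.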
Related to the problem of redirects is the issue of tracking bugs. Tracking bugs are typically small pieces of code or images that must be executed or retrieved in order to obtain the URL webpage content required. Although in some cases it might be possible to retrieve content without retrieving the tracking bug, the failure to retrieve the tracking bug is often noted by the site's operator and will cause the connecting Internet Protocol (IP) address, i.e., the URL analysis system, to be banned at the Domain Name System (DNS) level from all sites hosted on that system, thereby effectively blocking a current link-following, and/or security, system from following the current URL and any other associated/hosted URLs.
Another associated issue is that of DOM manipulation. The DOM is a cross-platform and language-independent convention for representing and interacting with objects in HTML, Extensible Hypertext Markup Language (XHTML), and Extensible Markup Language (XML) documents. Aspects of the DOM, such as its “Elements”, may be addressed and manipulated within the syntax of the programming language in use. Using the DOM, an HTML webpage's content can be changed, or even populated, via JavaScript once the webpage is loaded. This can be an effective way for spammers to hide content using JavaScript, since any new content or changes made by JavaScript on the page would only become apparent when inspecting the DOM after any JavaScript has been executed.
In addition, JavaScript in particular can raise several difficult issues that cannot be solved by current redirect identification and link following systems, or by simply having lists of sites/URLs for which redirects should be handled. Use of JavaScript in webpages linked to from spam messages is increasing. In addition, the malicious use of JavaScript is evolving rapidly and spammers have realized that currently available naive anti-spam systems are largely powerless to detect it.
This is due, in part, to the fact that JavaScript itself is a rich and dynamic programming language which offers spammers an almost unlimited range of options when it comes to obfuscating code and making it otherwise hard to analyze. Consequently, simply retrieving the content linked to by a URL is not, in and of itself, sufficient to reliably obtain webpage content, because spammers have started making significant use of obfuscated JavaScript redirects, additional executable JavaScript, and/or hidden content that is added dynamically to redirect webpages when they are rendered, in order to conceal their spamming payloads and/or redirect chains.
Given the increased use of these techniques in webpages linked to by spam, anti-spam redirect identification and link following systems that do not address this fact can be insufficient and vulnerable to these now commonplace JavaScript issues.
As discussed above, current link-following, and/or security, systems are often unable to provide an efficient and reliable system for accessing and analyzing the webpage content associated with URLs included in messages when those URLs are redirects, and/or include tracking bugs, and/or include DOM manipulation, and/or include JavaScript redirects, the addition of executable JavaScript, and content dynamically written to the redirect webpage. As a result, currently many URLs included in messages cannot be analyzed in a reasonable time to determine if a message is spam, i.e., if the included URL is spam related. Therefore, many of these nuisance, and at times harmful, messages and included URLs currently find their way to thousands of victims each year. Clearly this is a far from ideal situation for the victims, but it is also a problem for all users of message systems, who currently must suffer with the delays and false positives, and/or must be wary of all messages, even those of seemingly legitimate origin and intent.