One major problem facing modern computing systems and communications systems is the prevalence of spam messages. Spam messages are a serious issue, not only in e-mail systems, but also in SMS, Instant Messaging (IM), and other text-based messaging systems, and in virtually every other form of electronic communication. In addition, spam-like activity and function has begun to be implemented in images distributed by various means.
One form of spam that has become more and more common is a spam message that includes a URL that, when activated, links to one or more websites that include unsolicited, malicious, unwanted, offensive, or nuisance content, such as, but not limited to: unsolicited or unwanted pornographic content; any content that promotes and/or is associated with fraud; any content that includes “work from home” or “be our representative” offers/scams; any content that includes money laundering or so-called “mule spam”; any content that promotes and/or is associated with various financial scams; any content that promotes and/or is associated with any other criminal activity; and/or any content that is unsolicited and/or undesirable, whether illegal in a given jurisdiction or not.
One method that could be used to determine if a message including a URL is potential spam, i.e., is “spammy”, is to analyze the included URL by one or more URL analysis methods such as, but not limited to: analyzing various portions of the URL; activating the URL link to the associated web page; and/or analyzing the contents of the web page linked to by the URL. However, the prevalence of URL shortening services, and other types of redirects, has significantly complicated traditional URL analysis, and in particular, has made accessing a web page, and the content of a web page, associated with a URL far more difficult.
URL shortening services typically provide users, including spammers, the ability to shorten the size, or number of characters, associated with a given URL by providing shortened URLs that map, or redirect, to the longer actual URL. URL shortening services are legitimately used to allow the URL to be included in text size limited communications, such as Twitter™. On the other hand, spammers can use URL shortening services to mask an actual spam URL, and associated web page content, by having multiple shortened URLs created that redirect to the actual URL, and/or each other.
Spammers have recently begun to regularly use URL redirects, including URL shortening service related URL redirects. In fact, many spammers now routinely employ a deeply nested series of URL redirects of various types, to frustrate, and/or avoid, URL analysis, and the retrieval of associated web page content.
Currently, redirects, and particularly nested redirects, make it difficult, if not impossible, to identify and block the spam because, using URL redirects, the spammy URL content can be hidden by way of a redirect shell game that prevents currently available link-following, and/or security, systems from automatically accessing the actual URL efficiently in a reasonable amount of time. Therefore, simply attempting to retrieve the content at an included URL will no longer reliably yield page content for analysis.
In addition, to further complicate the situation, redirects used by a spammer can be one or more of many different types of redirects, such as, but not limited to, HTTP redirects, HTML Meta redirects, and JavaScript redirects, and can include other issues such as tracking bugs, DOM manipulation, and incorrect HTTP response codes. In addition, the number of redirects that can be employed by spammers is effectively unlimited. Therefore, some spammers use multiple types of redirects, and/or a high number of redirects, to frustrate analysis. Consequently, it is not sufficient to simply have lists of sites/URLs for which redirects should be handled because there are too many sites, and too many different methods for redirection available to the spammer.
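By way of illustration only, recognizing the redirect types named above can be sketched as follows. The function name classify_redirect, the two regular expressions, and the example URLs are hypothetical and are not taken from any existing system. The patterns are deliberately simplified and would miss many real-world variants, such as redirects assembled at run time through DOM manipulation.

```python
import re
from typing import Mapping, Optional, Tuple

# Simplified, illustrative patterns; real pages vary far more widely.
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)
JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def classify_redirect(
    status: int, headers: Mapping[str, str], body: str
) -> Optional[Tuple[str, str]]:
    """Return (redirect_type, target_url), or None if no redirect is found."""
    # HTTP redirect: a 3xx status code accompanied by a Location header.
    location = next(
        (value for name, value in headers.items() if name.lower() == "location"),
        None,
    )
    if 300 <= status < 400 and location:
        return ("http", location)

    # HTML Meta redirect: <meta http-equiv="refresh" content="N; url=...">.
    match = META_REFRESH.search(body)
    if match:
        return ("meta", match.group(1))

    # JavaScript redirect: a string literal assigned to location or location.href.
    match = JS_LOCATION.search(body)
    if match:
        return ("javascript", match.group(1))

    return None
```

In this sketch the response body is inspected regardless of the status code, so a page served with an unexpected status code is still examined for Meta and JavaScript redirects.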
In order to effectively, and efficiently, perform URL analysis methods that require activating the URL and linking to the associated web page, redirects must be recognized and a determination must be made as to which type of redirect is in use, so that the URL related content can be obtained by traversing as many redirects as possible. However, pitfalls associated with redirects such as, but not limited to, redirect loops, extremely long chains of redirects, such as are used in some denial of service attacks, and tar-pitting, i.e., very slow redirects, must also be avoided. Currently available link-following, and/or security, systems typically fail to meet these criteria.
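The pitfalls noted above, namely redirect loops, extremely long chains, and very slow redirects, suggest bounding any traversal by a hop count and a time budget. The following is a minimal sketch under stated assumptions, not a description of any existing system. The name follow_redirects, the next_hop callable (assumed to return the next URL in the chain, or None when actual content has been reached), and the default limits are all hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple


def follow_redirects(
    start_url: str,
    next_hop: Callable[[str], Optional[str]],
    max_hops: int = 10,        # assumed limit, chosen only for illustration
    max_seconds: float = 5.0,  # assumed overall time budget
) -> Tuple[str, str, List[str]]:
    """Traverse a redirect chain; return (last_url, reason, chain)."""
    deadline = time.monotonic() + max_seconds
    chain = [start_url]
    seen = {start_url}
    url = start_url
    while True:
        # Guard against tar-pitting: stop once the time budget is spent.
        if time.monotonic() > deadline:
            return url, "timeout", chain
        target = next_hop(url)
        if target is None:
            return url, "final", chain
        # Guard against redirect loops: never revisit a URL.
        if target in seen:
            return url, "loop", chain
        # Guard against extremely long chains: cap the number of hops.
        if len(chain) - 1 >= max_hops:
            return url, "too_many_hops", chain
        chain.append(target)
        seen.add(target)
        url = target
```

For example, with a lookup table standing in for network access, follow_redirects("a", {"a": "b", "b": "c"}.get) walks the chain a, b, c and stops with the reason "final". The time budget here is checked only between hops, so a single slow request would additionally need its own per-request timeout.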
Related to the problem of redirects is the issue of tracking bugs. Tracking bugs are typically small pieces of code or images that must be executed or retrieved in order to obtain the URL page content required. Although, in some cases, it might be possible to retrieve content without retrieving the tracking bug, often the failure to retrieve the tracking bug is noted by the site's operator and will cause the connecting IP address, i.e., the URL analysis system, to be banned, typically at the DNS level, from all sites hosted on that system; thereby effectively blocking a current link-following, and/or security, system from following the current URL and any other associated/hosted URLs.
Another associated issue is that of Document Object Model (DOM) manipulation. The DOM is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Aspects of the DOM, such as its “Elements”, may be addressed and manipulated within the syntax of the programming language in use. Using the DOM, an HTML page's content can be changed, or even populated, via JavaScript once the page is loaded. This can be an effective way for spammers to hide URL content.
In addition, some servers return non-traditional, or incorrect, HTTP status codes, such as 404 for a deleted short link. Many current link-following, and/or security, systems treat this as an error, and therefore end analysis and attempts to obtain web-page content.
As discussed above, current link-following, and/or security, systems are often unable to provide an efficient and reliable system for accessing, and analyzing the web page content associated with, URLs included in messages that are redirects, and/or include tracking bugs, and/or include DOM manipulation. As a result, currently, many URLs included in messages cannot be analyzed in a reasonable time to determine if a message is spam, i.e., if the included URL is spam related. Therefore, many of these nuisance, and at times harmful, messages, and included URLs, currently find their way to thousands of victims each year. Clearly, this is a far from ideal situation for the victims, but it is also a problem for all users of message systems, who currently must suffer with the delays, and false positives, and/or must be wary of all messages, even those of seemingly legitimate origin and intent.