The present invention relates to systems and methods for preventing malicious attacks contained within user-generated content in an online computing setting. Traditionally, sophisticated third parties like publishers produced content for users via the Internet. Increasingly, however, users are interested in more interactive experiences. These users do not merely want to consume content; they wish for more immersive, participatory experiences. Today, users are themselves significant creators of content.
As such, user-generated content has become a rapidly expanding field. Users 110 typically create content by interacting with web applications 120 from desktop web browsers 130, mobile web browsers 140, third-party client widgets 150, third-party client libraries 160, and application programming interfaces (APIs) 170. These are the most popular mechanisms for contributing user-generated content over Hypertext Transfer Protocol (HTTP). Often, user-generated content may contain text (plain or localized in an international language), hypertext markup language (HTML), cascading style sheet (CSS) information, and JavaScript (JS), among other known script variants. User-generated content is delivered as strings and/or sequences of bytes to web applications 120 via a communications network 180, for example over HTTP, or is read from data persistence stores 190, such as databases, caches, or queues.
With the proliferation of user content, there has been an equally robust increase in the number of attacks embedded in user-generated content. These attacks enable malicious parties to gain personal (and potentially sensitive) information about users, redirect users to malicious websites, track user browsing behavior, and otherwise take advantage of users, often without the users being aware of the attack.
User-generated content can contain two significant attack variants: cross-site scripting (XSS) and structured query language (SQL) injection. An XSS attack exploits security vulnerabilities found in web applications. XSS enables an attacker to inject a client-side script into web pages viewed by other users, allowing the attacker to bypass access controls. XSS is possible through malicious JS tags, attributes, and protocols, CSS properties, and rich media tags. XSS attacks accounted for roughly 84% of all security vulnerabilities documented by a major security firm in 2007. XSS attacks can read and write web browser cookies (containing private user data), create web application requests on behalf of a user without the user's knowledge, redirect users to malicious websites, and perform other behaviors that take advantage of a user's trust.
In contrast, SQL injection is designed to attack data-driven applications. This is accomplished by placing fragments of a SQL query into an input variable supplied by a web application user. When the input is evaluated by the application, the tainted SQL query is executed, allowing attackers to create, read, update, or delete (CRUD) potentially sensitive information in a database.
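By way of illustration only, the following minimal sketch shows how a fragment of SQL placed in an input variable can alter the query that is executed. The table, the column names, and the function `find_user_unsafe` are hypothetical and are not taken from any particular web application; an in-memory SQLite database is assumed purely for convenience.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable by design: the user-supplied string is spliced
    # directly into the SQL text before the query is evaluated.
    query = "SELECT name, email FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


# Hypothetical data store standing in for a web application's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice@example.com"), ("bob", "bob@example.com")],
)

# Ordinary input matches a single row.
benign = find_user_unsafe(conn, "alice")

# Tainted input closes the string literal and appends a condition that
# is always true, so the query matches every row in the table.
tainted = find_user_unsafe(conn, "nobody' OR '1'='1")

print(benign)
print(len(tainted))
```

In this sketch the tainted input turns the WHERE clause into `name = 'nobody' OR '1'='1'`, so rows are returned that the caller never asked for.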
Currently, a number of techniques exist to reduce the danger of user-generated content attacks. The most commonly employed techniques utilize filters that attempt to prevent XSS and SQL injection attacks by using a “blacklist” to remove content. As used herein, the term “blacklist” means a source of information that enumerates a list of pre-defined attacks to be removed. The process of using the blacklist to perform transformations employs a strategy of applying heuristics via string and regular expression (regex) replacements. At runtime, loading the blacklist typically proceeds as follows:
a) Load a blacklist from disk/memory;
b) Verify the integrity of the blacklist;
c) Iterate through the blacklist while generating key/value objects as a representation of the blacklist (typically performed to avoid heavy disk reads and unnecessary computation cycles).
After the blacklist has been loaded, it can be used to remove malicious content and potential content attacks. The blacklist process comprises the following steps:
a) Iterate through each of the key/value objects that represent the blacklist;
b) Perform a string/regular expression replacement with each of the objects, thereby transforming the original content;
c) Return the transformed content.
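The loading and filtering steps described above can be sketched as follows. This is a simplified illustration rather than any particular filter's implementation: the entries in `RAW_BLACKLIST` and the functions `load_blacklist` and `apply_blacklist` are hypothetical, the blacklist is assumed to be held in memory rather than read from disk, and the integrity check is assumed to consist only of confirming that each pattern compiles.

```python
import re

# Hypothetical blacklist: (pattern, replacement) pairs enumerating
# pre-defined attacks. A real blacklist would be far longer.
RAW_BLACKLIST = [
    (r"<script[^>]*>", ""),
    (r"</script>", ""),
    (r"javascript:", ""),
]


def load_blacklist(raw_entries):
    """Load the blacklist, verify it, and build key/value objects."""
    blacklist = {}
    for pattern, replacement in raw_entries:
        # Integrity check (assumed): re.compile raises re.error
        # if a pattern is malformed.
        compiled = re.compile(pattern, re.IGNORECASE)
        # Keyed by pattern text so the compiled form is built only once.
        blacklist[pattern] = (compiled, replacement)
    return blacklist


def apply_blacklist(blacklist, content):
    """Iterate the key/value objects, replace matches, return the result."""
    for compiled, replacement in blacklist.values():
        content = compiled.sub(replacement, content)
    return content


blacklist = load_blacklist(RAW_BLACKLIST)
cleaned_script = apply_blacklist(blacklist, "<script>alert(1)</script>hello")
cleaned_link = apply_blacklist(blacklist, '<a href="javascript:alert(1)">x</a>')

print(cleaned_script)
print(cleaned_link)
```

Each entry is applied once, in order, to the output of the previous replacement; the filter has no knowledge of any attack that is not enumerated in the list.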
Unfortunately, current methods that use blacklists to filter content attacks in user-generated content are insufficient to prevent many of these attacks from succeeding, and may otherwise degrade the content. This is because blacklist-based security filtering suffers from three major drawbacks. The first drawback is that these filters are applied iteratively, in a manner that removes first-level attacks but completely misses nested attacks. One example is a concatenation-based attack whose fragments come together only after blacklist filtering.
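A concatenation-based nested attack of this kind can be illustrated with the following minimal sketch. The pattern `SCRIPT_TAG` and the function `single_pass_filter` are hypothetical stand-ins for one blacklist entry and its replacement step; they are assumptions made for illustration only.

```python
import re

# Hypothetical blacklist entry: remove opening and closing script tags.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def single_pass_filter(content):
    # One replacement pass, as in a typical blacklist filter.
    return SCRIPT_TAG.sub("", content)


# The attacker nests a complete tag inside a broken one. Removing the
# inner tags joins the outer fragments into a new, working tag.
nested = "<scr<script>ipt>alert(1)</scr</script>ipt>"
filtered = single_pass_filter(nested)

print(filtered)
```

The filter removes only the inner `<script>` and `</script>` tags it recognizes. The surrounding fragments `<scr` and `ipt>` then concatenate, so the filtered output is itself a script tag that did not exist as a contiguous string in the input.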
Secondly, existing blacklist-based filters run the risk of removing fragments of content that resemble HTML, CSS, or SQL injection but are in fact benign. The intent and fidelity of the source content may therefore be compromised.
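The second drawback can be illustrated with a short sketch in which a blacklist entry aimed at SQL keywords damages ordinary prose. The pattern `SQL_KEYWORDS` and the function `strip_sql_keywords` are hypothetical and deliberately naive; they are assumptions for illustration and do not reproduce any specific filter.

```python
import re

# Hypothetical, deliberately naive blacklist entry for SQL keywords.
SQL_KEYWORDS = re.compile(
    r"\b(select|drop|delete|insert|update)\b\s*", re.IGNORECASE
)


def strip_sql_keywords(content):
    # Remove anything that resembles a SQL keyword, attack or not.
    return SQL_KEYWORDS.sub("", content)


# Benign user-generated text that merely resembles SQL.
original = "Please select your seat and update your profile."
damaged = strip_sql_keywords(original)

print(damaged)
```

The input contains no attack, yet the words "select" and "update" are removed and the sentence no longer conveys what the user wrote.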
Lastly, and possibly most importantly, these filters become outdated as soon as new attack variants are discovered. The entire system is built on existing attack definitions and cannot respond to attacks that have not yet been defined. Consequently, such a system has virtually no defense against undefined and newly discovered attacks, such as “zero-day” exploits.
It is therefore apparent that an urgent need exists for improved systems and methods for preventing attacks contained within user-generated content. Such systems and methods enable attack prevention that does not depend on reacting to newly introduced attacks, and may prevent attacks more accurately than current systems.