1. Technical Field
This disclosure relates generally to content filtering.
2. Background of the Related Art
Unsolicited commercial email, commonly known as “spam,” clogs mail servers and email inboxes. Most existing anti-spam solutions are based on a content filter, which examines the text of an email and uses a set of rules to determine whether the recipient might want to receive it. The rules include “blacklists” and “whitelists,” which are commonly used by software to filter content. For example, an email spam filter may remove emails based on a blacklist of known keywords. Alternatively, emails may be retained as safe based on a whitelist of sender IP addresses. Often, both whitelist and blacklist rules are applied together, theoretically for improved results. Such rules, however, may conflict with one another with no clear resolution. For example, how should a spam filter handle an email that contains a spam keyword (identified on a blacklist) but was sent by a whitelisted server? Or, how should a filter handle content when the filter rules affect only portions of the content rather than the entire message? One known approach is to apply whitelists and blacklists sequentially (e.g., apply a whitelist, then apply a blacklist, or vice versa). Another approach is to implement a rules engine rather than an explicit blacklist or whitelist, where the engine implements a hierarchical tree of decision nodes, each of which may be a conditionally-applicable whitelist or blacklist action. The result is a chain of applied whitelist/blacklist actions. While these techniques provide some advantages, in effect they amount to multiple-pass filtering, which is computationally intensive and still may not resolve rule conflicts adequately. As a consequence, even when these advanced techniques are used, unwanted spam may pass the filter, and legitimate email may be incorrectly tagged as spam.
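The rule conflict described above can be illustrated with a minimal sketch. It is not taken from any particular spam filter: the keyword list, the sender address, and the function names are all hypothetical, and the two functions simply model the two sequential orderings mentioned above under the assumption that the first matching list decides the outcome.

```python
from dataclasses import dataclass

# Hypothetical rule data, invented purely for illustration.
BLACKLISTED_KEYWORDS = {"lottery", "free money"}
WHITELISTED_SENDER_IPS = {"192.0.2.10"}


@dataclass
class Email:
    sender_ip: str
    body: str


def is_whitelisted(email: Email) -> bool:
    """True when the sending server's IP address is on the whitelist."""
    return email.sender_ip in WHITELISTED_SENDER_IPS


def is_blacklisted(email: Email) -> bool:
    """True when the body contains any blacklisted keyword (case-insensitive)."""
    text = email.body.lower()
    return any(keyword in text for keyword in BLACKLISTED_KEYWORDS)


def whitelist_then_blacklist(email: Email) -> str:
    """Sequential filtering in which a whitelist match is decided first."""
    if is_whitelisted(email):
        return "allow"
    if is_blacklisted(email):
        return "block"
    return "allow"


def blacklist_then_whitelist(email: Email) -> str:
    """Sequential filtering in which a blacklist match is decided first."""
    if is_blacklisted(email):
        return "block"
    return "allow"  # whitelisted or unmatched mail is allowed


# An email that matches both lists: a spam keyword from a whitelisted server.
conflicting = Email(sender_ip="192.0.2.10", body="You won the LOTTERY!")
decision_a = whitelist_then_blacklist(conflicting)
decision_b = blacklist_then_whitelist(conflicting)
```

In this sketch the same email receives opposite decisions depending only on the order in which the lists are consulted, which is the ambiguity that sequential application leaves unresolved.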
Email is a simplified case where the filter chooses only to allow or block the email. The problem of disambiguating conflicting content filtering rules is exacerbated when the filter is an active content filter (ACF). Active content filtering is a technique to protect a computing device from potentially harmful active content (i.e., programs or code) embedded in downloaded HTML. Downloading refers to transferring a document, a file, or information from a remote computing device to a local computing device. Such downloading can occur over a local area network or over a wide area network, such as the Internet. If unfiltered, active content in a downloaded document may perform unwanted or unauthorized (generally referred to as harmful) actions on the local computing device, with or without the user knowing. To protect the local computing device from such actions, it is known to parse the contents of an HTML document to syntactically identify items within the document that are considered harmful. These items can vary, not only syntactically (e.g., multiple tags), but also in terms of granularity (e.g., tags, attributes, or specific values within attributes and tags). A record of such items may be kept within an editable configuration file. When new, potentially harmful items become known, an administrator or user can edit the configuration file to include these new items. Thus, the protection of the local computing device is able to keep pace with the development of new, potentially harmful active content. Except for those edits to the configuration file, changes to client-side or server-side software are not required to upgrade the filtering capability of the active content filter to respond to new forms of active content.
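The configuration-driven filtering described above might be sketched as follows. This is an illustrative assumption, not the format or logic of any actual active content filter: the `CONFIG` dictionary stands in for the editable configuration file, its keys and entries are hypothetical, and the class uses Python's standard `html.parser` module to show filtering at three granularities (tags, attributes, and attribute values).

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical stand-in for the editable configuration file.
CONFIG = {
    "tags": {"script", "object"},                     # drop element and its content
    "attributes": {"onclick", "onload"},              # drop attribute on any tag
    "attribute_values": {"href": ("javascript:",)},   # drop attribute by value prefix
}


class ActiveContentFilter(HTMLParser):
    """Re-emits HTML while omitting the items listed in the configuration."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def _is_harmful_attr(self, name, value):
        if name in self.config["attributes"]:
            return True
        prefixes = tuple(self.config["attribute_values"].get(name, ()))
        return value is not None and value.strip().lower().startswith(prefixes)

    def handle_starttag(self, tag, attrs):
        if tag in self.config["tags"]:
            # An unclosed removed tag drops the rest of the document,
            # which is a conservative choice for this sketch.
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = [(n, v) for n, v in attrs if not self._is_harmful_attr(n, v)]
        rendered = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.config["tags"]:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser unescapes text, so escape it again on output.
            self.out.append(escape(data, quote=False))


def filter_html(html_text, config=CONFIG):
    parser = ActiveContentFilter(config)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Under these assumptions, responding to a newly discovered harmful item means adding an entry to the configuration (for instance, adding `"iframe"` to the `"tags"` set) while the filtering code itself stays unchanged.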
With an ACF, some portions of the HTML content pass the filter while other portions are removed. For this type of content filter, conflicting content filtering rules may create further unintended results.