With the advent of computers and distributed computing environments, including the Internet and a variety of intranets, computer users regularly receive electronic documents containing HTML for viewing in a web browser. For example, computer users may receive Web pages containing graphic images and hypertext links to other Web pages in the body of electronic mail (e-mail) messages from friends, business associates, and advertisers. These electronic documents may also contain scripts embedded in the HTML for executing programmatic content without user interaction, such as displaying animation to a user when the HTML is viewed in a web browser.
Unfortunately, computer users often receive HTML in electronic documents from unknown senders that is undesirable or even malicious in some manner. Similarly, computer users sometimes receive undesirable or malicious HTML in electronic documents from senders they believe to be familiar to them. For example, a computer user may receive HTML in e-mail through an Internet or Webmail service provider such as HOTMAIL (http://www.hotmail.com), provided by THE MICROSOFT NETWORK® online service (http://www.msn.com); YAHOO!® MAIL (http://mail.yahoo.com); MAIL.COM™ (http://www.mail.com); and WHOWHERE? (http://whowhere.lycos.com). These Webmail providers typically use ‘cookies’ for authenticating users who visit their websites. Cookies are unique pieces of data transmitted from an Internet content provider to a user's computer when a user visits a website so that the content provider can recognize the user (cookies may also be stored on a user's computer so that a website may recognize the user when the user revisits the site). When the Webmail provider recognizes a cookie that is stored on the user's computer, the provider may log the user in using a user name and password stored in the cookie. HTML sent in an electronic document by a malevolent hacker to a user of one of these Webmail providers may execute a malicious script in the background to “steal” the user's cookie (and thus the user's login information) when the user opens an e-mail message and views the content of the document in a browser. With the user's login information obtained, the hacker may then log in to the user's e-mail account to access the user's address book, send e-mail, read the user's e-mail, etc. In addition to “cookie stealing,” malevolent hackers may also obtain a user's personal information through cross-page communication or through the abuse of site-locked ActiveX controls.
To counter malicious scripting attacks carried in HTML, software developers have employed filters to prevent known harmful content from running on a user's computer. For example, Webmail providers, such as HOTMAIL, currently provide filters for blocking scripts (such as JavaScript) in the HTML of incoming e-mail messages before the messages are opened by the user. These current filters utilize a “dictionary attack,” comparing an HTML file to a list of known patterns to identify harmful content. These current filtering methods, however, only look for known scripting attacks or patterns in HTML that are listed in the dictionary, without examining the structure of the HTML itself. Thus, if the filter encounters HTML written in an unrecognized pattern, the HTML is “passed through” the filter to the user. Savvy hackers have discovered and exploited these “filtering holes” by incorporating malicious HTML in otherwise harmless content. For example, a malicious script contained in an HTML image element for displaying an image to a user would not be blocked by the filter because that pattern is not listed in the filter's dictionary.
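The “filtering hole” described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pattern list, the function and class names, and the sample messages are hypothetical and do not reproduce any provider's actual filter. The first function mimics a dictionary-style filter that only matches listed patterns. The second part walks the element structure with Python's standard html.parser module to show, in a simplified way, what examining the structure could reveal. It is not presented as the filtering method itself.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical dictionary of known-bad patterns (assumed for illustration only).
KNOWN_BAD_PATTERNS = ["<script", "</script>"]


def dictionary_filter_blocks(html: str) -> bool:
    """Return True if any listed pattern appears in the HTML (case-insensitive)."""
    lowered = html.lower()
    return any(pattern in lowered for pattern in KNOWN_BAD_PATTERNS)


class ScriptAttributeFinder(HTMLParser):
    """Walks start tags and records attributes that could carry script."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and unquotes values.
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on") or value.startswith("javascript:"):
                self.findings.append((tag, name))


def structural_findings(html: str) -> List[Tuple[str, str]]:
    """Return (tag, attribute) pairs that look like script carriers."""
    finder = ScriptAttributeFinder()
    finder.feed(html)
    finder.close()
    return finder.findings


# Hypothetical sample messages.
known_attack = "<p>Hello</p><script>steal()</script>"
image_attack = '<img src="javascript:steal()" alt="photo">'

print(dictionary_filter_blocks(known_attack))  # listed pattern is present
print(dictionary_filter_blocks(image_attack))  # no listed pattern matches
print(structural_findings(image_attack))       # structure exposes the img attribute
```

In this sketch the dictionary check flags the first message but passes the image element through, because the string "<script" never appears in it. The structural pass reports the image element's src attribute regardless of whether that pattern was ever added to a list.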
Therefore, there is a need in the art for a method and system for filtering malicious HTML by examining the structure of HTML in electronic documents to prevent harmful programmatic content from executing on a computer user's computer.