A Web browser or Internet browser is a program used to view HTML (Hypertext Markup Language) documents. A Web browser translates an HTML document comprising one or more HTML codes into the text and graphics displayed when viewing a Web page. Currently the most common Web browser is the Internet Explorer® technology-enabled browser, available from Microsoft Corporation of Redmond, Wash. Netscape Navigator is another Web browser and is available from Netscape Communications Corporation of Mountain View, Calif.
Secure Sockets Layer (SSL) is a commonly used protocol for managing the security of a message transmission on the Internet. SSL is included as part of both the Microsoft and Netscape browsers and most Web server products. SSL uses the public-and-private key encryption system from RSA Security Inc. of Bedford, Mass., which also includes the use of a digital certificate. A digital certificate is an electronic “credit card” that establishes a user's credentials when doing business or other transactions on the Web. A digital certificate is issued by a certification authority (CA) and typically contains a user's name, a serial number, one or more expiration dates, a copy of the certificate holder's public key, and the digital signature of the certificate-issuing authority so that a recipient can authenticate the sender.
While most websites provide SSL for sensitive data, users may configure their Internet browser to enable or disable SSL. Even if an Internet browser is adapted to enable SSL, a user may choose not to use SSL.
The term “spoof site” is used to describe a Web site created by an imposter with the purpose of tricking computer users into providing private information. Spoof sites are designed to copy the exact look and feel of a “real” site (e.g. http://www.americanexpress.com), but any information entered at the spoof site is received by the imposter, not the creators of the “real” site. After building such a site, the imposter typically sends an email with a message such as “Your account is limited,” or “We require additional information,” or “Due to a security breach, we need to verify your information.” This is known as “phishing.”
The phishing email typically includes a link to a website. The displayed website address typically includes character strings that resemble the name of the “real” site (e.g. http://www.americanexpress.com/ . . . ), but the URL actually included in the email contains a series of numbers, a string containing the URL of the “real” site followed by cryptic-looking information, or something that resembles an email address.
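The three deceptive URL forms described above can be illustrated with a minimal Python sketch. The helper names (actual_host, belongs_to), the example.net domain, and the 203.0.113.7 address are hypothetical and chosen for illustration only; the sketch merely shows that the host a browser would contact can differ from the name that appears in the URL text.

```python
from urllib.parse import urlsplit

def actual_host(url: str) -> str:
    """Return the host a browser would contact for this URL."""
    return (urlsplit(url).hostname or "").lower()

def belongs_to(url: str, expected_domain: str) -> bool:
    """True only if the URL's host is the expected domain or a subdomain of it."""
    host = actual_host(url)
    return host == expected_domain or host.endswith("." + expected_domain)

EXPECTED = "americanexpress.com"  # the "real" site from the text

examples = [
    # Genuine address of the "real" site.
    "http://www.americanexpress.com/login",
    # A series of numbers (hypothetical address) in place of a host name.
    "http://203.0.113.7/americanexpress/login",
    # The real site's name followed by further text (hypothetical domain).
    "http://www.americanexpress.com.example.net/verify",
    # Something resembling an email address: text before "@" is not the host.
    "http://www.americanexpress.com@203.0.113.7/login",
]

for url in examples:
    print(actual_host(url), belongs_to(url, EXPECTED))
```

Only the first example resolves to the expected domain. Such a check addresses the link text alone and does not establish that the content of a displayed page is authentic.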
If the authentication features of SSL are not enabled, a computer user is unable to distinguish a real website from a spoof site. Users therefore interact with the spoof site, often entering sensitive information such as card numbers, account numbers, personal identification numbers (PIN), passwords, addresses, social security numbers, etc., thereby allowing the imposters access to the sensitive information.
FIG. 1 is a block diagram that illustrates how an Internet Explorer® technology-enabled Web browser processes a Web page received from a Web server. As shown in FIG. 1, a browser 120 receives HTML file 100 from a Web server 160. A pre-parser Web interface 125 of browser 120 converts the HTML codes in HTML file 100 to a canonical format to facilitate parsing of the HTML codes in HTML file 100, and stores the converted HTML file to a persistent storage medium such as a hard disk. The conversion typically includes changes such as inserting missing elements (such as HEAD tags), stripping quotes, capitalizing tag names, and converting open tags to closed tags. Thus the resulting parsed HTML file 150 retrieved by the pre-parser Web interface 125 typically differs from original HTML file 100.
Typical solutions to phishing include computing a hash value over the HTML codes being displayed by the Web browser 120, and comparing the computed hash value to a hash value computed over the original HTML file 100. In one solution, the function URLDownloadToFile( ) in Internet Explorer's URLMON library is used to read the data back in from disk. Unfortunately, the data read back is the parsed HTML file 150, not the original HTML file 100. Because a hash value is computed over the contents of an HTML file, the hash of original HTML file 100 will not match the hash of parsed HTML file 150 whenever the conversion has changed those contents. For this reason, validation results based on the parsed HTML file 150 could be erroneous.
In another solution, having to read the HTML codes back from persistent storage is avoided by calling the Internet Explorer® function CreateURLMoniker( ) and calling IMoniker's BindToStorage( ) function to retrieve the bytes via an IStream. The simpler Application Programming Interface (API) call URLOpenStream( ) encapsulates this functionality. Unfortunately, this solution is inefficient as it requires re-downloading of the file. Additionally, the re-downloaded file is not the same physical file that is being displayed by the browser. For this reason, validation results based on the re-downloaded file could be erroneous.
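The hash mismatch described above can be illustrated with a minimal Python sketch. The function toy_canonicalize( ) is a hypothetical stand-in that applies only two of the changes mentioned earlier (capitalizing tag names and stripping quotes); it is not the browser's actual conversion. The choice of SHA-256 is likewise an assumption, since no particular hash function is named here.

```python
import hashlib
import re

def toy_canonicalize(html: str) -> str:
    """Hypothetical, simplified conversion: not the browser's real pre-parser."""
    # Capitalize tag names in both opening and closing tags.
    html = re.sub(
        r"<(/?)([a-zA-Z][a-zA-Z0-9]*)",
        lambda m: "<" + m.group(1) + m.group(2).upper(),
        html,
    )
    # Strip double quotes around attribute values that contain no spaces.
    html = re.sub(r'="([^"\s>]*)"', r"=\1", html)
    return html

def sha256_hex(text: str) -> str:
    """Hash the document contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Stand-ins for original HTML file 100 and parsed HTML file 150.
original = '<html><body><a href="login.html">Sign in</a></body></html>'
parsed = toy_canonicalize(original)

print(parsed)
print(sha256_hex(original) == sha256_hex(parsed))  # False: the contents differ
```

Even though both versions would render identically, the two hash values differ, so a comparison against a hash of the original file fails for an authentic page.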
A need exists in the art for an improved solution that allows an application program to verify the authenticity of pages displayed from an Internet site.