HyperText Markup Language (HTML) is the predominant markup language used for delivering and displaying web pages. It provides a means to describe the structure of text-based information in a web page by denoting certain text as links, headings, paragraphs, lists, and so on. HTML code can be written by a web designer, generated automatically, or both. A web server provides the HTML code in response to a request for a web page in accordance with the HyperText Transfer Protocol (HTTP). The HTML code is then received by a web client, such as a web browser, and rendered as a web page for viewing. HTML code can also supplement markup text with interactive forms, embedded images, and other objects. HTML code is written in the form of tags surrounded by angle brackets.
HTML comprises components called “elements.” Elements provide the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. An element has a start tag (e.g., <element-name>) and usually an end tag (e.g., </element-name>). The element's attributes are contained in the start tag, and any content is located between the start and the end tags (e.g., <element-name attribute="value">Content</element-name>). Some elements, such as <br>, have neither content nor an end tag.
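The relationship between start tags, attributes, content, and end tags can be illustrated with a minimal sketch that uses Python's standard `html.parser` module. The class name `ElementLogger`, the helper `parse_events`, and the sample markup are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Records start tags (with attributes), text content, and end tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the start tag.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # Text located between a start tag and the next tag is the content.
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_events(markup):
    """Parse a markup string and return the recorded events in order."""
    parser = ElementLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical sample: a paragraph element with one attribute and a <br> inside.
sample = '<p class="intro">First line<br>Second line</p>'
events = parse_events(sample)
for event in events:
    print(event)
```

In this sketch the `<br>` element produces a start event but no content and no end event, which matches the description of elements that have no end tag.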
A malicious user can attack the operation or performance of a web server, for example by gaining unauthorized access to the server and changing web page code or operating parameters, or by exploiting web programming weaknesses. One way to detect an attack is to examine the contents of web pages provided by a web server in response to HTTP requests. The process of scanning web pages can be automated to a degree by detecting changes in a web server's HTTP responses to requests for a web page, such as changes in the response time or changes in the web pages provided in response to identical requests. One challenge faced by automatic web scanners is that many things besides an attack can cause a web server response to change, such as rotating ad banners, time stamps, and page hit counters. Typically, such content can change even if web page requests are identical. As used herein, the term “dynamic” is used to indicate web page content that changes in the responses to identical web page requests. The term “static” is used to indicate web page content that does not change in the responses to identical web page requests.
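The distinction between static and dynamic content can be sketched as follows. This is only an illustration under simplifying assumptions: responses are compared line by line, and a line counts as static only if it appears in every response to the identical request. The function name `split_static_dynamic` and the sample response bodies are hypothetical.

```python
def split_static_dynamic(responses):
    """Split the lines of the first response into static and dynamic lines.

    responses: list of response bodies (strings) received for identical
    requests. A line is treated as static if it occurs in every response,
    and as dynamic otherwise (a simplifying assumption).
    """
    line_sets = [set(body.splitlines()) for body in responses]
    common = set.intersection(*line_sets)
    first_lines = responses[0].splitlines()
    static = [line for line in first_lines if line in common]
    dynamic = [line for line in first_lines if line not in common]
    return static, dynamic


# Hypothetical bodies returned for two identical requests.
response_a = "<h1>Welcome</h1>\n<p>Page hits: 1041</p>\n<p>About us</p>"
response_b = "<h1>Welcome</h1>\n<p>Page hits: 1042</p>\n<p>About us</p>"

static_lines, dynamic_lines = split_static_dynamic([response_a, response_b])
print("static:", static_lines)
print("dynamic:", dynamic_lines)
```

Here the hit counter line is classified as dynamic because it differs between the two responses, while the heading and the final paragraph are classified as static.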
Differential analysis is a technique used to compare two or more HTTP responses to determine if there are differences between them, and if so, identify the differences. One of the challenges of using differential analysis to analyze web pages is unknown web application behavior. For example, HTTP responses may contain dynamically generated content such as ad banners, page request counts, time stamps, and other elements that are independent of the request parameters. Two responses to identical requests can be the same with regard to the content that was requested, yet still differ because their dynamic content has changed.
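A basic form of differential analysis between two responses can be sketched with Python's standard `difflib` module. This is an illustrative sketch rather than a description of any particular scanner: the function name `diff_responses` and the sample bodies are hypothetical, and the comparison is assumed to operate on whole lines.

```python
import difflib


def diff_responses(first, second):
    """Return the differing line ranges between two response bodies.

    Each entry is (operation, lines_from_first, lines_from_second), where
    operation is 'replace', 'delete', or 'insert' as reported by difflib.
    An empty list means the two bodies are identical line for line.
    """
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


# Hypothetical responses to identical requests; only the time stamp differs.
first_body = "<h1>News</h1>\n<p>Served at 10:00:01</p>\n<p>Story text</p>"
second_body = "<h1>News</h1>\n<p>Served at 10:00:07</p>\n<p>Story text</p>"

changes = diff_responses(first_body, second_body)
for change in changes:
    print(change)
```

In this example the only reported difference is the time stamp line, even though the requested content is the same in both responses. A difference of this kind, caused by dynamic content rather than by an attack, is what makes naive comparison unreliable.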