Field of the Invention
The present invention is generally related to identifying structural differences, as potentially distinct from content differences, between possibly related documents and, in particular, to identifying relevant structural differences between documents realized by networked computer systems in the form of document object models for purposes of, among others, efficiently detecting and characterizing potential security vulnerabilities.
Description of the Related Art
The use of the World Wide Web, often simply referred to as the Internet or the Web, has grown in recent years to the point where Internet access is an essential component of almost all commerce, entertainment, social and business communications, and educational activities. Growth, both in terms of essential significance to users and frequency of use, is widely expected to continue for quite some time. Indeed, perhaps the most defining characteristic of the Internet is the ability to automatically and efficiently route vast amounts of data between users and Web site servers largely independent of the distributed geographic locations of the users and servers.
Unfortunately, perhaps the second most defining characteristic of the Internet is that virtually any presence on the Internet, whether present as a Web browser or Internet server, creates a security exposure. The threats to the client computer systems that execute Web browsers are generally well known. Anti-virus and other anti-malware client programs are available to protect client systems. Conversely, Internet servers and in particular Internet Web servers represent complex, often highly customized systems that are not generally amenable to generic protection schemes. Moreover, the content and function of the Web sites hosted by Internet Web servers are constantly subject to change as appropriate for the commerce, entertainment, social and business communications, and educational activities hosted by the site. As the site presence changes, the nature and extent of the site security vulnerabilities also change. Whether pursued for purposes of economic or privacy theft, industrial espionage, or vandalism, protecting Web sites against security exposures is an ongoing, difficult, and expensive imperative.
In general terms, most information exchanged over the Internet, including specifically the information provided by Web servers, is organized by data exchange protocols, site defining domains, and document paths. Together, these elements make up a Uniform Resource Locator (URL) or, more generally, a Uniform Resource Identifier (URI). The form and usage of URLs and URIs are standardized through the work of the World Wide Web Consortium (W3C; www.w3.org), an international community that develops open standards to ensure the long-term growth of the Web.
For Web servers, information is exchanged using the HyperText Transfer Protocol (HTTP), also as standardized through the work of the W3C. Using HTTP for transport, Web information is exchanged in an encoded form as defined by the HyperText Markup Language (HTML), again as developed through the work of the W3C.
The domain identifier portion of a URL is used to identify the site of a Web server. The domain identification may resolve to an actual, proxy, or virtual site somewhere accessible via the Internet, though typically one that is in some manner appropriate to respond to HTTP requests, among others. The path portion of the URL nominally provides a path-oriented selector of a particular document, typically representing a Web page, from a collection of such documents hosted by the domain-identified Web server. Thus, a user can retrieve, on demand, almost any identifiable document from a domain-identified Web site.
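By way of illustration only, the protocol, domain, and path elements of a URL described above can be separated with a few lines of Python using the standard `urllib.parse` module. The helper name `split_url` and the address `www.example.com` are hypothetical and are chosen solely for this sketch.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a URL into the three elements discussed in the text."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # data exchange protocol, e.g. "http"
        "domain": parts.hostname,   # site-defining domain
        "path": parts.path,         # document path within the site
    }


# Hypothetical example address, not taken from the source document.
example = split_url("http://www.example.com/catalog/item.html")
```

In this sketch, `example["domain"]` identifies the Web server site and `example["path"]` selects one document from the collection hosted by that site.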
In response to a URL request, information representing the corresponding Web page is transferred to the user. Typically, this Web page data is received and rendered by a client Web browser executed on a computer system local to the user. Although there are many different specific implementations, typically a rendering engine embedded within the client Web browser executes to decode and parse the received Web page HTML data into an internal data structure generically known as a document object model (DOM). From the DOM, the rendering engine then defines and transfers a graphical representation of the Web page into the local memory of the client Web browser display device. The client Web browser can also operate to capture user actions and selections, including data entered through Web page forms, and related information designated for capture by the HTML and enscripted coding of the Web page. The captured data is then transferred back to the Web site server or other designated computer system using an HTTP-defined transfer method.
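As a simplified sketch of the parsing step described above, the following Python fragment builds a small tree of element nodes from HTML text using the standard `html.parser` module. It is not a browser rendering engine and does not implement the full HTML parsing rules. The names `Node`, `TreeBuilder`, `parse`, and `outline` are hypothetical and introduced only for illustration. The `outline` helper reduces a tree to its element names alone, which shows how two pages can differ in content while sharing the same structure.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag (subset of the HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in a minimal DOM-like tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects as tags are encountered."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node):
    """Return nested tuples of tag names only, ignoring all text content."""
    return (node.tag, tuple(outline(child) for child in node.children))


page_a = "<html><body><h1>Hello</h1><p>One</p></body></html>"
page_b = "<html><body><h1>Goodbye</h1><p>Two</p></body></html>"
page_c = "<html><body><h1>Hello</h1><p>One</p><form></form></body></html>"

same_structure = outline(parse(page_a)) == outline(parse(page_b))
changed_structure = outline(parse(page_a)) != outline(parse(page_c))
```

Here `page_a` and `page_b` differ only in text, so their outlines match, while `page_c` adds a form element and therefore yields a different outline.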
A Web site can be as simple as a single, statically defined Web page. Other sites can host Web page document collections that range, in effect, from hundreds to tens of thousands or even millions of distinct Web pages, all of which can be transferred on demand to a client browser. Conventionally, such larger sites, sites hosting frequently changing content, user interactive sites, and others subject to specialized needs, will utilize Web servers with a dynamic Web page generation capability. Dynamic page generation systems typically operate in near real-time to construct Web pages in response to URL-defined requests. Information captured from user actions and inputs can also be used to dynamically define or influence the constructed appearance and content of a generated Web page. This also allows information produced or gathered from other sources, perhaps other users or third-party data feeds, to be dynamically composed into the generated Web pages. Because these Web pages are dynamically generated in direct response to a client Web page request, the generated Web page will desirably present the most current available information. Even as between simultaneously received, otherwise identical requests from different users, a Web page generator can produce different instance Web pages based on external and user specific information, such as inferred geographic location, preferred language, expressed interests, past browsing history, and other similar factors determined in relation to the Web page request as received.
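The following minimal Python sketch illustrates, under simplifying assumptions, how a dynamic page generator may return different instance Web pages for otherwise identical requests. The function `generate_page`, its parameters, and the sample greetings are hypothetical and do not represent any particular Web server. User-supplied values are escaped before being composed into the page, since unescaped input is one common source of the security exposures discussed in this section.

```python
from html import escape


def generate_page(path: str, language: str = "en", interests=()) -> str:
    """Compose a page from the requested path and user-specific factors."""
    greeting = {"en": "Welcome", "fr": "Bienvenue"}.get(language, "Welcome")
    # Escape every user-influenced value before placing it in the markup.
    items = "".join(f"<li>{escape(topic)}</li>" for topic in interests)
    return (
        f"<html><body><h1>{greeting}</h1>"
        f"<p>Requested: {escape(path)}</p>"
        f"<ul>{items}</ul></body></html>"
    )


# Two requests for the same path produce different instance pages.
page_en = generate_page("/news", "en", ["sports"])
page_fr = generate_page("/news", "fr", ["cinema", "travel"])
```

In this sketch the two pages differ both in content (the greeting) and in structure (the number of list items), even though the requested path is the same.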
Although technically complex, the access barrier to receiving Web pages and providing for the return of user data is low. Almost any current computer system, network appliance, or other client device capable of Internet access can interact with remote sites through HTTP requests and HTML content-based responses. Given that HTTP is conventionally implemented on a layered stack of network communications protocols, a similarly low barrier exists for client and remote server interactions using any of these other protocol layers. Highly interactive Web sites, sites that offer enhanced or specialized services, and other similarly complex Web sites will often utilize elements or functions provided by these other communication protocol layers.
From a security point of view, every host server operation executed and every protocol layer used to receive and respond to a Web browser URL request represents a risk of an exploit that could compromise the operation or integrity of the Web server computer system. These risks can range, in various forms, from denials of service to interference with the proper operation of different elements of the Web server computer system. In addition, these risks include breaches that allow injection of corrupting operations or outright access to sensitive or confidential information held by or accessible from the Web server system.
Consequently, a need exists for a system and methods for continuously ensuring that security exposures in any Web server system can be identified and managed before they can be exploited, without imposing excessive performance penalties or altering the current low barrier to access enjoyed by users.