The present invention generally relates to analysis of Internet content for user access. More particularly, the present invention relates to pre-fetching linked content for analysis in advance of user access via a proxy.
A World Wide Web (Web) proxy pre-fetches content embedded in other content being cached and/or otherwise provided via the proxy. A proxy is a computer or Web service, for example, that offers a network service to allow clients to make indirect network connections to other network services. A client connects to the proxy server, then requests a connection, file and/or other resource available on a different server. The proxy provides the resource either by connecting to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's request or the server's response for various purposes.
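By way of illustration only, the cache-or-connect behavior described above may be sketched as follows. The class name CachingProxy, the fetch_from_origin callable, and the stand-in origin function are hypothetical names chosen for this sketch and are not taken from any particular proxy product; a deployed proxy would additionally handle cache expiry, headers, and errors.

```python
from typing import Callable, Dict


class CachingProxy:
    """Hypothetical proxy core: serve from cache, else connect to the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        # fetch_from_origin is an assumed callable standing in for the
        # network connection to the server specified by the client.
        self._fetch = fetch_from_origin
        self._cache: Dict[str, bytes] = {}
        self.origin_requests = 0

    def get(self, url: str) -> bytes:
        # Serve the resource from the cache when it is already stored.
        if url in self._cache:
            return self._cache[url]
        # Otherwise connect to the specified server and remember the result.
        body = self._fetch(url)
        self.origin_requests += 1
        self._cache[url] = body
        return body


def fake_origin(url: str) -> bytes:
    # Stand-in for a remote server so the sketch runs without a network.
    return ("content of " + url).encode("utf-8")


proxy = CachingProxy(fake_origin)
first = proxy.get("http://example.com/")
second = proxy.get("http://example.com/")  # answered from the cache
```

In this sketch the second request for the same URL does not reach the origin, which corresponds to the proxy serving the resource from a cache.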
Certain proxies may be implemented as Web proxies, for example, to attempt to block offensive Web content. Web proxies may also reformat Web pages for a particular purpose and/or audience (e.g., reformatting Web pages for cell phones and personal digital assistants). Network operators can also deploy proxies to intercept computer viruses and other hostile content served from remote Web pages.
Certain Web proxies are classified as “CGI proxies.” CGI or Common Gateway Interface proxies are Web sites that allow a user to access another Web site through the CGI proxy, for example. CGI proxies generally use a hypertext processor, such as PHP, or CGI to implement proxying functionality. CGI proxies may be used to gain access to Web sites blocked by corporate or school proxies. Since a CGI proxy may also hide a user's own Internet Protocol (IP) address from Web sites accessed through the proxy, CGI proxies may also be used to gain a degree of anonymity. Such use of a CGI proxy is referred to as “Proxy Avoidance.”
Many organizations, including corporations, schools, and families, use a proxy server to enforce acceptable network use policies (e.g., using censorware) or to provide security, anti-malware and/or caching services. A traditional Web proxy is not transparent to the client application, which must be configured to use the proxy (manually or with a configuration script). In some cases, where alternative means of connection to the Internet are available (e.g., a SOCKS or other Internet server or Network Address Translation (NAT) connection), the user may be able to avoid policy control by resetting a client configuration and bypassing the proxy. Furthermore, administration of browser configuration can be a burden for network administrators.
An intercepting proxy combines a proxy server with NAT. Connections made by client browsers through the NAT are intercepted and redirected to the proxy without client-side configuration. Intercepting proxies may be used in businesses to prevent avoidance of acceptable use policy and to ease administrative burden, since no client browser configuration is required. Intercepting proxies may also be used by Internet Service Providers in many countries in order to reduce upstream link bandwidth requirements by providing a shared cache to their customers.
Typically, users spend a reasonable amount of time reading a Web page after the page has been requested. Users are more likely to click on or select a link on the current page rather than enter a completely unrelated Web address or URL (Uniform Resource Locator).
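Because the next page requested is likely to be one linked from the current page, the set of candidate pages can be enumerated from the current page's anchor tags. The following is a minimal sketch of that enumeration using the Python standard library; the names LinkCollector and extract_links are hypothetical, and the sketch considers only href attributes of anchor elements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the current page.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    # Returns the distinct link targets in document order.
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<a href="/a">A</a> <a href="http://other.example/b">B</a> <a href="/a">again</a>'
candidates = extract_links(page, "http://example.com/index.html")
```

The resulting list identifies pages a user may plausibly request next while reading the current page.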
A number of Web proxy and content filtering products include an ability to analyze the contents of a requested Web page. For example, a Web page URL and/or component content may be compared against a list of blocked URLs, a list of allowed URLs, malware and/or other content definitions, etc. However, Web pages are fetched only if the pages are explicitly requested. That is, content analysis must be done at the time the page is first fetched. For example, content analysis is executed when a user clicks on or selects a link on a Web page to access another Web page. Content analysis at access time may restrict the depth of analysis possible, as the streaming latency of a requested page must be kept to a minimum. Thus, systems and methods providing more detailed and/or customized analysis of Web pages would be highly desirable.
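The list comparison described above may be sketched, under stated assumptions, as a simple host lookup. The function name classify_url, the three result labels, and the choice to match on host name rather than on the full URL are all assumptions made for this illustration and do not describe any particular filtering product.

```python
from urllib.parse import urlsplit


def classify_url(url, allowed_hosts, blocked_hosts):
    """Return "allow", "block", or "analyze" for a requested URL.

    Matching on the host name only is an assumption of this sketch;
    a real filter may also match paths, categories, or page content.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host in allowed_hosts:
        return "allow"
    if host in blocked_hosts:
        return "block"
    # Not on either list: deeper content analysis would be needed.
    return "analyze"


allowed = {"intranet.example"}
blocked = {"bad.example"}

decision_a = classify_url("http://intranet.example/home", allowed, blocked)
decision_b = classify_url("http://Bad.example/page", allowed, blocked)
decision_c = classify_url("http://unknown.example/", allowed, blocked)
```

A lookup of this kind is fast, whereas the "analyze" case is the one constrained by the latency concern noted above when it must run at request time.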
Current systems, such as the NetCache DynaBLocator, either use a static list of URLs for Web page content analysis or analyze a page when the page is requested. Therefore, content analysis must be quick to keep page rendering latency low. When viewing pages via a Web proxy, some Web sites referred to in a current page may not be accessible due to a policy restriction. A user browsing the Internet may be frustrated to discover, only after clicking on a link, that access to a particular site is blocked. Thus, there is a need for systems and methods to improve Web page content analysis while maintaining a low page rendering latency.