Websites are generally composed of Hypertext Markup Language (HTML) content. Individual web pages may contain links to other web pages, files, multimedia, and so forth. A website may be associated with a particular Domain Name System (DNS) name, such as “www.microsoft.com,” that resolves to one or more Internet Protocol (IP) addresses identifying one or more servers hosting the website. Web pages within a website are typically identified either by paths added to the DNS name, such as “www.microsoft.com/office/,” where “office” specifies a virtual directory that contains a web page, or by query strings, such as “www.website.com/pages?content=123,” where the server interprets “content=123” to identify a particular web page stored at or generated by the server. Many content management systems (CMS) automatically generate pages and links within pages based on a content structure defined at the server.
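The two addressing styles described above can be illustrated with a short sketch. It uses Python's standard urllib.parse module purely for illustration; the helper name describe_url is hypothetical, and the "//" prefix is an assumption added because the example addresses are written without a scheme.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url: str):
    """Split a scheme-less address into host, path, and query parameters."""
    # Assumption: the input has no scheme, so "//" is prepended to make
    # urlsplit treat the leading component as the host name.
    parts = urlsplit("//" + url)
    return parts.hostname, parts.path, parse_qs(parts.query)

# Path-based identification: the page is named by a virtual directory.
host_a, path_a, query_a = describe_url("www.microsoft.com/office/")

# Query-string identification: the server interprets "content=123".
host_b, path_b, query_b = describe_url("www.website.com/pages?content=123")
```

In the first case the path alone carries the page identity; in the second the path is generic and the query parameter selects the page.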
At times, it is useful to move content from its original location or to provide users with a friendlier or more memorable path to access content than the one provided automatically by a CMS. In some cases, an administrator may prefer a path that provides search engine optimization (SEO) advantages by containing particular keywords in the uniform resource locator (URL) path. Existing URL rewriting components of web servers allow a server administrator to specify rules for mapping incoming URLs provided by clients accessing a website to internal URLs that the web server recognizes. For example, a URL rewriting rule may allow users to specify a path “www.website.com/games” to access game-related content, rather than the less user-friendly URL at which the server actually provides the content, such as “www.website.com/?content_id=1234&layout=column.”
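A request-side rewriting rule of the kind described above might be sketched as follows. This is a minimal illustration and not the rule syntax of any particular web server; the rule table REWRITE_RULES and the function rewrite_request_path are hypothetical names, and only the "/games" mapping comes from the example in the text.

```python
import re

# Hypothetical rule table: each entry pairs a pattern for the public,
# user-friendly path with the internal URL the server recognizes.
REWRITE_RULES = [
    (re.compile(r"^/games/?$"), "/?content_id=1234&layout=column"),
]

def rewrite_request_path(path: str) -> str:
    """Map an incoming request path to its internal URL, if a rule matches."""
    for pattern, internal_url in REWRITE_RULES:
        if pattern.match(path):
            return internal_url
    # No rule matched: pass the path through unchanged.
    return path
```

Note that such a rule operates only on the incoming request; it does nothing to links that appear in the page the server sends back.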
Unfortunately, existing rewriting components of web servers focus on requests and often handle links in responses poorly or not at all. A web page provided by a web server in response to a request may contain links to other pages or content using URLs that are not modified in the same way that request URLs are modified. Administrators may become confused trying to maintain two different URL schemes (e.g., an original and a user-friendly one) and may make mistakes so that links to content do not work as expected. In addition, SEO advantages may not be realized when links in responses are not transformed in the same way as request URLs. Users and software (such as search engines) also may miss relationships between content that does not share a common path. For example, while it may be clear that “/news” and “/news/sports” are related, it is less apparent that the same relationship exists on a particular web server when a link to the same content is not properly rewritten, such as “/pages?article_id=abcd.” Although some solutions attempt to rewrite content in responses, they do so in a string-based manner, crudely searching the entire response for matching text strings without context. This can lead to errors in web pages and replacement of text that should not be replaced (e.g., quoted text that is not part of a link), as well as other problems.
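The difference between string-based replacement and replacement that respects markup context can be shown with a small sketch. It is an illustration under stated assumptions, not a description of any existing product: the class LinkRewriter, the helper functions, and the sample page are hypothetical, and the parser is Python's standard html.parser. The sketch handles only start tags, end tags, text, and character references; comments, doctype declarations, and self-closing tags are not preserved.

```python
from html import escape
from html.parser import HTMLParser

INTERNAL = "/pages?article_id=abcd"   # URL as generated by the server
FRIENDLY = "/news/sports"             # URL the administrator wants exposed

class LinkRewriter(HTMLParser):
    """Re-emit HTML, changing only href values on <a> tags."""

    def __init__(self):
        # Keep character references untouched so text passes through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value == INTERNAL:
                value = FRIENDLY
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def rewrite_naive(page: str) -> str:
    # String-based: replaces every occurrence, regardless of context.
    return page.replace(INTERNAL, FRIENDLY)

def rewrite_links_only(page: str) -> str:
    # Context-aware: only link attributes are candidates for replacement.
    parser = LinkRewriter()
    parser.feed(page)
    parser.close()
    return "".join(parser.out)

page = (
    '<p>The old address was "/pages?article_id=abcd".</p>'
    '<a href="/pages?article_id=abcd">Sports</a>'
)
naive_result = rewrite_naive(page)
aware_result = rewrite_links_only(page)
```

With this sample page, the string-based version also alters the quoted address in the paragraph text, while the tag-aware version changes only the link target and leaves the quoted text intact.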