A variety of methods exist for collecting browse history data that reflects the browsing behaviors of users across multiple web sites. For example, proxy servers and other intermediary systems commonly log the URL (Uniform Resource Locator) requests of users by monitoring network traffic. As another example, some companies offer browser toolbars or plug-ins that report the URL requests made by users back to the company's servers. Browsing behaviors are also commonly logged by the origin servers of the web sites. The browse history data collected by these and other methods may be used for a variety of purposes, such as to generate personalized content for users or to generate statistical data that is useful to content providers.
Regardless of the particular data collection method or methods used, the collected browse history data frequently includes personally identifiable information (PII) in the form of URL parameters. Ideally, the PII should be removed from the logged URLs, or the PII-containing URLs deleted, to reduce or eliminate privacy issues. The task of identifying URLs and URL parameters that contain PII, however, is complex, as different web sites commonly use very different URL formats.
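The difficulty can be illustrated with a minimal sketch. It assumes a naive approach in which PII is stripped from logged URLs by matching parameter names against a fixed list. The function name `strip_pii_params`, the parameter names in `ASSUMED_PII_PARAMS`, and both example URLs are hypothetical and chosen only for illustration; they do not describe any particular system.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names assumed to carry PII (illustration only).
ASSUMED_PII_PARAMS = {"email", "name", "phone"}


def strip_pii_params(url, pii_params=ASSUMED_PII_PARAMS):
    """Return url with query parameters whose names are in pii_params removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in pii_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Made-up logged URLs from two imaginary web sites.
site_a = "http://www.example.com/search?q=shoes&email=jane%40example.com"
site_b = "http://shop.example.org/acct?u=jane%40example.org&page=2"

cleaned_a = strip_pii_params(site_a)  # the "email" parameter is dropped
cleaned_b = strip_pii_params(site_b)  # "u" is not on the list, so the PII survives
```

In this sketch the first URL is cleaned, but the second passes through unchanged because the imaginary site carries the e-mail address in a parameter named `u`. A fixed name list therefore cannot be relied upon across sites that use different URL formats.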