The Internet is a worldwide public system of computer networks providing information, shopping capabilities and other kinds of business opportunities accessible to tens of millions of people. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the web.” The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page). A client program known as a browser (e.g., MICROSOFT® INTERNET EXPLORER®, GOOGLE® CHROME®, MOZILLA® FIREFOX®, APPLE® SAFARI®) runs on a user's computer, issues requests to servers (i.e., web servers) that provide the content of a web page, and displays that content in human-readable form. The request is typically in the form of a Uniform Resource Locator (URL) that identifies a web page or other information resource.
In addition to identifying a resource location, the URL may contain other information. In some cases, the information may be private information such as personal information, personally identifying information or other sensitive private information. Unscrupulous web sites or other parties may use the information in ways that a user may find objectionable or undesirable. As a result, it is often desirable to remove such information from data streams.
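As an illustration of how a URL can carry values beyond the resource location, the following minimal Python sketch lists the name/value pairs in a URL's query string. The function name `url_values`, the host `shop.example.com` and the sample email address are hypothetical and chosen only for this example.

```python
from urllib.parse import urlsplit, parse_qsl


def url_values(url):
    """Return the (name, value) pairs carried in the URL's query string."""
    # parse_qsl percent-decodes each value, so "%40" becomes "@".
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


# Hypothetical URL: the "email" parameter carries personal information
# in addition to the search term.
pairs = url_values(
    "https://shop.example.com/search?q=shoes&email=jane.doe%40example.com"
)
```

Here the second pair exposes an email address to any party that sees the URL, which is the kind of value a removal method would target.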
Conventional methods of removing private information typically involve removing (i.e., stripping) private information based on an explicit match or removing private information based on a set of rules.
In the case of removal based on explicit match, sets of private information can be provided by a third party source. This information usually includes full names, addresses, email addresses, credit card numbers, driver's license numbers, etc. When a URL is being stripped, each value contained in the URL is checked against the known private values. If a URL value matches, it is removed. This method is generally considered very weak and typically removes just a small fraction of private information; a majority of private information is unaffected. Thus, this method is typically inadequate even with extensive tuning and a high number of external sources of private information.
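The explicit match approach can be sketched as follows. This is a minimal illustration that assumes the values of interest sit in the URL query string; the set `KNOWN_PRIVATE_VALUES` and the function `strip_explicit_matches` are hypothetical names introduced only for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of known private values, e.g. supplied by a third party source.
KNOWN_PRIVATE_VALUES = {"jane.doe@example.com", "4111111111111111"}


def strip_explicit_matches(url, private_values=KNOWN_PRIVATE_VALUES):
    """Drop every query parameter whose value exactly matches a known private value."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if value not in private_values
    ]
    # Rebuild the URL with only the parameters that were not matched.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A value is removed only if it appears in the list verbatim, so an email address that the third party source does not know about passes through unchanged. This illustrates why the method tends to catch only a small fraction of private information.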
Rule-based removal is a slight generalization of the explicit match method described above. Rule-based removal typically uses a list of manually created removal rules. The notion of a rule can be very general, and therefore, in theory, a removal rule can be generated for each type of private information. However, a weakness of this approach lies in how the rules are created. There is usually no automatic mechanism for their creation. The rules are typically created either from a private information list, as in the explicit match method described above, or manually by data analysts and aggregated over time. Thus, while this approach is more robust than the explicit match method, it generally takes a lot of time and resources to develop a rule set that performs well. Moreover, the web-based environment changes rapidly, and thus substantial effort is typically required to keep the rule set up to date.
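Rule-based removal can be sketched in a similar way. The example below assumes that each rule is a regular expression applied to a whole query-string value; the two patterns in `REMOVAL_RULES` and the function `strip_by_rules` are hypothetical and deliberately simplistic, standing in for rules that analysts would write and maintain by hand.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical hand-written rules; each must match an entire value to fire.
REMOVAL_RULES = [
    re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),  # looks like an email address
    re.compile(r"(?:\d[ -]?){13,16}"),        # looks like a payment card number
]


def strip_by_rules(url, rules=REMOVAL_RULES):
    """Drop every query parameter whose value is fully matched by any rule."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(rule.fullmatch(value) for rule in rules)
    ]
    # Rebuild the URL with only the parameters that no rule matched.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Unlike an explicit match, the email rule removes addresses that were never seen before. A value type with no corresponding rule, such as a phone number in this sketch, is still left in place until someone adds a rule for it, which reflects the maintenance burden described above.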