The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the Web”. The Web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the content and format of a hypermedia document (e.g., a Web page).
Each Web page can contain embedded references, referred to as “links”, to images, audio, video or other Web pages. The most common type of link used to identify and locate resources on the Internet is the Uniform Resource Locator, or URL. In the context of the Web, a user, using a Web browser, browses for information by selecting links that are embedded in each Web page.
An important aspect of browsing the Web is the use of Internet “cookies”. In general, a cookie is data that a Web server includes in the header of a response sent to a Web browser, and that the Web browser returns to the Web server whenever the Web browser requests Web pages from that Web server.
FIG. 1 is a sequence diagram that illustrates a typical exchange of cookie information based on a user request. At step 1, Web browser 102 issues a request for data from Web server 104. At step 2, Web server 104 processes the request and provides a cookie along with a response including the requested data to Web browser 102. According to the HyperText Transfer Protocol (HTTP), Web server 104 sends the response with an HTTP header that includes a “Set-Cookie” command associated with a corresponding name and value. For example, the HTTP header may include the statement “Set-Cookie: RMID=732423sdfs73242”. Thus, the name of this cookie is “RMID” and the value of the cookie is “732423sdfs73242”.
The “Set-Cookie” instruction requests Web browser 102 to store the name=value string and to send it back in all future requests to Web server 104. Thus, some time later, at step 3, Web browser 102 issues another request for data from Web server 104. This latter request includes the cookie that originated from Web server 104. Web browser 102 only offers a particular cookie to the Web server 104 (or domain) that set the particular cookie.
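The exchange of steps 1 through 3 may be illustrated with a minimal Python sketch. The host names and the in-memory `jar` dictionary are assumptions made for illustration only; an actual Web browser also applies path, expiry, and domain-matching rules that are omitted here.

```python
from http.cookies import SimpleCookie

# Value of the "Set-Cookie" header from the example in the text (step 2).
set_cookie_value = "RMID=732423sdfs73242"

# Hypothetical browser-side store: host that set the cookie -> {name: value}.
jar = {}

# Browser side: parse the header value and remember it for the originating host.
received = SimpleCookie()
received.load(set_cookie_value)
jar["server104.example"] = {name: morsel.value for name, morsel in received.items()}


def cookie_header_for(host):
    """Return the Cookie request header value for a host, or None if that host set no cookie."""
    cookies = jar.get(host)
    if not cookies:
        return None
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Step 3: the cookie is offered only to the host that set it.
print(cookie_header_for("server104.example"))
print(cookie_header_for("other.example"))
```

In this sketch the later request to `server104.example` carries the stored name=value string, while a request to any other host carries no cookie.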
Cookies can contain any arbitrary information a Web server chooses and are used to maintain state between otherwise stateless HTTP transactions. Cookies are typically used to authenticate or identify a registered user of a Web site as part of their first login process or initial site registration without requiring them to sign in again every time they access that site. Other uses include maintaining a “shopping basket” of goods selected for purchase during a session at a site, site personalization (presenting different pages to different users), and tracking a particular user's access to a site. Thus, cookies are used to uniquely identify users.
Privacy issues relating to the use of cookies have been the topic of recent discussion. Much of the discussion, however, has revolved around common misconceptions about cookies. Some misconceptions include the following: (1) cookies are like worms and viruses in that they can erase data from the user's hard disks; (2) cookies are a form of spyware in that they can read personal information stored on the user's computer; (3) cookies generate popups; (4) cookies are used for spamming; and (5) cookies are only used for advertising. In fact, cookies are only data, not program code; thus, cookies cannot erase or read information from a user's computer.
However, cookies do allow for detecting the webpages viewed by a user on a given website or set of websites. This information can be collected in a profile of the user. Such profiles are often anonymous; that is, profiles do not contain personal information of the user (e.g., name, address, etc.). More precisely, profiles cannot contain personal information unless the user has made it available to some sites.
Furthermore, profiles are not easy to generate because a profiler must reach agreements with different websites to place content, such as ads, on those websites. When a Web browser downloads webpages from those websites, the Web browser also downloads the ads from the profiler. The profiler can then set a cookie with respect to the Web browser and then determine which websites the particular user is visiting. Additional information on cookies is provided in Request for Comments (RFC) 2109.
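The profiling mechanism described above may be sketched as follows. The `Profiler` class, the identifier format, and the site names are hypothetical and serve only to show how a single cookie set by the profiler links visits across the websites that embed its ads.

```python
import itertools


class Profiler:
    """Hypothetical third-party ad server that sets one cookie per Web browser."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.profiles = {}  # cookie value -> websites on which the ad was loaded

    def serve_ad(self, cookie, embedding_site):
        """Handle an ad request and return the cookie the browser should keep."""
        if cookie is None:  # first contact: assign a new anonymous identifier
            cookie = f"profile-{next(self._next_id)}"
        self.profiles.setdefault(cookie, []).append(embedding_site)
        return cookie


profiler = Profiler()
browser_cookie = None  # the browser initially holds no cookie from the profiler
for site in ["news.example", "shop.example", "travel.example"]:
    # Each page embeds an ad, so the browser contacts the profiler and
    # returns whatever cookie the profiler set earlier.
    browser_cookie = profiler.serve_ad(browser_cookie, site)

print(profiler.profiles)
```

The resulting profile is keyed by an anonymous identifier and contains no personal information unless the user has supplied such information to one of the participating sites.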
Although the use of cookies is a relatively innocuous technique for a website to track activity of a particular user on that website, many users would feel more comfortable knowing that the information that they share with certain websites is (1) not stored for an appreciable amount of time and (2) not shared with any other entity. Such information may include, for example, the IP address of the user, the webpages and/or files requested by the user, terms of a search query submitted by the user, etc.
One approach that a website might implement for respecting the privacy of users may be to not keep track of IP addresses and other information submitted in user requests. However, such an approach is undesirable for many reasons. In the context of Web queries, performing analysis on the search terms of a query could assist the corresponding search engine in deciding which advertisements to provide to the user.
For example, if a user has previously searched for vacation plans to Mexico and later submits a query unrelated to vacations or Mexico, the search engine could use the previous information to provide information to the user advertising certain Caribbean cruise lines. If that previous information was not stored and associated with that user, then the search engine could not leverage the previous information to its benefit.
Another example of performing analysis on search terms is in the context of click fraud. Some unscrupulous users click on advertisements without any intention of purchasing the product or service promoted thereby. Such users are motivated by generating ad revenue for themselves or depleting the ad revenue of a competitor. Keeping track of who is submitting such requests allows click fraud analyzers to identify such users and prevent further attacks from those users.
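As a simplified illustration of why stored request data matters for click fraud analysis, the following Python sketch flags sources that click the same advertisement unusually often. The threshold, the record layout, and the IP addresses are assumptions; actual click fraud analyzers would rely on far richer signals.

```python
from collections import Counter

CLICK_THRESHOLD = 3  # hypothetical cutoff chosen for illustration


def suspicious_sources(clicks, threshold=CLICK_THRESHOLD):
    """Return the sorted IP addresses that clicked any single ad more than `threshold` times."""
    counts = Counter(clicks)  # (ip, ad_id) -> number of clicks
    return sorted({ip for (ip, _ad_id), n in counts.items() if n > threshold})


# Each click is recorded as an (ip, ad_id) pair.
clicks = [("192.0.2.7", "ad1")] * 5 + [("198.51.100.4", "ad1")] * 2
print(suspicious_sources(clicks))
```

The grouping step depends on each click being associated with an identifier of its source, which is the information a privacy-preserving approach would otherwise discard.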
Another approach that a website might implement for respecting the privacy of users may be to store cookie information for users only for a limited time (e.g., two weeks) and afterwards to delete information associated with a user, such as IP address or search terms. This approach allows some analysis of user requests to be performed while protecting users' privacy in the long run. However, this approach is also unattractive because much off-line filtering and manual analysis require many consecutive months' worth of data.
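A limited-retention policy of this kind may be sketched as follows. The record fields and the dates are arbitrary assumptions for illustration; the two-week window is the example given above.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)  # the two-week example from the text


def purge_expired(records, now):
    """Keep only the records whose timestamp falls within the retention window."""
    return [r for r in records if now - r["seen"] <= RETENTION]


now = datetime(2024, 3, 15)  # arbitrary reference time
records = [
    {"ip": "192.0.2.7", "terms": "mexico vacation", "seen": datetime(2024, 3, 10)},
    {"ip": "192.0.2.7", "terms": "cruise deals", "seen": datetime(2024, 2, 1)},
]
kept = purge_expired(records, now)
print(len(kept))
```

Only the record that is five days old survives the purge, so any analysis that needs the older record can no longer be performed.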
Another approach that a website might implement for respecting the privacy of users may be to disassociate the IP address from other parts of a user request. For example, suppose a user submits a search query. The search engine may delete IP address information and store the terms of the search query. However, this approach is disadvantageous in various contexts for reasons similar to those stated above; for example, click fraud will be more difficult to identify and prevent.
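Disassociating the IP address from a stored request may be sketched as follows. The function name and the record layout are hypothetical.

```python
def disassociate_ip(request):
    """Return a copy of the request record without the IP address field."""
    return {key: value for key, value in request.items() if key != "ip"}


request = {"ip": "192.0.2.7", "terms": ["mexico", "vacation"]}
stored = disassociate_ip(request)  # only this copy would be written to storage
print(stored)
```

The stored record retains the search terms but no longer identifies the source of the request, so requests cannot later be grouped by IP address.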
Furthermore, in the Web query case, stripping out IP and cookie data may not be enough to ensure privacy of the user because search terms themselves might have enough information to deduce who sent the query.
Therefore, there is a need to better balance the privacy interests of users with the aims of various websites to perform analysis on user requests.