Cookies are small pieces of information that a web site stores on a user's computer. A cookie can be viewed and modified only by web pages on the same domain as the page that originally placed the cookie on the user's computer. Once a cookie has been placed on a user's computer, a web browser running on that computer will send that cookie along with every Hypertext Transfer Protocol (HTTP) request to the site from which the cookie originated.
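The exchange described above can be illustrated with a minimal Python sketch built on the standard-library http.cookies module. It constructs the Set-Cookie header a site might send and the Cookie header a browser would return on later requests to that site. The cookie name prefs, its value, and the helper function names are hypothetical and chosen only for illustration.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name, value, domain):
    """Return the Set-Cookie header value a server could send to a browser."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["domain"] = domain  # restricts which hosts receive the cookie
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()


def build_cookie_header(set_cookie_value):
    """Return the Cookie header value a browser would send back later."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    # Only name=value pairs are returned; attributes such as Domain are not.
    return "; ".join(f"{key}={morsel.value}" for key, morsel in parsed.items())


# Hypothetical cookie set by the example site used in this description.
set_cookie_value = build_set_cookie("prefs", "lang-en", "www.i-like-cars.com")
cookie_header = build_cookie_header(set_cookie_value)
```

In this sketch the browser side is reduced to echoing the stored name and value; a real browser would also check the domain and path of each request before attaching the cookie.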
Cookies have many legitimate, useful purposes, such as storing user preferences or automatically filling in form information that was entered in a previous session. However, cookies can also be used for illegitimate purposes. One use of cookies that many people consider to be an invasion of privacy is the tracking of user behavior on the web for the purpose of targeting users with specific advertisements. This is accomplished as follows:
A user visits a legitimate web site, for example www.i-like-cars.com. This site includes an image that is downloaded from another site, for example www.ads.com. This image may be something obvious, like an advertising banner, or it may be something that the user will not even notice, such as a 1 by 1 pixel, white Joint Photographic Experts Group (JPEG) image. When www.ads.com returns the image to be displayed on www.i-like-cars.com, it also returns a cookie that contains a unique identifier for this user. Whenever the user visits a site that contains an image to be downloaded from www.ads.com, ads.com will receive the cookie uniquely identifying the user. If www.ads.com distributes its banner ads such that the HTTP request for each banner contains the Uniform Resource Locator (URL) of the page from which the request came, ads.com will know which page the user is visiting when the user receives an advertisement.
The user later visits a different site, say www.i-like-sports.com, that contains an advertising banner from www.ads.com. The cookie that identifies the user is delivered to ads.com when the banner ad is requested. Using the unique identifier in the cookie, ads.com determines that the user previously viewed www.i-like-cars.com, and in response returns a car-related ad. Over time, ads.com will learn what sort of web pages this user visits, and will return ads that are targeted specifically to the user's interests.
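The tracking sequence described above can be sketched as a toy model in Python. The AdServer class, its method names, and the rule used to pick an ad are assumptions made for illustration only; the sketch simply shows how a unique identifier combined with the referring page lets the ad server accumulate a browsing history.

```python
import uuid
from collections import defaultdict


class AdServer:
    """Toy model of the third-party ad server (ads.com in the example)."""

    def __init__(self):
        # Maps each user identifier to the pages from which ads were requested.
        self.history = defaultdict(list)

    def handle_request(self, referrer, cookie_id=None):
        """Serve one ad request and return (cookie_id, ad)."""
        if cookie_id is None:
            # First contact: issue a cookie carrying a new unique identifier.
            cookie_id = uuid.uuid4().hex
        previous_pages = list(self.history[cookie_id])
        self.history[cookie_id].append(referrer)
        return cookie_id, self.choose_ad(previous_pages)

    @staticmethod
    def choose_ad(previous_pages):
        """Hypothetical targeting rule: prefer car ads for past car-site visitors."""
        if any("cars" in page for page in previous_pages):
            return "car-related ad"
        return "generic ad"


ads = AdServer()
# Visit to the first site: no cookie yet, so one is issued.
user_id, first_ad = ads.handle_request("http://www.i-like-cars.com/")
# Later visit to a different site: the browser returns the same cookie.
same_id, second_ad = ads.handle_request("http://www.i-like-sports.com/", user_id)
```

Here the second request is answered with a car-related ad even though it came from a sports page, because the identifier links it to the earlier visit.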
The cookie returned from ads.com in this example is known as a third-party cookie, because it belongs to a domain different from that of the primary web page currently being viewed (in this example, www.i-like-cars.com and later www.i-like-sports.com). Tracking cookies must, by definition, be third-party cookies. Since a cookie will only be sent to sites within the domain that originally issued it, first-party cookies can only be used to track a user's behavior within a single domain.
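The distinction between first-party and third-party cookies can be sketched as a simple domain comparison. The function names below are hypothetical, and the rule of treating the last two labels of a host name as its domain is a deliberate simplification: it fails for suffixes such as co.uk, for which a production implementation would typically consult the Public Suffix List.

```python
def registrable_domain(host):
    """Naive approximation of a host's domain: its last two labels."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def is_third_party(cookie_domain, page_host):
    """Return True if the cookie's domain differs from that of the page viewed."""
    return registrable_domain(cookie_domain) != registrable_domain(page_host)


# The ads.com cookie is third-party relative to the page being viewed.
ad_cookie_is_third_party = is_third_party("www.ads.com", "www.i-like-cars.com")
# A cookie set by the site itself is first-party.
site_cookie_is_third_party = is_third_party(".i-like-cars.com", "www.i-like-cars.com")
```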
One straightforward approach to protecting a user from tracking cookies is to maintain a blacklist, that is, a list of known tracking cookies. An application could then periodically scan the user's computer and delete all cookies that are on the blacklist. However, maintaining a list of every tracking cookie on the internet is difficult, and would be very labor intensive if no automation were used.
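A minimal sketch of such a scan follows, assuming each stored cookie can be identified by the pair of its domain and name. The StoredCookie type, the scan_and_delete function, and the sample entries are hypothetical; the sketch operates on an in-memory list rather than on any real browser's cookie store.

```python
from typing import NamedTuple


class StoredCookie(NamedTuple):
    """Hypothetical record for one cookie found on the user's computer."""
    domain: str
    name: str


def scan_and_delete(cookie_store, blacklist):
    """Split the stored cookies into (kept, deleted) according to the blacklist."""
    kept, deleted = [], []
    for cookie in cookie_store:
        if (cookie.domain, cookie.name) in blacklist:
            deleted.append(cookie)
        else:
            kept.append(cookie)
    return kept, deleted


# Hypothetical blacklist entry and cookie store for illustration.
blacklist = {("ads.com", "uid")}
store = [
    StoredCookie("ads.com", "uid"),
    StoredCookie("i-like-cars.com", "prefs"),
]
kept, deleted = scan_and_delete(store, blacklist)
```

The scan itself is trivial; as noted above, the difficulty lies in populating and maintaining the blacklist.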
One way to automatically build a list of tracking cookies is to use a web crawler to continually search the web. Since all tracking cookies are third-party cookies, the web crawler could simply traverse the web and store every third-party cookie that it identifies. Since there are few legitimate uses of third-party cookies, one might think that a large percentage of the third-party cookies received would be tracking cookies. However, in reality this is not the case. Several legitimate sites return cookies with every HTTP response, even in response to requests that originate from pages on third-party sites. For example, site A might want to include in its page an image that is hosted on site B. If site B is configured to issue a cookie containing default user preferences along with every HTTP response, this cookie will look like a tracking cookie when it is received with the image that is embedded at site A. If many sites embed content from site B, this cookie might look like an especially prevalent tracking cookie to a web crawler.
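The false-positive problem can be illustrated with a small simulation of the naive crawler. The crawl data, host names under the reserved .example suffix, and cookie names are invented for illustration, and the third-party test is reduced to a plain host-name comparison; no real crawl results are represented.

```python
from collections import Counter


def collect_third_party_cookies(crawl):
    """Count, per cookie, the number of crawled pages on which it appeared
    as a third-party cookie.

    crawl maps each page host to the (cookie_domain, cookie_name) pairs
    received while loading that page.
    """
    counts = Counter()
    for page_host, cookies in crawl.items():
        # Simplified test: any cookie from another host counts as third-party.
        third_party = {c for c in cookies if c[0] != page_host}
        counts.update(third_party)
    return counts


# Hypothetical crawl: three sites embed an image from site B, which sets a
# harmless default-preferences cookie on every response.
crawl = {
    "site-a.example": [("site-a.example", "session"), ("site-b.example", "default_prefs")],
    "site-c.example": [("site-b.example", "default_prefs"), ("www.ads.com", "uid")],
    "site-d.example": [("site-b.example", "default_prefs")],
}
counts = collect_third_party_cookies(crawl)
most_prevalent = counts.most_common(1)[0]
```

In this invented data set the most prevalent third-party cookie is site B's preference cookie rather than the actual tracking cookie, which is the kind of false positive described above.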
What is needed are computer-implemented methods, computer-readable media and computer systems for accurately detecting tracking cookies, without generating a large number of false positives.