This specification relates to data management.
The Internet provides access to a wide variety of content items, e.g., video and audio files, web pages, and news articles. Such access to the content items has enabled opportunities for targeted advertising. For example, content items can be identified to a user by a search engine in response to a query submitted by the user. The query can include one or more search terms, and the search engine can identify and, optionally, rank the content items based on the search terms in the query and present the content items to the user (e.g., according to the rank). The query can also be an indicator of the type of information of interest to the user. By comparing the user query to a list of keywords specified by an advertiser, it is possible to provide targeted advertisements to the user.
Another form of online advertising is advertisement syndication, which allows advertisers to extend their marketing reach by distributing advertisements to additional partners. For example, third-party online publishers can place an advertiser's text or image advertisements on web pages that have content related to the advertisement. Because users viewing a publisher's web page are likely interested in its content, they are also likely to be interested in the product or service featured in the advertisement. Accordingly, such targeted advertisement placement can help drive online customers to the advertiser's website.
The serving of the advertisements can be improved by evaluating the effectiveness of the advertisements. One technique for evaluating the effectiveness of an advertisement is evaluating online user behavior to determine whether that behavior, as manifested by web site visitations and search activity, has increased due to the display of advertisements. However, to conduct this analysis, a system accesses several data logs that store data related to online user behavior and advertisements that were served to the users. Each of the data logs is keyed to different identifiers for user devices. For example, for a particular user device, a first identifier may be used in an advertisement log that stores records detailing advertisements that were served to the user device, and a second identifier may be used in session logs that store records detailing actions taken by the user during web browsing sessions. These identifiers are not the same identifiers for several reasons. First, the advertisement management system and the system that stores session information may be disparate systems that do not coordinate the assignment of user identifiers. For example, each system may have different identifier rules, e.g., the advertisement management system may assign a new identifier every three months, while the system that stores session information may assign a new identifier every month. Second, the system designers take care to protect user privacy, and thus the identifiers may not be globally unique identifiers, and the records may not store any personal identifying information (such as a user name, address, etc.). Other privacy protection measures include the redaction of personal identifying information from records, pseudo-anonymization of the information, and access controls to the user information.
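By way of illustration only, the following minimal sketch shows how two uncoordinated systems might key their logs to different identifiers for the same user device. The salted-hash assignment scheme, the function names (`period_index`, `assign_id`, `ad_log_id`, `session_log_id`), the device label, and the record fields are all hypothetical and are not taken from any particular system; only the three-month and one-month renewal periods follow the example above.

```python
import hashlib
from datetime import date


def period_index(day: date, months_per_period: int) -> int:
    """Return the index of the renewal period that contains `day`."""
    return (day.year * 12 + day.month - 1) // months_per_period


def assign_id(system_salt: str, device: str, day: date, months_per_period: int) -> str:
    """Hypothetical identifier rule: a short hash that changes every period."""
    period = period_index(day, months_per_period)
    digest = hashlib.sha256(f"{system_salt}|{device}|{period}".encode()).hexdigest()
    return digest[:8]


def ad_log_id(device: str, day: date) -> str:
    # Assumed rule: the advertisement system renews identifiers every three months.
    return assign_id("ads", device, day, 3)


def session_log_id(device: str, day: date) -> str:
    # Assumed rule: the session system renews identifiers every month.
    return assign_id("sessions", device, day, 1)


jan, feb = date(2024, 1, 15), date(2024, 2, 15)

# The same physical device appears under unrelated keys in each log.
ad_log = [
    {"id": ad_log_id("device-A", jan), "date": jan, "ad": "shoes"},
    {"id": ad_log_id("device-A", feb), "date": feb, "ad": "shoes"},
]
session_log = [
    {"id": session_log_id("device-A", jan), "date": jan, "action": "search"},
    {"id": session_log_id("device-A", feb), "date": feb, "action": "site visit"},
]
```

In this sketch the advertisement log identifier is stable across January and February, while the session log identifier changes between the two months, so a simple join on the identifier fields cannot associate the records of the two logs with one another.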
While there are techniques to reconcile particular records stored in multiple logs to a single user device, the techniques may result in false-positives (matching two records when, in fact, they should not be matched) and/or false-negatives (not matching two records when, in fact, they should be matched).
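The two error types can be made concrete with a short sketch. It assumes, purely for illustration, that the true record pairings are known so that a reconciliation result can be scored against them; in practice such ground truth is generally unavailable. The function name `evaluate_matches` and the record labels are hypothetical.

```python
def evaluate_matches(predicted_pairs, actual_pairs):
    """Compare predicted record pairings with the true pairings.

    Returns (false_positives, false_negatives) as sets of
    (ad_record, session_record) pairs.
    """
    predicted = set(predicted_pairs)
    actual = set(actual_pairs)
    false_positives = predicted - actual  # matched, but should not have been
    false_negatives = actual - predicted  # should have been matched, but were not
    return false_positives, false_negatives


# Assumed ground truth: which advertisement records and session records
# belong to the same user device.
actual_pairs = {("ad-1", "sess-1"), ("ad-2", "sess-2")}

# Output of some hypothetical reconciliation technique.
predicted_pairs = {("ad-1", "sess-1"), ("ad-2", "sess-3")}

false_positives, false_negatives = evaluate_matches(predicted_pairs, actual_pairs)
```

Here the pairing of `ad-2` with `sess-3` is a false-positive, and the missed pairing of `ad-2` with `sess-2` is a false-negative.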