A query log contains valuable information about the searches and corresponding actions performed by users as they interact with a search engine. For example, a web query log records the queries that users issue to an Internet search engine and the results on which they click. More generally, a query log may contain the queries issued by users and the actions they perform on the displayed results in other settings (e.g., enterprise search, mobile search, database search, product catalog search/transactions, and so forth).
A query log can be very useful for improving the services offered to customers. For example, analyzing a query log can help a company improve existing products and services (e.g., keyword advertising) and build new products and services.
Moreover, query logs are valuable data sources that are not currently available to the research community. For example, in many instances an Internet query log is more useful than a web crawl or a document repository, because the query log may be used to understand the behavior of users posing queries and to develop algorithms for problems such as computing related searches, making spelling corrections, expanding acronyms, determining query distributions, classifying queries, and/or tracking changes in query popularity over time. Advertisers can use such a log to understand how users navigate to their web pages, to gain an understanding of their competitors, and to improve keyword advertising campaigns.
However, a query log contains a considerable amount of private information about individuals, and thus a search company cannot simply release such data. Indeed, user searches provide an electronic trail of confidential thoughts and identifiable information. For example, users may enter their own names or the names of their friends, their home addresses, and their own medical history as well as that of their family and friends. In the case of web query logs, users may even enter their credit card numbers and/or social security numbers as search queries, just to find out what information is present on the web.
In sum, releasing a query log is beneficial for various data-mining tasks; however, doing so risks compromising user privacy. Previous attempts to release query logs while maintaining privacy have failed. One attempt replaced user names with random identifiers; however, the searches were easy to match to individually identifiable persons based on the rest of the data. Other ad-hoc techniques, such as tokenizing each search query and securely hashing each token into an identifier, have been explored in the literature and have been shown not to protect privacy.
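
As a rough illustration of why hashing each token may fail to protect privacy, the following minimal sketch assumes a deterministic, unsalted hash and an attacker who holds a vocabulary of plausible query words. The function names, the sample queries, and the vocabulary are hypothetical and chosen only for illustration; this is a simple dictionary attack, not necessarily the analysis used in the literature referred to above.

```python
import hashlib


def hash_token(token):
    """Map a token to an opaque identifier (assumed: unsalted SHA-256, truncated)."""
    return hashlib.sha256(token.lower().encode("utf-8")).hexdigest()[:16]


def anonymize_query(query):
    """Tokenize a query on whitespace and replace each token with its hash."""
    return [hash_token(tok) for tok in query.split()]


def dictionary_attack(hashed_log, vocabulary):
    """Recover tokens by hashing every candidate word and matching identifiers."""
    lookup = {hash_token(word): word for word in vocabulary}
    return [[lookup.get(h, "?") for h in hashed_query] for hashed_query in hashed_log]


# Hypothetical raw queries and the "anonymized" log that would be released.
raw_log = ["jane doe seattle", "diabetes treatment seattle"]
hashed_log = [anonymize_query(q) for q in raw_log]

# Hypothetical attacker vocabulary (e.g., common words and names).
vocabulary = ["jane", "doe", "seattle", "diabetes", "treatment", "weather"]
recovered = dictionary_attack(hashed_log, vocabulary)

for original, guess in zip(raw_log, recovered):
    print(original, "->", " ".join(guess))
```

Because the same token always maps to the same identifier, the released log also preserves token frequencies and co-occurrences, which an attacker could compare against a public corpus even without a complete vocabulary.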