Large amounts of personal data are collated and stored by a range of businesses and organisations. In many circumstances, it is desirable to share insight and intelligence that may be gained from such data, without compromising the privacy of individuals contributing to the data. One way in which this may be achieved is through the construction of a statistical database, which holds the personal data and accepts queries from third parties. Instead of releasing individual data entries, the statistical database gives out statistical results based on the characteristics of the personal data held within the database. Example queries which may be submitted include summations or aggregate counts. In order to provide increased privacy protection for individuals contributing to the database, the statistical database may release inaccurate results, such as a range within which the query result falls, rather than the value of the query result.
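By way of illustration only, the following minimal sketch shows an aggregate count query and the release of a range in place of the exact value. The function names, the record layout and the range width of 10 are assumptions made for this example and are not taken from any particular statistical database.

```python
def count_matching(records, predicate):
    """Aggregate count query: how many records satisfy the predicate."""
    return sum(1 for record in records if predicate(record))


def as_range(value, width=10):
    """Return the inclusive range of the given width that contains value.

    The width is an illustrative assumption; a wider range hides more
    but tells the querying party less.
    """
    low = (value // width) * width
    return (low, low + width - 1)


# Hypothetical records; only the queried attribute is shown.
records = [{"age": age} for age in (23, 31, 35, 38, 44, 47, 52, 58, 61, 67, 70, 72, 75)]

exact = count_matching(records, lambda r: r["age"] >= 30)
released = as_range(exact)  # the range is released, the exact count is not
```

In this sketch the querying party learns only which range the count falls in, never the count itself.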
In principle, statistical databases permit the sharing of intelligence gained from personal data without compromising the privacy of individual data entries. However, malicious third parties, known as adversaries, may formulate queries with the specific purpose of deducing individual data entries from the database. Using carefully formulated query combinations, frequently combined with auxiliary information obtained from other independent sources, adversaries can gain access to individual data entries.
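A simple instance of such a query combination, often called a differencing attack, may be sketched as follows. The records, the attribute names and the assumption that the adversary knows from an auxiliary source that the target is the only person aged 60 or over in postcode "A1" are all hypothetical and serve only to illustrate the principle.

```python
# Hypothetical toy database; all values are invented for illustration.
RECORDS = [
    {"postcode": "A1", "age": 34, "salary": 52000},
    {"postcode": "A1", "age": 45, "salary": 48000},
    {"postcode": "A1", "age": 61, "salary": 61000},
    {"postcode": "B2", "age": 29, "salary": 39000},
]


def sum_salary(predicate):
    """Summation query: the only interface exposed to third parties."""
    return sum(r["salary"] for r in RECORDS if predicate(r))


def differencing_attack():
    """Recover one individual's salary from two legitimate-looking sums.

    Auxiliary knowledge (assumed): the target is the only person aged
    60 or over living in postcode A1.
    """
    everyone_in_a1 = sum_salary(lambda r: r["postcode"] == "A1")
    under_60_in_a1 = sum_salary(lambda r: r["postcode"] == "A1" and r["age"] < 60)
    return everyone_in_a1 - under_60_in_a1
```

Neither query on its own releases an individual data entry, yet the difference between the two results is exactly the target's salary.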
The growth of communication networks has led to an unprecedented rise in the volume and detail of personal data available to the operators of such networks and to service providers who offer services through the networks. This data may include details of subscriber interests, commercial activities and subscriber location, as well as identity data and mobility data for the subscriber. Mobility data contains the approximate whereabouts of individual subscribers in the network at any given time, and can be used to reconstruct an individual's movements over a period of time. Individual mobility traces have been used to provide personalised services to users including tracking the movement of a competitor sales force, registering subscriber attendance at a particular event or individual subscriber presence in a specific location (e.g. hotel, commercial centre or hospital). Such data may also be used by service providers or third party marketers for the development of personalised advertising campaigns. Anonymised mobility data may also be provided to third parties for use in human mobility analysis, which involves the study of individual and group movement patterns in order to provide useful insight for a range of practical applications including urban planning, traffic congestion mitigation, mass transit planning, healthcare and education planning, ecological and green development etc.
Communication network providers may thus make legitimate use of statistical data based on the large volume of information about subscribers' real world activities to which they have access. They may also make such statistical data available for legitimate third party use. However, unethical advertisers or other adversaries may seek to acquire from the statistical data sensitive information about individual subscribers in order to support aggressive or abusive marketing practices. This may involve combining data from different independent sources with complex database query combinations in order, for example, to track individual user locations or other sensitive individual data. This data may then be used for aggressive marketing or to create a highly individualised, believable message for the targeting of even more sensitive information from the user, as is the case in phishing scams and other spam mail.
Although an anonymised dataset does not contain names, home addresses, phone numbers or other direct identifiers, if an individual's mobility patterns are sufficiently unique, independently sourced secondary information may be used to link mobility data back to that individual.
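The linking step may be illustrated by the following minimal sketch, in which each anonymised trace is represented as a set of (hour, cell) points and the adversary holds a few independently observed points for the target. The pseudonyms, cell names and data layout are assumptions made purely for this example.

```python
def candidates(traces, observations):
    """Return the pseudonyms whose trace contains every observed point."""
    return [pid for pid, trace in traces.items() if observations <= trace]


# Hypothetical anonymised traces: pseudonym -> set of (hour, cell) points.
traces = {
    "u1": {(8, "cellA"), (12, "cellC"), (18, "cellB")},
    "u2": {(8, "cellA"), (12, "cellD"), (18, "cellB")},
    "u3": {(8, "cellE"), (12, "cellC"), (18, "cellF")},
}

# One known point leaves the target hidden among two candidates...
one_point = candidates(traces, {(8, "cellA")})
# ...but a second known point singles out one trace.
two_points = candidates(traces, {(8, "cellA"), (12, "cellC")})
```

Once a single candidate remains, every other point in that trace is attributable to the individual concerned, even though the dataset holds no direct identifier.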
In order to protect the privacy of individuals whose data may be held in a statistical database, techniques have been developed to ensure the anonymity of individual data entries and combat the above discussed abusive practices. A first technique is known as k-anonymity, and involves suppressing or generalising individual data attributes until each row or entry within the database is identical to at least k−1 other entries. Although this technique hides the personal identity of individuals within a database, it has been shown that adversaries possess sufficient additional sources of personal data to enable the mapping of individual users onto an anonymised data set, so compromising individual privacy.
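The k-anonymity condition may be checked as in the following minimal sketch. It assumes that identity between entries is assessed over a chosen set of attributes, referred to here as quasi-identifiers, and that generalisation replaces an exact age with a ten-year band. The function names, the band width and the records are illustrative assumptions.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every row shares its quasi-identifier values with at
    least k - 1 other rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def generalise_age(row, band=10):
    """Generalise an exact age to a band such as '30-39' (assumed scheme)."""
    low = (row["age"] // band) * band
    return {**row, "age": f"{low}-{low + band - 1}"}


# Hypothetical records.
rows = [
    {"age": 31, "postcode": "A1"},
    {"age": 34, "postcode": "A1"},
    {"age": 38, "postcode": "A1"},
    {"age": 52, "postcode": "B2"},
    {"age": 57, "postcode": "B2"},
]

raw_ok = is_k_anonymous(rows, ["age", "postcode"], k=2)
generalised = [generalise_age(row) for row in rows]
generalised_ok = is_k_anonymous(generalised, ["age", "postcode"], k=2)
```

In this example the raw rows fail the check for k = 2 because every exact age is unique, whereas the generalised rows pass it.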
Another technique which may be used to protect privacy in statistical databases is differential privacy. This technique involves adding noise to a query result before that result is released to the third party generating the query, with the aim of ensuring that the presence or absence of any particular individual in the database will not significantly affect the noise perturbed query result. In this manner, a third party is prevented from using sophisticated query combinations with auxiliary data to determine individual data entries. The noise value to be added to the query result is usually generated according to a Laplacian probability distribution, although a Gaussian distribution may also be used. The probability distribution is often scaled according to the sensitivity of the query, in an effort to balance the conflicting aims of privacy protection and the provision of useful statistical data. A probability distribution for Laplacian noise is illustrated in FIG. 1, with noise values on the x axis and probability of generating noise values on the y axis. The width of the distribution may be scaled according to the sensitivity of the query, sometimes referred to as the diameter of the query. The mean of the distribution is set to zero, such that positive and negative noise values are equally likely.
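The addition of zero-mean Laplacian noise scaled to the query sensitivity may be sketched as follows. The privacy parameter named epsilon and the choice of scale equal to sensitivity divided by epsilon follow the conventional formulation of differential privacy and are assumptions not stated in the text above. The function names are likewise hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-mean Laplacian noise value with the given scale.

    The difference of two independent exponential variates with mean
    `scale` follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_result(true_result, sensitivity, epsilon, rng=None):
    """Perturb a query result before release.

    sensitivity: the largest change one individual can cause in the result.
    epsilon: assumed privacy parameter; smaller values give wider noise.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return true_result + laplace_noise(scale, rng)


# Example: a count query (sensitivity 1) released with epsilon = 0.5.
released = noisy_result(true_result=120, sensitivity=1.0, epsilon=0.5)
```

A larger sensitivity or a smaller epsilon widens the distribution, which strengthens privacy protection at the cost of less accurate released results.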
The aim of differential privacy is to perturb the results of database queries such that the privacy of individuals is protected while still providing statistical data that is of value to third parties. While this technique has proved effective in the past, experiments have shown that, when applied to use cases involving human mobility data as well as to other existing use cases, known differential privacy techniques remain vulnerable to aggressive adversary querying strategies.