Linked Data is increasingly used on the Web by both governmental and business organizations. Linked Data is a way to publish data using standard Web technologies (HTTP and URIs) and to leverage the expressiveness of the Semantic Web: Linked Data is encoded in the Resource Description Framework (RDF), which is commonly used to describe Linked Data graphs. A Linked Data graph G is published (using RDF) as web data and is accessible via a browser.
The key differentiating strengths of Linked Data are (1) well-defined semantics, which allow automated reasoning (the ability to infer new data from existing data), and (2) the implicitly interlinked nature of the information.
In the Linked Data world, data is represented by entities with formally defined semantics: each entity has a set of properties, and a property can connect two entities or an entity to a value of a defined data type. The resulting underlying data structure is a directed labeled graph, where nodes represent entities, and edges represent properties. Entities and properties are typically uniquely identified by Uniform Resource Identifiers (URIs).
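As an illustration of this data structure, the following minimal sketch stores a directed labeled graph as a list of (subject, property, object) triples. The `http://example.org/` URIs, the sample entities, and the helper `outgoing_edges` are hypothetical and chosen only for this example; an object is either another entity URI or a (value, datatype) pair.

```python
from collections import defaultdict

# Hypothetical example namespace; none of these URIs come from the text.
EX = "http://example.org/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Each edge is a (subject, property, object) triple. The object is either
# another entity URI or a (value, datatype) pair for a typed value.
triples = [
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "age", ("34", XSD_INTEGER)),
    (EX + "bob", EX + "worksFor", EX + "acme"),
]


def outgoing_edges(triples):
    """Index the graph by subject: node -> list of (property, object)."""
    index = defaultdict(list)
    for subject, prop, obj in triples:
        index[subject].append((prop, obj))
    return dict(index)


graph = outgoing_edges(triples)
```

Here nodes are the subjects and objects of the triples, and each property URI labels a directed edge from its subject to its object.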
URIs can be dereferenced. Dereferencing consists essentially of an HTTP GET operation, which retrieves additional information about the data (entity or property) identified by the URI being dereferenced.
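A dereferencing step might look like the sketch below, which uses only the Python standard library. The function names and the choice of requesting `text/turtle` through an `Accept` header are assumptions made for illustration; a given server may offer other RDF serializations or none at all.

```python
import urllib.request


def build_dereference_request(uri, accept="text/turtle"):
    """Build an HTTP GET request asking for an RDF serialization of `uri`.

    The Accept header is an assumption: it asks the server for Turtle,
    but the server decides which representation it actually returns.
    """
    return urllib.request.Request(uri, headers={"Accept": accept}, method="GET")


def dereference(uri, timeout=10):
    """Send the GET request and return the response body as text.

    Requires network access; the URI must be resolvable over HTTP.
    """
    request = build_dereference_request(uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `dereference` on the URI of an entity or property would return whatever description the publisher serves for it, which a program can then merge into the graph it already holds.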
Well-defined semantics and URI dereferenceability make Linked Data graphs unique with respect to traditional relational data and graph data. These two characteristic aspects of Linked Data make it possible for a software program to automatically augment a given Linked Data graph with new information, either by inferring it from the semantics of the graph or by retrieving it from the Web (by dereferencing URIs). In such a scenario, it is particularly challenging to guarantee anonymity of potentially sensitive information published as a Linked Data graph.
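To make the inference risk concrete, the following sketch applies one simple rule, subclass propagation in the style of RDF Schema, until no new triples appear. The class names (`OncologyPatient`, `Patient`, `Person`), the `http://example.org/` namespace, and the function `infer_types` are hypothetical; real reasoners support many more rules than this one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace


def infer_types(triples):
    """Apply until fixpoint: (x type C) and (C subClassOf D) => (x type D)."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshots of the current facts, so `known` can grow inside the loop.
        subclass_of = [(s, o) for s, p, o in known if p == RDFS_SUBCLASS]
        typed = [(s, o) for s, p, o in known if p == RDF_TYPE]
        for entity, cls in typed:
            for sub, sup in subclass_of:
                if cls == sub and (entity, RDF_TYPE, sup) not in known:
                    known.add((entity, RDF_TYPE, sup))
                    changed = True
    return known


published = [
    (EX + "alice", RDF_TYPE, EX + "OncologyPatient"),
    (EX + "OncologyPatient", RDFS_SUBCLASS, EX + "Patient"),
    (EX + "Patient", RDFS_SUBCLASS, EX + "Person"),
]
augmented = infer_types(published)
```

In this example the published graph never states that `alice` is a `Patient`, yet the augmented graph contains that triple, which is why removing only explicitly stated facts may not be enough to anonymize such a graph.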
Existing anonymization techniques work well on either relational data or graph structures (including social network graphs).
Given a set of quasi-identifying properties of the data, traditional anonymization techniques guarantee k-anonymity; that is, for each combination of values of the quasi-identifying properties, there are either at least k entities having that combination of values (forming an equivalence class) or none. If a sensitive property is also given as input, existing techniques can also guarantee l-diversity, which ensures k-anonymity and also that each equivalence class contains at least l well-represented values for the sensitive property. There are also different variants of l-diversity, for example t-closeness, which ensures that the distance between the distribution of the values of the sensitive property in the overall data and in any equivalence class is no larger than a threshold t.
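The two guarantees can be checked on tabular records as sketched below. The record fields (`zip`, `age`, `disease`) and the function names are hypothetical, and "well represented" is interpreted here in its simplest form, as l distinct values per equivalence class; other interpretations exist.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every occurring combination is shared by at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every class has at least l distinct sensitive values.

    Assumption: 'well represented' is read as 'distinct'.
    """
    groups = equivalence_classes(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already generalized records.
records = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "asthma"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
]
qi = ["zip", "age"]
```

With these records, `is_k_anonymous(records, qi, 2)` holds, but `is_distinct_l_diverse(records, qi, "disease", 2)` does not, because every record in the second equivalence class carries the same sensitive value.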
On the other hand, existing anonymization techniques for graphs usually modify the graph structure either by changing the degree of a node or by coarsening the graph (replacing the nodes in a neighborhood with a single node). Finally, some graph anonymization techniques also exploit specific properties of the graph structure.
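Coarsening can be sketched as follows, under the assumption that the grouping of nodes into neighborhoods is given as input. The function `coarsen`, the node names, and the decision to drop edges that fall entirely inside one merged node are illustrative choices, not a description of any particular published technique.

```python
def coarsen(edges, groups):
    """Replace each grouped node with its supernode.

    edges:  iterable of (source, label, target) triples
    groups: dict mapping a node to the name of the supernode replacing it;
            nodes missing from the dict are kept unchanged.
    Edges whose endpoints fall into the same supernode are dropped
    (an assumption made for this sketch).
    """
    coarse = set()
    for source, label, target in edges:
        new_source = groups.get(source, source)
        new_target = groups.get(target, target)
        if new_source != new_target:
            coarse.add((new_source, label, new_target))
    return coarse


# Hypothetical graph: nodes "a" and "b" form one neighborhood.
edges = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "knows", "d"),
    ("c", "knows", "d"),
]
coarse_edges = coarsen(edges, {"a": "S1", "b": "S1"})
```

After coarsening, an observer sees only that some member of `S1` is connected to `c` and to `d`, not which original node holds each connection.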
There are also anonymization techniques specifically designed for social networks. One approach consists of adding some “noise” to the graph by inserting or removing edges, with the purpose of preventing attacks based on background knowledge about some neighborhood in the graph (i.e., attacks exploiting the degree of the nodes). Another approach combines k-anonymity with edge generalization, but assumes that all edges (properties) have the same meaning.