The exemplary embodiment relates to processing data and in particular to identifying co-referring entities in serialized data for structuring and enriching databases with additional information.
In recent years, many organizations have begun to make use of web services to communicate data within their organizations, or with a wider public. In order to be transferred, this data is typically serialized, a process in which the structure of the schema underlying the data is lost. The schema may be that of a structured database or of abstract objects used in programming languages. Examples of serialization formats used for serializing data objects include JSON (JavaScript Object Notation), XML (Extensible Markup Language), and YAML (a data serialization standard which is a superset of JSON and adds other features).
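The loss of schema structure during serialization can be illustrated with a minimal Python sketch. The `Person` and `City` classes and their field names are hypothetical, chosen only for illustration. The sketch shows that, once typed objects are written to JSON and read back, the receiver holds only nested dictionaries and no record of the classes that produced them.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical schema on the provider side (illustrative names only).
@dataclass
class City:
    name: str


@dataclass
class Person:
    id: int
    place_of_birth: City


person = Person(id=7, place_of_birth=City(name="Lyon"))

# Serialize for transfer: the class names Person and City are not written out.
payload = json.dumps(asdict(person))

# The receiver recovers values and nesting, but only as plain dictionaries.
received = json.loads(payload)
print(type(received).__name__, received)
```

Nothing in `payload` indicates that the inner object was a `City`, which is the kind of information a schema inference method would have to recover.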
XML schema inference has previously been studied (see, for example, U.S. Pat. No. 6,792,576). XML allows for richer expression than JSON: its syntax is stricter and the range of options available to schema designers is larger. In schema inference for XML, the identification and co-reference of entities is assumed to be straightforward, and the only difficulty is in how to learn the hierarchical relationships between them. However, JSON is becoming increasingly popular for serializing data objects, partly due to its lightweight format and human-readability. It is also a very general format, but easy to formalize (permitting only two ways of creating compound objects). The usage of JSON formats encourages a looser control of the overall structure, and therefore the use of different names to refer to the same concept is commonplace.
Since the JSON format is schemaless, the schema itself is not transmitted with the data. It would be desirable to be able to infer at least part of the schema from which the data was generated. However, a user observes only a few instantiations of the data through queries and thus the variability in structure poses challenges to the design of a schema inference engine.
Part of the information that is lost when data is serialized can be retrieved by finding out which fields correspond to the same underlying concept. For instance, the id of a Person may be referred to by another service of the same provider as PersonId. Similarly, PlaceOfBirth may also be called City.
Identifying co-referring data is less challenging when there is some coherence, such as when the data comes from the same source, although there are still problems to be solved. In the case of entities coming from a web service in a JSON format (a tree-like format), the only available context information for each node is its ancestors and descendants.
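As a minimal sketch of what this context looks like in practice, the following Python code walks a parsed JSON document and reports, for each key, its chain of ancestor keys and the set of keys appearing beneath it. The function names `descendant_keys` and `node_contexts` and the sample document are assumptions made for illustration; they are not taken from any particular method.

```python
import json


def descendant_keys(value):
    """Collect every key that appears anywhere beneath a JSON value."""
    keys = set()
    if isinstance(value, dict):
        for key, child in value.items():
            keys.add(key)
            keys |= descendant_keys(child)
    elif isinstance(value, list):
        for item in value:
            keys |= descendant_keys(item)
    return keys


def node_contexts(value, ancestors=()):
    """Yield (key, ancestor keys, descendant keys) for each keyed node."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield key, ancestors, descendant_keys(child)
            yield from node_contexts(child, ancestors + (key,))
    elif isinstance(value, list):
        # Array items carry no key of their own, so they inherit the path.
        for item in value:
            yield from node_contexts(item, ancestors)


# Hypothetical query result, used only for illustration.
doc = json.loads('{"Person": {"id": 7, "PlaceOfBirth": {"City": "Lyon"}}}')
contexts = list(node_contexts(doc))
for key, above, below in contexts:
    print(key, above, sorted(below))
```

In this sketch, atomic-valued nodes such as `id` have an empty descendant set, so their ancestor path is the only context available to distinguish them.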
Duplicate record detection (or instance matching) and similar methods have been applied to the more generic problem of ontology matching. However, in that case, the relationships between entities are much looser, with no clear hierarchical relationships. Existing methods generally perform the matching bottom-up. However, when the ancestor context is not taken into account, these approaches are less successful where lower level nodes are very similar (labels and even values are often repeated).
One attempt to address the problem in the case of JSON data is described in Cánovas Izquierdo, et al., “Discovering Implicit Schemas in JSON Data,” Proc. 13th Intern'l Conf. on Web Engineering, Web Engineering, vol. 7977 of Lecture Notes in Computer Science, pp. 68-83 (2013), hereinafter, “Cánovas.” However, there are several drawbacks in the Cánovas method. For example, concepts (compound objects) are treated differently from properties (atomic types). This may cause problems if one type were to be exchanged for another, for example, when a property is made more complicated so that an atomic type no longer suffices. Other drawbacks include concepts being treated as the same as soon as their names (key values) are the same, and properties being merged as soon as their values coincide and they belong to the same class. For example, one query result may include a Person with a weight of 65, and another query result may include a Person with an age of 65. Here, where two different properties of the same class have the same value, the Cánovas method fails. Other proposed methods simply perform one-to-one mappings of a JSON file to a JSON schema, which is too limited for most cases.
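The weight/age failure can be reproduced with a deliberately simplified Python sketch of the merging rule as characterized above (same class, same value). This is not the Cánovas implementation; the function `merge_by_value` and the record layout are hypothetical and serve only to show why value coincidence alone is an unreliable signal.

```python
def merge_by_value(records):
    """Group property names that share a class and a value.

    Simplified stand-in for a value-coincidence merging rule;
    not the published Canovas algorithm.
    """
    groups = {}
    for cls, prop, value in records:
        groups.setdefault((cls, value), set()).add(prop)
    return groups


# Two hypothetical query results, flattened to (class, property, value).
records = [
    ("Person", "weight", 65),  # from the first query result
    ("Person", "age", 65),     # from the second query result
]

merged = merge_by_value(records)
for (cls, value), props in merged.items():
    print(cls, value, sorted(props))
```

Under this rule, `weight` and `age` fall into a single group because both belong to `Person` and both equal 65, even though they denote different properties.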
There remains a need for a system and method for inferring schema and co-referring types from serialized data.