The present invention relates generally to knowledge graph augmentation, and more particularly, but not by way of limitation, to a system, a method, and a recording medium for extending a knowledge graph based on a large corpus of structured data.
Conventional techniques may rely only on labels in a schema such as column headers, may use a single statistical measure to find relevant attributes, may rely on text surrounding or describing the structured source, or may rely on query logs and thus work only when a large user base is available.
Other conventional techniques may merely create an “Attribute Correlation Statistics Database” (AcsDb) which contains attribute counts based on the headers of the respective columns. From these counts, the conventional techniques estimate attribute occurrence probabilities. Applications for this database are a schema auto-complete function, synonym generation, and a tool enabling easy join graph traversal for end-users. Such conventional techniques evaluate their Schema Auto-Complete system by giving both the system and a number of humans a single (key) attribute and asking them to create a schema, and then measuring to what extent the system can reproduce the schema created by the humans. The focus of the exemplary conventional techniques is to find names for possible attributes. The conventional techniques do not consider the content/values of these attributes and hence do not evaluate this aspect.
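By way of a non-limiting illustration only, the count-based estimation described above may be sketched as follows. The sketch is not the implementation of any AcsDb system. The function names (build_attribute_counts, conditional_probability, suggest_attributes), the normalization of headers, and the sample schemas are all hypothetical and are assumed here purely for exposition. The sketch counts how often column headers occur and co-occur across schemas, estimates the probability of a candidate attribute given a key attribute, and uses that estimate to suggest attribute names. Consistent with the limitation noted above, only header labels are used and the values in the columns are never examined.

```python
from collections import Counter
from itertools import combinations


def build_attribute_counts(schemas):
    """Count header occurrences and pairwise co-occurrences across schemas.

    `schemas` is an iterable of header lists (hypothetical input format).
    """
    single = Counter()
    pair = Counter()
    for schema in schemas:
        # Normalize and de-duplicate headers; sorting gives each pair a canonical order.
        attrs = sorted({a.strip().lower() for a in schema})
        single.update(attrs)
        pair.update(combinations(attrs, 2))
    return single, pair


def conditional_probability(single, pair, candidate, given):
    """Estimate P(candidate | given) as count(candidate, given) / count(given)."""
    if single[given] == 0:
        return 0.0
    key = tuple(sorted((candidate, given)))
    return pair[key] / single[given]


def suggest_attributes(single, pair, key_attribute, top_k=3):
    """Rank attribute names by estimated co-occurrence with the key attribute."""
    scores = {
        a: conditional_probability(single, pair, a, key_attribute)
        for a in single
        if a != key_attribute
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical sample schemas (column headers only; no cell values are used).
schemas = [
    ["name", "population", "country"],
    ["name", "population"],
    ["name", "country", "area"],
    ["title", "year"],
]
single, pair = build_attribute_counts(schemas)
suggestions = suggest_attributes(single, pair, "name")
```

In this sketch, a schema auto-complete request for the key attribute "name" would return header names ranked solely by how often they appear alongside "name", which illustrates why such an approach can propose attribute names but cannot assess the content of the corresponding columns.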
Other conventional techniques propose an integration of data using web tables, defined as an “EXTEND” operation. These techniques require the user to provide an input table and a keyword query. Based on the user-provided input, these techniques attempt to determine one or multiple tables that can extend the input table with the topic given by the keyword query. These conventional techniques use a search engine to find the tables and different value-based matching approaches to determine which tables can be combined. However, the “EXTEND” operation is evaluated by issuing a number of queries of the form (e.g., input table, join key, topic of column to be added) and checking the number of correct values that are returned. Because the new attribute is specified by the user, there is no need for a ranking by usefulness. It is unclear in these conventional techniques how the column that contains the value is selected.
Other conventional techniques use label-based and value-based schema matching techniques to map web tables to a knowledge base. For the “Schema Complement” operation of these techniques, the techniques consider all unmapped columns and rank the unmapped columns using the “AcsDb” and the entity coverage of the input table provided by the user. The goal is to rank complete tables by their usefulness for the complement task. However, such conventional techniques evaluate their system by letting users decide how related the tables in their output are. The focus is on the tables and not the attributes.
That is, the above conventional systems, and other conventional knowledge graph augmentation systems, are limited in their application in that (i) they make no claims about the content/values of the attributes and hence do not evaluate this aspect, (ii) it is unclear in these conventional techniques how the column that contains the value is selected, and (iii) the focus of such techniques is on tables, not attributes.
Thus, there is a technical problem in the conventional systems in that the methods of knowledge graph augmentation either rely on additional sources (e.g., search engine query logs) or perform only basic label-based analysis, such that they do not adequately or efficiently augment existing knowledge graphs.