The exemplary embodiment relates to data management and, in particular, to a more efficient knowledge base system.
Relational databases are widely used for storing structured information, such as entities and relations involving these entities. Conventionally, in such a database, fields are represented as columns and records as rows of a first table. Some of these fields include identifiers (IDs) in the records in place of data. Another table stores the relation information for the IDs for that particular field. A relational database management system (RDBMS) using a structured query language (SQL) controls the creation of, and access to, the data.
Large databases have been created using information extracted from freely available resources. For example, databases such as Yago and DBpedia store information that has been automatically extracted from web-based resources, such as WordNet and Wikipedia, by parsing the information provided for many different entities. Others, such as Freebase, rely on contributors to supply the information. Such resources can help in many knowledge-related tasks. For example, they can be used as training data for supervised knowledge extraction systems, or as background knowledge for coreference resolution and named entity disambiguation and linking.
As an example, online resources could be used to populate records of a table in a Knowledge Base (KB) corresponding to a relation of type “X was born in Y”, where one field corresponds to the X entities, i.e., named entities of type “person”, and another field corresponds to their respective birthplaces Y, which could be named entities of type “geographical location”. The fields of this table could use an ID for each of the person names, which is used to retrieve the person name from a separate person name table, and an ID for each of the geographical locations, which is used to retrieve the geographical location from a separate geographical location table. Given a query, “what is Picasso's birthplace?,” in an appropriate query language, the query system first accesses the person name table to find the ID for Picasso and then uses the “X was born in Y” table to find the ID of his birthplace. Finally, the geographical location table is accessed to find the name of the birthplace corresponding to the ID.
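By way of illustration only, the three-step lookup described above may be sketched as follows, using Python's standard sqlite3 module with an in-memory database. The table names (person, location, born_in), the column names, the ID values, and the helper functions build_kb and birthplace are assumptions made for this sketch and do not appear in the description itself.

```python
import sqlite3


def build_kb():
    """Create a tiny in-memory KB with hypothetical table and column names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- The "X was born in Y" relation stores only IDs, not names.
        CREATE TABLE born_in (
            person_id   INTEGER REFERENCES person(id),
            location_id INTEGER REFERENCES location(id)
        );
        INSERT INTO person   VALUES (1, 'Pablo Picasso');
        INSERT INTO location VALUES (10, 'Malaga');
        INSERT INTO born_in  VALUES (1, 10);
        """
    )
    return conn


def birthplace(conn, person_name):
    """Return the birthplace name for person_name, or None if unknown."""
    # Step 1: person name table -> person ID.
    row = conn.execute(
        "SELECT id FROM person WHERE name = ?", (person_name,)
    ).fetchone()
    if row is None:
        return None
    person_id = row[0]

    # Step 2: "X was born in Y" table -> birthplace ID.
    row = conn.execute(
        "SELECT location_id FROM born_in WHERE person_id = ?", (person_id,)
    ).fetchone()
    if row is None:
        return None
    location_id = row[0]

    # Step 3: geographical location table -> birthplace name.
    row = conn.execute(
        "SELECT name FROM location WHERE id = ?", (location_id,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    kb = build_kb()
    print(birthplace(kb, "Pablo Picasso"))
```

The three lookups are kept separate here only to mirror the sequence of table accesses in the description; an RDBMS would more commonly answer the same query with a single SQL join across the three tables.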
Another use of knowledge bases is to determine whether similar names in different documents refer to the same entity. For example, one document may use a middle name or initial when referring to a named entity, whereas another does not. By comparing the properties of the two entities (the Y values in the above example), a decision can be made as to whether the documents refer to the same person.
The data available for creation of such databases contain millions of entities and often hundreds of millions of relations involving these entities. Hence, there has been an effort to provide efficient storage of these resources that allows for both fast loading and fast query answering. One possible solution to the efficiency problem is to set up the KB system over a cluster of computers using data sharing, as is the case in distributed NoSQL DBMSs. However, this increases the overall costs, both in terms of hardware and maintenance.
A system and method are provided which can improve the performance of KB systems on conventional hardware, particularly when dealing with large KBs.