The present invention is in the field of data storage. In particular, the embodiments of the present invention relate to the storage of triples describing graph data within a distributed storage environment.
Relational databases store data in rows and columns. The rows and columns compose tables that need to be defined before storing the data. The definition of the tables and the relationship between data contained on these tables is called a schema. A relational database uses a fixed schema. Graph databases represent a significant extension over relational databases by storing data in the form of nodes and arcs, where a node represents an entity or instance, and an arc represents a relationship of some type between any two nodes. In an undirected graph, an arc from node A to node B is considered to be the same as an arc from node B to node A. In a directed graph, the two directions are treated as distinct arcs.
Graph databases are used in a wide variety of different applications that can be generally categorized into two major types. The first type consists of complex knowledge-based systems that have large collections of class descriptions (referred to as “knowledge-based applications”), such as intelligent decision support and self learning. The second type includes applications that involve performing graph searches over transactional data (referred to as “transactional data applications”), such as social data and business intelligence. Many applications may represent both types. However, most applications can be characterized primarily as either knowledge-based or transactional data applications. Graph databases can be used to maintain large “semantic networks” that can store large amounts of structured and unstructured data in various fields. A semantic network is used as a form of knowledge representation and is a directed graph consisting of nodes that represent concepts, and arcs that represent semantic relationships between the concepts.
There are several types of graph representations. Graph data may be stored in memory as multidimensional arrays, or as symbols linked to other symbols. Another form of graph representation is the use of “tuples,” which are finite sequences or ordered lists of objects, each of a specified type. A tuple containing n objects is known as an “n-tuple,” where n can be any non-negative integer greater than zero. A tuple of length 2 (a 2-tuple) is commonly called a pair, a 3-tuple is called a triple, a four-tuple is called a quadruple, and so on.
The Resource Description Framework (RDF) is a general method for conceptual description or modeling of information that is a standard for semantic networks. The amount of RDF data that is available nowadays is growing and it is already impossible to store it in a single server. In order to be able to store and search large amounts of data, the data must be maintained in multiple servers. Adding, deleting and querying data must be done in a coordinated way, using algorithms and data structures specially tailored for distributed systems. It is desirable to store graph data in a way which enables computationally efficient querying, maintenance, and manipulation of the data.
As with all computing hardware, there is always some risk that a storage node (such a server) on which data is stored will fail. Thus, it is known in the art to provide “redundant” storage nodes storing copies of data in case of failure of a storage node. However, providing such a redundant storage node can be costly in terms of infrastructure provision, maintenance, and running costs. As reliability of storage nodes increases, the provision of redundant storage nodes purely to enable data recovery in case of failure of another node is of decreasing value in terms of costs per utilisation.