Relational databases generally store data in rows and columns. Graph databases represent a significant extension over relational databases by storing data in the form of nodes and arcs, where a node represents an entity or instance, and an arc represents a relationship of some type between any two nodes. Graph database representations allow data modeling that more closely parallels the real world and provides a visual representation of connected data. In general, a graph is a set of objects, called points, nodes, or vertices, which are connected by links, called lines or edges. The edges establish relationships (connections) between the nodes. Graphs can be directed or undirected. In an undirected graph, an edge or line from point A to point B is considered to be the same as a line from point B to point A. In a directed graph (digraph), the two directions are treated as distinct arcs or directed edges.
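The distinction between directed and undirected graphs can be made concrete with a small adjacency structure. The following is a minimal sketch only; the function name `build_adjacency` and the sample nodes are illustrative assumptions and are not taken from any particular graph database.

```python
from collections import defaultdict


def build_adjacency(edges, directed):
    """Return a node -> set-of-neighbours mapping for an edge list."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        if not directed:
            # Undirected: the edge A-B is the same as the edge B-A,
            # so the reverse direction is recorded as well.
            adjacency[target].add(source)
    return adjacency


edges = [("A", "B"), ("B", "C")]
undirected = build_adjacency(edges, directed=False)
digraph = build_adjacency(edges, directed=True)

print(sorted(undirected["B"]))  # neighbours of B when direction is ignored
print(sorted(digraph["B"]))     # only the arcs that leave B
```

In the undirected case node B is linked to both A and C, whereas in the digraph only the arc from B to C leaves B.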
Graph databases are used in a wide variety of applications that can be generally categorized into two major types. The first type includes complex knowledge-based systems that have large collections of class descriptions (referred to as “knowledge-based applications”). Knowledge bases in the life sciences or biological modeling are examples of this type of graph-based application. The second type includes applications that involve performing graph searches over transactional data (referred to as “transactional data applications”). Social network analysis, telecommunications services and data mining, enterprise database integration, fraud detection, and telemetry are some examples of this second type of graph-based application. Many applications may actually represent both types; however, most applications can be characterized primarily as either knowledge-based or transactional data applications. Governments and other large entities often use graph databases to maintain large so-called “semantic networks” that can store large amounts of structured and unstructured data in various fields, such as biology, security, telecommunications, and so on. A semantic network is often used as a form of knowledge representation. It is a directed graph consisting of vertices that represent concepts, and edges that represent semantic relationships between the concepts.
Typical operations associated with graphs, such as finding a path between two nodes or finding the shortest path from one node to another node, are performed by graph algorithms. Graph algorithms are used in many types of data processing applications. One presently known graph database is the Cogito Graph Engine, which represents information as entities (nodes) and relationships (arcs) in a scalable store to execute high-performance traversal and retrieval operations. The Cogito Graph Engine provides modeling and query services in addition to fundamental graph services. Unlike other systems that use an “in-memory graph,” the Cogito Knowledge Center provides a “persistent graph” that spans the size of memory available. Users can model the data to establish various contextual relationships, ontologies and views of the information. Data points are identified as class-typed nodes in the graph overlay, and relationships are represented as arcs with definable arc types. Modeling flexibility allows analysts to change the graph model structure to easily see different perspectives on potential patterns or relationships. Once the data has been imported, modeled, and linked, it can be analyzed based on user queries. Users can query the data to see if relationships exist between seemingly unrelated data points, or identify the shortest path between particular data points, or even try to determine whether specific patterns exist within the data. The data can further be analyzed to reveal which connections are the most powerful or the weakest, how one data point affects other data points, or how information is interrelated.
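As an illustration of the kind of graph algorithm referred to above, the following sketch finds a shortest path, measured in number of arcs, using breadth-first search over an unweighted graph. It is an assumption-laden example, not a description of how the Cogito Graph Engine or any other product performs the operation; the function name `shortest_path` and the sample graph are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a fewest-arcs path or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in previous:
                continue  # already reached by a path at least as short
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the predecessor links back to the start node.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # goal is not reachable from start


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(graph, "A", "E"))
print(shortest_path(graph, "E", "A"))
```

Because the sample graph is directed, a path exists from A to E but not from E to A, in which case the function returns `None`.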
In general, there are many possible types of graph representations. Graph data may be stored in memory as multidimensional arrays, or as symbols linked to other symbols. Another graph representation uses “tuples,” which are finite sequences or ordered lists of objects, each of a specified type. A tuple containing n objects is known as an “n-tuple,” where n can be any non-negative integer. A tuple of length 2 (a 2-tuple) is commonly called a pair, a 3-tuple is called a triple, a 4-tuple is called a quadruple, and so on. Tuples are used to describe mathematical objects that consist of specified parts. Although embodiments described herein may relate exclusively to triples (3-tuples), it should be noted that such embodiments could apply to tuples of any length. Furthermore, although the term “triple” may imply a representation based on three items, it should be understood that actual computer-based implementations may involve more than three items.
In typical implementations, triples are stored in memory in the form of “triple-stores.” The triple-parts (including a unique part identifier and other fields) are all stored as columns in a table, where each column is individually indexed, or indexed on any combination of parts. One disadvantage of present methods of storing the triple-parts as strings in tables is that they are very expensive in terms of both storage and processing overhead, as they may require many long text string comparisons. Further prior art solutions are known in which every part is replaced by a unique identifier. However, these implementations have several inherent drawbacks, such as: lack of scalability, in that at present no triple-store can load much more than several billion triples; limited range queries, in that no triple-store allows for range queries on numeric values; lack of distributed processing capability, in that the same string would receive different identifiers on different machines; and other similar disadvantages.
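The idea of replacing every part with a unique identifier can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the class name `InternedTripleStore`, its methods, and the sample data are hypothetical, and the linear scan in `match` stands in for the indexes a real triple-store would maintain.

```python
class InternedTripleStore:
    """Toy triple-store that stores integer identifiers instead of strings."""

    def __init__(self):
        self._id_of = {}      # part string -> unique identifier
        self._part_of = []    # unique identifier -> part string
        self._triples = set() # set of (id, id, id) tuples

    def intern(self, part):
        """Return the identifier for a part, assigning one if needed."""
        part_id = self._id_of.get(part)
        if part_id is None:
            part_id = len(self._part_of)
            self._id_of[part] = part_id
            self._part_of.append(part)
        return part_id

    def add(self, subject, predicate, obj):
        self._triples.add(
            (self.intern(subject), self.intern(predicate), self.intern(obj))
        )

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        wanted = []
        for part in (subject, predicate, obj):
            if part is None:
                wanted.append(None)
            elif part in self._id_of:
                wanted.append(self._id_of[part])
            else:
                return []  # an unknown part cannot match any triple
        results = []
        for triple in self._triples:
            # Integer comparisons replace long text string comparisons.
            if all(w is None or w == t for w, t in zip(wanted, triple)):
                results.append(tuple(self._part_of[i] for i in triple))
        return sorted(results)


store = InternedTripleStore()
store.add("alice", "calls", "bob")
store.add("alice", "calls", "carol")
store.add("bob", "calls", "alice")
print(store.match(subject="alice"))
```

Each distinct string is stored once, and the triples themselves hold only small integers, which is the storage and comparison saving that identifier-based schemes aim for.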
To illustrate the drawbacks associated with present graph database systems, consider an example in which it is desired to store 20 billion triples with 5 billion unique strings, where each string has on average 25 Unicode characters (~50 bytes). Such a number, while very large, may represent only one month of telephone records in the United States, or all of the people in China, with each person associated with 15 descriptive terms (parts), each part stored in a “slot” of a certain type (e.g., float, integer, etc.). With regard to scalability, for part-to-identifier mapping, the fastest way of interning the strings is to keep the mapping in memory. In this case, a hash table or a trie could be used. However, for 5,000,000,000 unique parts this means that, with a simple hash table, a 7,000,000,000-entry vector on a 64-bit system (= 56 gigabytes) plus the parts themselves (= 250 gigabytes) plus the link to a unique identifier (requiring two 64-bit integers per part = 80 gigabytes) would be needed. The total memory requirement would thus be on the order of 386 GB, and this is only for part-to-number mapping. With some clever processing and the use of text tries, this could possibly be reduced to roughly 100 GB, but this is still a great deal of memory in which to store only a single mapping. Although it would be possible to store the part-to-identifier mapping completely on disk in a b-tree (prefix-b-tree, disk-trie, etc.), this means that for large numbers of parts, accessing the mapping ultimately becomes disk-bound.
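The memory estimate above can be reproduced with a short calculation. The sketch below assumes decimal gigabytes (10^9 bytes), which is what the stated figures imply; the constant names are illustrative only.

```python
UNIQUE_PARTS = 5_000_000_000   # unique strings to intern
HASH_SLOTS = 7_000_000_000     # entries in the hash table vector
POINTER_BYTES = 8              # one 64-bit slot per vector entry
AVG_PART_BYTES = 50            # about 25 Unicode characters per part
LINK_BYTES = 2 * 8             # two 64-bit integers per part
GB = 10**9                     # decimal gigabyte (assumption)

vector_gb = HASH_SLOTS * POINTER_BYTES // GB
parts_gb = UNIQUE_PARTS * AVG_PART_BYTES // GB
links_gb = UNIQUE_PARTS * LINK_BYTES // GB
total_gb = vector_gb + parts_gb + links_gb

print(vector_gb, parts_gb, links_gb, total_gb)
```

The three components come to 56 GB, 250 GB, and 80 GB respectively, for a total of 386 GB.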
With regard to the range problem, some users of a graph database only need to perform graph searching functions on their data, while others need to perform range queries on numeric values. Examples include a timestamped application (a database application that records events that are associated with a particular time), where a user seeks every event between times x and y, and real-world coordinate applications, where a user seeks every object within pre-defined boundaries. Present implementations do not facilitate any type of range query, so the user must use additional mechanisms (such as b-trees) to address range-oriented mappings.
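A range query of the kind described, every event between time x and time y, requires the numeric values to be kept in sorted order. The sketch below uses a sorted in-memory list and binary search as a simple stand-in for the b-tree mentioned above; it is not a b-tree implementation, and the class name `SortedNumericIndex` and the sample events are hypothetical.

```python
from bisect import bisect_left, bisect_right


class SortedNumericIndex:
    """Sorted (key, value) index supporting inclusive range lookups."""

    def __init__(self, pairs):
        ordered = sorted(pairs, key=lambda pair: pair[0])
        self._keys = [key for key, _ in ordered]
        self._values = [value for _, value in ordered]

    def range(self, low, high):
        """Return values whose key satisfies low <= key <= high."""
        start = bisect_left(self._keys, low)   # first key >= low
        end = bisect_right(self._keys, high)   # one past last key <= high
        return self._values[start:end]


events = SortedNumericIndex([
    (1700000300, "call-3"),
    (1700000100, "call-1"),
    (1700000200, "call-2"),
    (1700000400, "call-4"),
])
print(events.range(1700000150, 1700000300))
```

Opaque part identifiers assigned in arrival order carry no such ordering, which is why a separate ordered structure is needed when only identifiers are stored.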
With regard to distributed processing, under current implementations it is generally sub-optimal to load triples in parallel on multiple machines, especially if one wants to combine the loaded triples after the loading process. If a user processes triples from different sources in parallel (and completely separately), the only queries that can be performed after the data has been loaded are those that combine the different triple-stores by communicating through the string values of the triple-parts. If one wants the triple-parts on multiple machines to have the same unique part identifiers, the part-to-number mapping must be shared at load time. This requirement may already be a bottleneck on one machine, and is therefore even more so in a multiple-machine configuration.
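The identifier mismatch that arises from separate loading can be demonstrated with two independent loaders. This is an illustrative sketch only: the `load` function and the sample triples are hypothetical, and a single shared dictionary stands in for a part-to-number mapping shared at load time.

```python
def load(triples, ids=None):
    """Encode triples as identifier tuples, assigning ids on first sight."""
    ids = {} if ids is None else ids
    encoded = []
    for triple in triples:
        # setdefault assigns the next free identifier to an unseen part.
        encoded.append(tuple(ids.setdefault(part, len(ids)) for part in triple))
    return ids, encoded


source_a = [("alice", "calls", "bob")]
source_b = [("carol", "calls", "alice")]

# Loaded separately: each loader numbers parts in its own arrival order.
ids_a, encoded_a = load(source_a)
ids_b, encoded_b = load(source_b)
print(ids_a["alice"], ids_b["alice"])  # same string, different identifiers

# Loaded against one shared mapping: identifiers agree across sources.
shared = {}
_, shared_a = load(source_a, shared)
_, shared_b = load(source_b, shared)
print(shared_a, shared_b)
```

With separate mappings the two encoded stores can only be joined by going back to the string values, whereas the shared mapping keeps identifiers consistent at the cost of every loader contending for the same structure.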