Relational databases generally store data in rows and columns. Graph databases represent a significant extension over relational databases by storing data in the form of nodes and arcs, where a node represents an entity or instance, and an arc represents a relationship of some type between any two nodes. Graph database representations allow data modeling that more closely parallels the real world and provide a visual representation of connected data. In general, a graph is a set of objects, called points, nodes, or vertices, which are connected by links, called lines or edges. The edges establish relationships (connections) between the nodes. Graphs can be directed or undirected. In an undirected graph, an edge or line from point A to point B is considered to be the same as a line from point B to point A. In a directed graph (digraph), the two directions are treated as distinct arcs or directed edges.
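By way of illustration only, the distinction between directed and undirected graphs can be sketched with a simple adjacency map. The function name `build_adjacency` and the sample edges are hypothetical and are not drawn from any particular graph database.

```python
def build_adjacency(edges, directed):
    """Map each node to the set of nodes reachable over a single edge."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        # Make sure the target is registered as a node even if it has no outgoing edge.
        adjacency.setdefault(target, set())
        if not directed:
            # In an undirected graph, A-B and B-A are the same edge.
            adjacency[target].add(source)
    return adjacency


edges = [("A", "B"), ("B", "C")]
undirected = build_adjacency(edges, directed=False)
digraph = build_adjacency(edges, directed=True)

print(undirected["B"] == {"A", "C"})  # True: B is linked to both A and C
print(digraph["B"] == {"C"})          # True: only the arc B -> C leaves B
```

In the directed case, the arc from A to B does not imply an arc from B to A, so node B lists only C as a neighbor.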
Graph databases are used in a wide variety of applications that can generally be categorized into two major types. The first type includes complex knowledge-based systems that have large collections of class descriptions (referred to as “knowledge-based applications”). Knowledge bases in the life sciences or biological modeling are examples of this type of graph-based application. The second type includes applications that involve performing graph searches over transactional data (referred to as “transactional data applications”). Social network analysis, telecommunications services and data mining, enterprise database integration, fraud detection, and telemetry are some examples of this second type of graph-based application. Many applications may actually combine both types; however, most applications can be characterized primarily as either knowledge-based or transactional data applications. Governments and other large entities often use graph databases to maintain large so-called “semantic networks” that can store large amounts of structured and unstructured data in various fields, such as biology, security, telecommunications, and so on. A semantic network is often used as a form of knowledge representation. It is a directed graph consisting of vertices that represent concepts, and edges that represent semantic relationships between the concepts.
Typical operations associated with graphs, such as finding a path between two nodes or finding the shortest path from one node to another node, are performed by graph algorithms. Graph algorithms are used in many types of data processing applications. One presently known graph database is the Cogito Graph Engine, which represents information as entities (nodes) and relationships (arcs) in a scalable store to execute high-performance traversal and retrieval operations. The Cogito Graph Engine provides modeling and query services in addition to fundamental graph services. Unlike other systems that use an “in-memory graph,” the Cogito Knowledge Center provides a “persistent graph” that spans the size of memory available. Users can model the data to establish various contextual relationships, ontologies, and views of the information. Data points are identified as class-typed nodes in the graph overlay, and relationships are represented as arcs with definable arc types. Modeling flexibility allows analysts to change the graph model structure to easily see different perspectives on potential patterns or relationships. Once the data has been imported, modeled, and linked, it can be analyzed based on user queries. Users can query the data to see whether relationships exist between seemingly unrelated data points, identify the shortest path between particular data points, or try to determine whether specific patterns exist within the data. The data can further be analyzed to reveal which connections are the most powerful or the weakest, how one data point affects other data points, or how information is interrelated.
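As a minimal sketch of one such graph algorithm, a breadth-first search can find a shortest path (fewest arcs) between two nodes of an unweighted directed graph. This is a generic textbook technique and is not asserted to be the method used by any particular product; the function name `shortest_path` and the sample graph are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return a list of nodes on a fewest-arc path, or None if no path exists."""
    if start == goal:
        return [start]
    previous = {start: None}  # records how each visited node was first reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in previous:
                continue  # already reached by a path that is no longer
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start node.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
print(shortest_path(graph, "E", "A"))  # None, since the arcs are directed
```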
In general, there are many possible types of graph representations. Graph data may be stored in memory as multidimensional arrays, or as symbols linked to other symbols. Another graph representation uses “tuples,” which are finite sequences or ordered lists of objects, each of a specified type. A tuple containing n objects is known as an “n-tuple,” where n can be any non-negative integer. A tuple of length 2 (a 2-tuple) is commonly called a pair, a 3-tuple is called a triple, a 4-tuple is called a quadruple, and so on. Tuples are used to describe mathematical objects that consist of specified parts.
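For illustration, a graph arc can be written as a 3-tuple whose parts name the two nodes and the relationship between them. The field names `subject`, `predicate`, and `obj` below follow a common convention and are an assumption of this sketch, as are the sample data and the helper `match`.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the node the arc starts from (assumed naming)
    predicate: str  # the type of relationship
    obj: str        # the node the arc points to


triples = [
    Triple("alice", "knows", "bob"),
    Triple("bob", "knows", "carol"),
    Triple("alice", "works_at", "acme"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples whose parts equal every pattern part that is given."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


print(len(triples[0]))                   # 3, since each triple is a 3-tuple
print(match(triples, subject="alice"))   # the two arcs that start at "alice"
```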
In typical implementations, triples are stored in memory in the form of “triple-stores.” The triple-parts (including a unique part identifier and other fields) are all stored as columns in a table, where each column is individually indexed, or indexed on any combination of parts. One disadvantage associated with present methods of storing the triple-parts as strings in tables is that it is very expensive both in terms of storage and in terms of processing overhead, as this method may require many long text string comparisons. Further prior art solutions are known in which every part is replaced by a unique identifier. However, these implementations have several inherent drawbacks, such as: lack of scalability, in that at present no triple-store can load well beyond several billion triples; limited range queries, in that no triple-store using this technique allows for range queries on numeric values; lack of distributed processing capability, in that the same string would have different identifiers on different machines; and other similar disadvantages.
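The identifier-replacement approach, and the distributed-processing drawback noted above, can be sketched as follows. The class `PartInterner` is hypothetical and assumes identifiers are assigned sequentially in order of first appearance; actual triple-stores may assign identifiers differently.

```python
class PartInterner:
    """Replace each distinct part string with a small integer identifier."""

    def __init__(self):
        self._id_by_part = {}
        self._part_by_id = []

    def intern(self, part):
        identifier = self._id_by_part.get(part)
        if identifier is None:
            # Assumption of this sketch: the next identifier is the next free index.
            identifier = len(self._part_by_id)
            self._id_by_part[part] = identifier
            self._part_by_id.append(part)
        return identifier

    def lookup(self, identifier):
        return self._part_by_id[identifier]


# Two machines that load the same strings in a different order.
node_a = PartInterner()
node_b = PartInterner()
encoded_a = tuple(node_a.intern(p) for p in ("alice", "knows", "bob"))
encoded_b = tuple(node_b.intern(p) for p in ("bob", "knows", "alice"))

print(encoded_a)                                   # (0, 1, 2)
print(node_a.intern("bob"), node_b.intern("bob"))  # 2 0
```

Comparing integers is cheaper than comparing long strings, but the identifiers carry no meaning outside the interner that produced them: the string "bob" maps to 2 on one machine and 0 on the other, so encoded triples cannot be exchanged directly. Sequential identifiers also do not preserve numeric order of the original values, which is consistent with the range-query limitation described above.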
To illustrate the drawbacks associated with present graph database systems, consider an example in which it is desired to store 20 billion triples with 5 billion unique strings, where each string has on average 25 Unicode characters (~50 bytes). Such a number, while very large, may represent only one month of telephone records in the United States, or all of the people in China, with each person associated with 15 descriptive terms (parts), each part stored in a “slot” of a certain type (e.g., float, integer, etc.). With regard to scalability, for part-to-identifier mapping, the fastest way of interning the strings is to keep the mapping in memory. In this case, a hash table or a trie (also known as a prefix tree) could be used. However, for 5,000,000,000 unique parts this means that with a simple hash table, a 7,000,000,000-entry vector on a 64-bit system (= 56 gigabytes) plus the parts themselves (= 250 gigabytes) plus the link to a unique identifier (requiring two 64-bit integers = 80 gigabytes) would be needed. The total memory requirement would thus be on the order of 386 GB, and this is only for part-to-number mapping. With some clever processing and the use of text tries, this could possibly be reduced to roughly 100 GB, but this is still a great deal of memory in which to store only a single mapping. Although it would be possible to store the part-to-identifier mapping completely on disk in a B-tree (prefix B-tree, disk trie, etc.), this means that for large numbers of parts, accessing a mapping ultimately becomes disk-bound. This can cause significant processing slowdown, since disk seeks involve computer input/output operations that are typically several orders of magnitude slower than resident memory accesses.
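The memory estimate above can be reproduced with the following short calculation. It assumes decimal gigabytes (10^9 bytes) and 8 bytes per hash-table slot on a 64-bit system; the constant names are chosen for this sketch only.

```python
GB = 10**9                      # decimal gigabyte, assumed for this estimate

UNIQUE_PARTS = 5_000_000_000    # unique strings to intern
AVG_PART_BYTES = 50             # about 25 Unicode characters at 2 bytes each
HASH_SLOTS = 7_000_000_000      # size of the hash-table vector
SLOT_BYTES = 8                  # one 64-bit slot per hash-table entry
ID_LINK_BYTES = 2 * 8           # two 64-bit integers per part

hash_vector_gb = HASH_SLOTS * SLOT_BYTES // GB
part_storage_gb = UNIQUE_PARTS * AVG_PART_BYTES // GB
id_link_gb = UNIQUE_PARTS * ID_LINK_BYTES // GB
total_gb = hash_vector_gb + part_storage_gb + id_link_gb

print(hash_vector_gb, part_storage_gb, id_link_gb, total_gb)  # 56 250 80 386
```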
Present graph database systems also present certain drawbacks with regard to sorting and finding two-dimensional data elements in a data space. For one-dimensional indexing, data structures such as B-trees can provide relatively efficient methods for sorting, and most one-dimensional data naturally sort in a linear fashion. Two-dimensional data, however, presents a much more complicated challenge. For two-dimensional indexing, several different multi-dimensional indexing algorithms exist. One such algorithm is the R-tree, which uses rectangular regions of space, each defined by two opposite corners of a rectangle or bounding box (or “minimum bounding rectangle”) that encloses the error-circle. The R-tree data structure splits space with hierarchically nested, and possibly overlapping, bounding boxes. In many applications, however, the bounding box may contain too many data points to be truly efficient; thus, R-trees generally do not give good performance under extreme scaling. Furthermore, some amount of pre-processing is required to determine the size of the bounding box. Thus, even with R-trees, sorting within a two-dimensional space is typically disadvantageous since it requires examining an overly large portion of the index. The R-tree structure, and other similar structures for two-dimensional data elements, also do not translate easily into a linear index. Efficient sorting and searching of linear indexes is a well-understood problem in computer science. The embodiments described herein exploit this by converting a two-dimensional area search into a manageable number of simple linear searches.
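The bounding-box limitation can be illustrated with a small sketch that encloses a circular search region in a rectangle defined by two opposite corners. This is not an R-tree implementation and does not represent the embodiments; the function names and sample points are hypothetical.

```python
def bounding_box(center, radius):
    """Return the two opposite corners of the box that encloses the circle."""
    cx, cy = center
    return (cx - radius, cy - radius), (cx + radius, cy + radius)


def in_box(point, box):
    (min_x, min_y), (max_x, max_y) = box
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_circle(point, center, radius):
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return dx * dx + dy * dy <= radius * radius


center, radius = (0.0, 0.0), 1.0
points = [(0.5, 0.5), (0.9, 0.9), (2.0, 0.0)]

box = bounding_box(center, radius)
candidates = [p for p in points if in_box(p, box)]
hits = [p for p in candidates if in_circle(p, center, radius)]

print(candidates)  # [(0.5, 0.5), (0.9, 0.9)]
print(hits)        # [(0.5, 0.5)]
```

The point (0.9, 0.9) lies inside the box but outside the circle, so it must be examined and then discarded. With large data sets, such extra candidates are one reason a bounding-box search may touch more of the index than the query strictly requires.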