1. Field of the Invention
The present invention is directed generally to database systems.
2. Description of the Related Art
Large-scale data management and analysis of data including complex relationships is not well supported by currently available database systems. Prior art database systems include a variety of different database designs, each having advantages over the others with respect to particular aspects of data management and analysis. Examples of prior art database systems include relational database systems, hierarchical database systems, network database systems, multidimensional database systems, graph database systems, and object database systems.
The various prior art databases may be understood with reference to the following simplistic example. In this example, a user desires to store the names of a portion of the actors in the movies “Apollo 13” and “Cast Away.” The following Table 1 includes the sample data to be stored in this example.
TABLE 1Sample DataMovieActorCharacterApollo 13Tom HanksJim LovellApollo 13Bill PaxtonFred HaiseApollo 13Kevin BaconJack SwigertApollo 13Gary SiniseKen MattinglyCast AwayTom HanksChuck Noland
Referring to FIG. 1, an exemplary data model 10 for use with a relational database system is provided. The data model 10 includes a movies table 12, a characters table 14, an actors table 16, and a roles table 18.
The movies table 12 includes two fields, “MOVIE ID” and “TITLE.” The characters table 14 includes three fields, “MOVIE ID,” “CHARACTER ID” and “CHARACTER NAME.” The actors table 16 includes two fields, “ACTOR ID” and “NAME.” The roles table 18 includes two fields, “CHARACTER ID,” and “ACTOR ID.”
The letters “PK” next to the “MOVIE ID” field of the movies table 12, the “CHARACTER ID” field of the characters table 14, and the “ACTOR ID” field of the actors table 16 indicate these fields are the primary keys of their respective tables. In the roles table 18, the fields “ACTOR ID” and “CHARACTER ID” together form a compound primary key. The value stored within the primary key uniquely identifies each row of the table. In this example, each of the primary keys may be, for example, a unique integer value.
The letters “FK1” next to the “MOVIE ID” field of the characters table 14 indicate it is a foreign key. A foreign key links the values stored within the field to a value in the primary key of another table. In this case, as indicated by arrow 22, the “MOVIE ID” field of the characters table 14 is linked to the “MOVIE ID” field of the movies table 12.
The roles table 18 links actors to characters without duplicating the entry of either an actor or a character into the tables of the data model 10. In other words, the roles table 18 merely defines the relationship between the actors and characters without storing or providing any additional data. The letters “FK1” next to the “CHARACTER ID” field indicate it is a foreign key. As indicated by arrow 24, the foreign key “CHARACTER ID” of the roles table 18 is linked to the primary key “CHARACTER ID” of the characters table 14. The letters “FK2” next to the “ACTOR ID” field indicate it is a foreign key. As indicated by arrow 26, the foreign key “ACTOR ID” of the roles table 18 is linked to the primary key “ACTOR ID” of the actors table 16.
FIGS. 2-5 provide data views of the movies table 12, the characters table 14, the actors table 16, and the roles table 18, respectively, with the data of Table 1 stored therein. Specifically, table 30 of FIG. 2 provides a data view of the data stored in the movies table 12. Table 32 of FIG. 3 provides a data view of the data stored in the characters table 14. Table 34 of FIG. 4 provides a data view of the data stored in the actors table 16. Table 36 of FIG. 5 provides a data view of the data stored in the roles table 18.
Relational database systems are the most widely adopted standard for Online Transaction Processing (OLTP). They excel at processing tabular data and supporting pre-defined relationships between data elements stored within tables. Further, relational database systems are extremely efficient with locality on the storage layer when persisted to disk. However, structuring queries to extract the desired information for analysis using the pre-defined relationships requires a great deal of skill. For example, to determine the actors appearing in the movie “Apollo 13,” a query would have to join all four of the tables within the data model 10. In some cases, the time required to execute complex queries may be substantial. Further, if the nature of the relationships within the database changes, the tables and the relationships between them may require a time consuming restructuring operation.
FIG. 6 provides a block diagram illustrating the sample movie data stored in a tree-like structure 100 of a hierarchical database system. Within a hierarchical database, each node 110, 114, 118, 122, 126, 130, 134, 138, 142, 146, and 150 stores a single data element and is linked to at most one parent node. However, each node may have multiple child nodes. For example, node 114 is linked to a child node 122 by link 120, a child node 126 by link 124, and a child node 130 by link 128. However, notice the data element “Tom Hanks” is duplicated at nodes 134 and 150. FIG. 7 provides a view of the data stored in the tree-like structure 100 extracted by a traversal of the tree. Because the nodes are organized into a well-known tree-like structure 100, many well-known algorithms for processing and traversing trees may be used to process the data.
Hierarchical database systems excel at processing structured data naturally falling into a hierarchy. Hierarchical database systems may also be extremely fast at traversing discrete data because the structure leverages well-known and optimized tree traversal algorithms. Hierarchical database systems may be used in time-critical high performance applications.
FIG. 8 is a block diagram illustrating the sample movie data stored in a network structure 200 of a network database system is provided. Network database systems are similar to hierarchical database systems; however, the child data elements of a network database system may be linked to multiple parent nodes. Because a node may be linked to multiple parent nodes, the network structure 200 avoids the duplication of the node storing the name “Tom Hanks.” Specifically, the node 210, which stores the name “Tom Hanks,” is linked to node 216 (character “Jim Lovell”) by link 212 and node 218 (character “Chuck Noland”) by link 214. Because the nodes are organized into a network structure 200, instead of the tree-like structure 100, well-known and optimized network algorithms may be used to analyze the data stored therein.
FIG. 9 provides a conceptual illustration of the sample movie data stored in a multi-dimensional structure 300 of a multidimensional database system system. The data elements are addressed according to their corresponding values along a first axis 310 labeled, in this case, “MOVIES” and a second axis 320 labeled, in this case, “Characters.” For illustrative purposes, the first axis 310 has been labeled the x-axis and the second axis 320 the y-axis. The contents of each location 322 (i.e., unique pair of values along the x-axis and y-axis) is an actor who played a particular character selected along the y-axis in a particular movie selected along the x-axis.
Additional axes may be added to increase the number of dimensions of the multi-dimensional structure 300. For example, a third axis 330 labeled, in this case, “TIME” may be added to the multi-dimensional structure 300. The third axis 330 may be labeled the z-axis and may signify time or versions (e.g., remakes) of a particular movie and consequently the combination of a particular movie and a particular character may be used to index more than one actor along the z-axis. For example, the Shakespearean play “Hamlet” has been made into several movies. To determine which actors have played the lead character Hamlet, one need only select the movie “Hamlet,” and the character “Hamlet” for all values of the z-axis. While the multidimensional database system depicted in FIG. 9 has three dimensions, any number of dimensions may be used to construct the multi-dimensional structure 300.
Obviously, not all movies have been remade or have alternate versions. Consequently, many multidimensional database systems suffer from a large number of vacant cells when the data is sparse. Further, a multidimensional database system does not provide a mechanism by which relationships within the data may be determined other than via indexing along each of the dimensions. While multidimensional databases are an excellent option for multidimensional analysis, they lack navigational associative search capabilities similar to those found in graph database systems. On the other hand, multidimensional database systems are extremely effective at Online Analytical Processing (OLAP) and related knowledge discovery mechanisms.
FIG. 10 provides block diagram illustrating the sample movie data stored in a graph structure 400 of a graph database system. A graph database system organizes data into a graph of linked nodes. Generally, the links are referred to as edges.
To create the graph structure 400, a head node 402 for the tables is created. A node is then created for each of the tables of the data model 10 of the relational database system (see FIG. 1) and each node is linked to the head node 402. Specifically, a node 412 is created to correspond to the movies table 12 of FIG. 1 and linked to the head node 402 by a link 413, a node 416 is created to correspond to the characters table 14 of FIG. 1 and linked to the head node 402 by a link 417, and a node 430 is created to correspond to the actors table 16 of FIG. 1 and linked to the head node 402 by a link 431. However, in this example, a node was not created for the roles table 18 of FIG. 1 because that table merely relates the actors table 16 and the characters table 14 which may be accomplished in the graph structure 400 with links.
A node is then created for each record within the movies table 12 (i.e., each instance of a movie within the table). In this example, a node 440 is created for the record corresponding to “Apollo 13” and linked by a link 442 to the “MOVIES” node 412. A node 444 is created for the record corresponding to “Cast Away” and linked by a link 446 to the “MOVIES” node 412. Each of these nodes may be linked to the underlying data, i.e., “MOVIE ID” and “TITLE” illustrated in the movies table 12 of FIG. 1.
A node is then created for each record within the characters table 14 of FIG. 1. For illustrative purposes, FIG. 10 includes only the characters played by Tom Hanks. Consequently, the graph structure 400 includes a node 450 corresponding to the character “Jim Lovell.” The node 450 is linked to the “CHARACTERS” node 416 by a link 452. The graph structure 400 also includes a node 454 corresponding to the character “Chuck Noland,” which is linked to the “CHARACTERS” node 416 by a link 456.
The foreign keys of the relational database system are replaced with links. For example, the foreign key linking the characters to the movies is replaced by a link 460 linking the node 450 (“Jim Lovell”) to the node 440 (“Apollo 13”) and a link 462 linking the node 454 (“Chuck Noland”) to the node 444 (“Cast Away”). The links 460 and 462 are labeled with the term “MOVIE” because the nature of the relationship between linked nodes may be ambiguous. The label defines or specifies the relationship between the nodes. In this case, the label specifies the characters of nodes 450 and 454 appear within the movies of nodes 440 and 444, respectively.
Then, a node is created for each record within the actors table 16 of FIG. 1. In FIG. 10, for illustrative purposes, only the actor “Tom Hanks” has been included. Consequently, the graph structure 400 includes a node 470 corresponding to the actor “Tom Hanks.” The node 470 is linked to the “ACTORS” node 430 by a link 472. The foreign key linking the characters to the actors in FIG. 1 is replaced by a link 476 linking the node 450 (“Jim Lovell”) to the node 470 (“Tom Hanks”) and a link 474 linking the node 454 (“Chuck Noland”) to the node 470 (“Tom Hanks”). The label “ACTOR” on the links 474 and 476 specifies the actor of the node 470 plays the characters of nodes 450 and 454.
In this example, the graph structure 400 of the graph database system preserves both the original table structure, linking each record within a table with a context node (e.g., nodes 412, 416, and 430) and the original foreign key relationships (e.g., links 460, 462, 474, and 476).
Graph database systems excel at structural queries and identifying relationships between nodes using its associative search capabilities. Graph database systems are typically used in data mining applications, but suffer from ambiguous connections created in highly dimensional and interconnected data. Graph database systems also encounter locality problems at the storage layer when persisted to disk.
Finally, custom data structures may be used to construct object database systems. However, such object database systems are typically built for specialized uses and are ill suited for generalized applications.
While each of these database systems provides a partial solution, none of them can perform all data manipulation tasks with equal efficiency. For this reason, many users will use more than one database system to process the same source data thereby leveraging the benefits of each system to perform a desired analysis. For example, a relational database may be used to store the source data, a multidimensional database or object database may be used to analyze the source data, and a graph database may be used to discover how the individual data elements interrelate with one another. Consequently, a need exists for a database system that can efficiently address each of these needs. Prior art database systems are being pushed past their design limits as the demand for data mining of continuous data streams increases. Consequently, a need exists for database systems capable of analyzing source data to extract useful information for that source data in a timely and efficient manner. The present application provides these and other advantages as will be apparent from the following detailed description and accompanying figures.