Data relationships are often modeled as a graph with the vertices of the graph representing entities or non-entities, such as literals, and the edges of the graph representing relationships between the entities or literals. The vertices representing entities are referred to as entity vertices, and the vertices representing non-entities are referred to as literal vertices or non-entity vertices. Each entity vertex has an entity type indicating the type of entity that the entity vertex represents and an entity identifier uniquely identifying that entity vertex, each literal vertex has a value that is a literal, and each edge has an edge type. For example, in the medical field, a graph may have entity vertices representing people, procedures (e.g., appendectomies), hospitals, prescriptions, and so on. If a physician, John Smith, has written 100 prescriptions, then the graph may contain one entity vertex for John Smith with the entity type of person and an entity identifier that is a unique identifier, such as “John Smith” or his Social Security number. The graph also contains one entity vertex for each of the prescriptions with the entity type of prescription and an entity identifier that uniquely identifies the prescription, such as a globally unique Rx number. The graph also contains 100 directed edges with each edge connecting an entity vertex representing a prescription to the entity vertex representing John Smith. Each edge may have the edge type of “prescribed by” to indicate that John Smith prescribed that prescription. The graph may also include a literal vertex with a literal of “Jan. 1, 1964” and include an edge with an edge type of “date of birth” that connects the entity vertex representing John Smith to that literal vertex, indicating that John Smith's date of birth is Jan. 1, 1964.
As another example, in the electronic commerce field, a graph may have entity vertices representing customers, web sites, products, purchases, and so on. If a customer visits a web site, the entity vertex representing that customer may have an edge going to the entity vertex representing that web site with an edge type of “visited.”
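The graph described above for John Smith may be sketched in code as follows. This is a minimal illustration only, not a prescribed implementation: the class names (EntityVertex, LiteralVertex, Edge, Graph), the add_edge helper, and the “Rx-1” through “Rx-100” identifiers are all hypothetical and are chosen solely to mirror the terminology of the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityVertex:
    entity_type: str   # e.g., "person" or "prescription"
    entity_id: str     # uniquely identifies the entity vertex


@dataclass(frozen=True)
class LiteralVertex:
    value: str         # the literal itself, e.g., a date


@dataclass(frozen=True)
class Edge:
    source: EntityVertex
    edge_type: str     # e.g., "prescribed by"
    target: object     # an EntityVertex or a LiteralVertex


@dataclass
class Graph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, source, edge_type, target):
        # Adding an edge also registers both of its endpoints.
        self.vertices.update((source, target))
        self.edges.append(Edge(source, edge_type, target))


graph = Graph()
john = EntityVertex("person", "John Smith")

# One entity vertex per prescription, each with a directed edge to John Smith.
# The "Rx-n" identifiers are made up for illustration.
for n in range(1, 101):
    rx = EntityVertex("prescription", f"Rx-{n}")
    graph.add_edge(rx, "prescribed by", john)

# A literal vertex holds the non-entity date-of-birth value.
graph.add_edge(john, "date of birth", LiteralVertex("Jan. 1, 1964"))

prescribed = [e for e in graph.edges
              if e.edge_type == "prescribed by" and e.target == john]
```

In this sketch the graph holds 102 vertices (John Smith, the 100 prescriptions, and one literal vertex) and 101 edges, matching the counts in the example.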
Data relationships can also be modeled as a property graph with the vertices of the graph representing only entities and with non-entity information stored as attributes or properties of an entity vertex. For example, the entity vertex representing John Smith would have a date of birth property with the value of “Jan. 1, 1964.” In the following, the term “graph” encompasses both a property graph and a non-property graph. A non-property graph is a graph that includes non-entity vertices for storing non-entity information. Also, the context surrounding the use of the term “graph” will make it clear whether a non-property graph or a property graph is being described.
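The difference between the two representations may be illustrated by folding the literal-valued edges of an entity into properties of its vertex. The following is a sketch under stated assumptions: the PropertyVertex class and the to_property_vertex helper are hypothetical names, and the sketch assumes at most one literal per edge type for a given entity (a later value for the same edge type would overwrite an earlier one).

```python
from dataclasses import dataclass, field


@dataclass
class PropertyVertex:
    entity_type: str
    entity_id: str
    # Non-entity information is stored on the vertex, not in literal vertices.
    properties: dict = field(default_factory=dict)


def to_property_vertex(entity_type, entity_id, literal_edges):
    """Fold (edge_type, literal) pairs into properties of a single vertex.

    Assumes one literal per edge type; duplicates overwrite earlier values.
    """
    vertex = PropertyVertex(entity_type, entity_id)
    for edge_type, literal in literal_edges:
        vertex.properties[edge_type] = literal
    return vertex


# In a non-property graph, "date of birth" is an edge to a literal vertex;
# in a property graph, it becomes a property of the John Smith vertex.
john = to_property_vertex(
    "person", "John Smith", [("date of birth", "Jan. 1, 1964")]
)
```

Edges between two entity vertices, such as “prescribed by,” would remain edges in either representation.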
Large collections of data can store data for hundreds of different types of entities with hundreds of different types of relationships. These collections can store data for millions of entities. In the medical field example, a collection may store information for 100 million people, 200 million prescriptions, and 500 million office visits, and the relationships between the entities may number in the billions. The graph corresponding to these collections may have billions of entity vertices, even more literal vertices, and billions of edges.
Collections of data representing relationships between entities, such as these large collections, may be referred to as instances of semantic data models that allow the relationships to be modeled as “facts.” The facts are often represented as triples that have a subject, a predicate, and an object. For example, one triple may have the subject of “John Smith,” the predicate of “is-a,” and the object of “physician,” which may be represented as
        <John Smith, is-a, physician>.
This triple represents the fact that John Smith is a physician. Another triple may be
        <John Smith, graduate of, University of Washington>
representing the fact that John Smith graduated from the University of Washington. Yet another triple is
        <John Smith, degree, MD>
representing the fact that John Smith has an MD degree. The fact that prescription P1 was prescribed by John Smith may be represented by the following triple:
        <P1, prescribed by, John Smith>.
With such a semantic model, the subjects are entities, the predicates are relationships, and the objects are entities or literals representing attributes of entities, referred to as entity objects and literal objects. For example, John Smith and the University of Washington are entities, and “physician” and “MD” are attributes of John Smith. In a graph model of a collection of triples, the graph contains an entity vertex for each entity, a literal vertex for each attribute value, and an edge for each triple. The edge representation of a triple connects the entity vertex corresponding to the subject to the entity vertex or literal vertex corresponding to the object, with the edge having an edge type corresponding to the predicate.
Some collections of data may store triples that are from different sources and that correspond to different graphs. For example, one collection may have triples collected from the U.S. Centers for Disease Control and Prevention (“CDC”) and from various medical insurance companies. For such a collection, each triple may be part of a quadruple that includes an identifier of the source that represents a separate graph. For example, if the data from Blue Cross and the CDC both contain the following triple
        <P1, prescribed by, John Smith>,
then the collection may contain the following quadruples
        <CDC, P1, prescribed by, John Smith> and
        <BC, P1, prescribed by, John Smith>.
The first element in the quadruples identifies the graph of which the corresponding triple is part. The triples for Blue Cross and the triples for the CDC represent separate graphs stored as part of the same collection.
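The mapping from quadruples to separate graphs may be sketched as follows. The Quad record, its obj_is_literal flag, and the build_graphs function are hypothetical names introduced for illustration; in particular, the sketch assumes that whatever loads the data already knows whether an object is an entity or a literal. The attribution of the “degree” fact to the Blue Cross graph is likewise made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    graph_id: str                 # first element: identifies the source graph
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False  # assumption: supplied by the data loader


def build_graphs(quads):
    """Group quadruples by graph identifier and build one graph per source.

    Each graph gets an entity vertex per entity, a literal vertex per
    attribute value, and one edge per triple, typed by the predicate.
    """
    graphs = defaultdict(
        lambda: {"entities": set(), "literals": set(), "edges": []}
    )
    for q in quads:
        g = graphs[q.graph_id]
        g["entities"].add(q.subject)  # subjects are always entities
        target_set = g["literals"] if q.obj_is_literal else g["entities"]
        target_set.add(q.obj)
        g["edges"].append((q.subject, q.predicate, q.obj))
    return dict(graphs)


quads = [
    Quad("CDC", "P1", "prescribed by", "John Smith"),
    Quad("BC", "P1", "prescribed by", "John Smith"),
    # Hypothetical placement of this fact in the Blue Cross graph.
    Quad("BC", "John Smith", "degree", "MD", obj_is_literal=True),
]
graphs = build_graphs(quads)
```

Here the same triple appears in both the CDC graph and the Blue Cross graph, and each graph keeps its own vertices and edges even though both are stored in one collection.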
Semantic data models can be used to model the relationships between any pair of entity types, such as web pages, people, companies, products, meetings, celestial objects, proteins, purchases, telephone numbers, telephone calls, and so on. One semantic data model, referred to as the Resource Description Framework (“RDF”), has been developed by the World Wide Web Consortium (“W3C”) to model web resources, but it can be used to model any type of resource. The triples of a semantic data model may be stored in a semantic database that may include a fact table containing the triples representing the facts.
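A fact table of this kind may be sketched with a three-column relational table. The example below uses an in-memory SQLite database purely for illustration; the table name “facts” and its column names are assumptions, and a semantic database may organize its fact table quite differently.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts ("
    "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
)

# One row per triple (fact).
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("John Smith", "is-a", "physician"),
        ("John Smith", "graduate of", "University of Washington"),
        ("John Smith", "degree", "MD"),
        ("P1", "prescribed by", "John Smith"),
    ],
)

# Retrieve every fact whose subject is John Smith.
rows = conn.execute(
    "SELECT predicate, object FROM facts WHERE subject = ? ORDER BY predicate",
    ("John Smith",),
).fetchall()
```

The query returns the three facts about John Smith; the fact about prescription P1 is excluded because John Smith is its object rather than its subject.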
These large collections of data may contain a wealth of information. Various data mining techniques have been developed to assist in extracting useful information from these large collections. The data mining techniques may use classification and clustering techniques (e.g., Bayesian classification), pattern analysis techniques, and so on. If a researcher “understands” the data of a collection, the researcher could select the appropriate data mining tool to extract useful information, realize that the collection lacks sufficient data and augment the collection, realize that two separate collections are incompatible in some way and try to make them compatible, and so on. But without this “understanding,” the researcher is at a disadvantage and may spend significant time and computational resources extracting information that may not be particularly useful.