1. Field of the Invention
The present invention relates to a data model and associated operators. More particularly, the present invention relates to a sheaf data model including a base set and a corresponding graph representing the inclusions of the base set, and associated operators which operate on the sheaf data model.
2. Discussion of the Background
A data model is a theory for describing computer data. The term was introduced by E. F. Codd in the early 1970s to describe the relationship between previous approaches to data management and a relational data model which he had just introduced. Formally, a data model specifies three things: 1) a class of mathematical objects which are used to model data; 2) the operations on those objects; and 3) the constraints between the objects that must be satisfied in a valid database.
The purpose of a data model is to serve as a basis for analysis, design, and implementation of database management systems (DBMS). That is, a DBMS will implement in software (or sometimes in hardware) the operations of the model which allow clients of the system to store and manipulate their data as instances of the objects of the model.
Currently all major DBMSs, such as the ones sold under the trademarks ORACLE, INFORMIX and SYBASE, are based on some form of the relational model. To the commercial data management industry, data management is essentially indistinguishable from relational database management system (RDBMS) technology.
In the relational data model, the mathematical objects are relations on domains and the operations are given by relational algebra. The terms relation, domain and relational algebra have detailed, rigorous definitions in mathematics. However, these terms can be understood through a widely used table analogy, which will be described with reference to the tables shown in FIGS. 1a-1c and 2a-2g.
A mathematical set is any collection of objects, entities, etc. A domain is a set of values that can be directly represented on the computer, in other words a computer data type. Three very common domains are integer numbers, real numbers, and character strings. Referring to FIG. 1a, a domain 7 is a table 8 with a single column 17 listing all possible values 9 in the domain 7. A name 11 of the domain 7 is a column heading. The number of values in the domain 7 has been selected to be very small to make the table easy to draw. In practice, the number of values is much larger.
FIG. 1b illustrates a table 10 representing a binary Cartesian product of two sets A and B. The table 10 includes all possible pairs (a,b), where a is a member of set A and b is a member of set B. As shown, the table 10 includes two columns 13 and 15, one for set A and one for set B. FIG. 1b shows the Cartesian product of the domain TINY_INT with itself. Each row in the table 10 includes a pair of values and hence corresponds to a member of the Cartesian product set. Each column 13, 15 corresponds to one of the factors in the product.
In addition, the Cartesian product can be extended to more than just two factors. The n-ary Cartesian product A×B×C× . . . (n factor sets) is a table with n columns, one for each factor. Each row contains n values, one from each one of the factors. In addition, there is a row in the table for each possible combination of values. Each row is called an n-tuple and the n-ary Cartesian product is the set of all such n-tuples.
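The table analogy for the Cartesian product can be illustrated with a minimal Python sketch. The values assigned to TINY_INT below are a hypothetical stand-in, since the contents of FIG. 1a are not reproduced in the text, and the variable names are chosen only for illustration.

```python
from itertools import product

# Hypothetical stand-in for the TINY_INT domain (actual figure values are assumed).
TINY_INT = [0, 1, 2]

# Binary Cartesian product TINY_INT x TINY_INT: one row per possible pair (a, b).
binary_product = list(product(TINY_INT, TINY_INT))

# n-ary product with n = 3: one column per factor, one 3-tuple per combination.
ternary_product = list(product(TINY_INT, repeat=3))

print(len(binary_product))   # number of rows in the two-column table
print(len(ternary_product))  # number of rows in the three-column table
```

With three values in the domain, the two-column table has 3 × 3 rows and the three-column table has 3 × 3 × 3 rows, which shows how quickly the full product grows as factors are added.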
FIG. 1c illustrates a table 12, which is a subset of the Cartesian product set shown in table 10 (see FIG. 1b). Table 12 includes the same column headings as table 10, but only some of the rows of table 10. Table 12 is referred to as a relation because the subset is selected to represent all the pairs satisfying a predetermined relationship between the two columns 13 and 15. In more detail, FIG. 1c illustrates the relation LESS-THAN-OR-EQUAL, in which the value in column 13 of a given row is less than or equal to the value in column 15 of the same row.
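The idea that a relation is a subset of a Cartesian product can be sketched as follows. The domain values are again a hypothetical stand-in for TINY_INT, and the name `less_than_or_equal` is illustrative only.

```python
from itertools import product

# Hypothetical stand-in for the TINY_INT domain.
TINY_INT = [0, 1, 2]

# The full Cartesian product: every possible pair of values.
all_pairs = set(product(TINY_INT, repeat=2))

# The relation keeps only the rows whose first value is <= the second value.
less_than_or_equal = {(a, b) for (a, b) in all_pairs if a <= b}

for row in sorted(less_than_or_equal):
    print(row)
```

Choosing a different predicate over the same two columns would yield a different relation instance of the same relation type.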
A relation schema or relation type is a list of column headings for the table or equivalently a list of factors in the Cartesian product which the relation is a subset of. There are many different possible subsets of the rows of a given Cartesian product set and hence there are many possible relations for a given relation type. The term “relation instance” is used to refer to a specific subset of the rows of a given relation type.
Applications are often analyzed for database purposes using the dual notions of entity and relationship. An entity is any thing or object in the real world which is distinguishable from all other objects. Entities have attributes. An attribute is a named property that takes its value from some domain. An entity is represented by its set of attribute values and the attribute values identify the entity and describe its state. A relationship is an association between entities.
When the relational model is used to store application data, the application data is typically organized so that a relation represents either an entity in the application or a relationship between entities. FIGS. 2a-2g illustrate an example of a relational model directed to a personnel application including an EMPLOYEE table 14 and a MANAGED_BY table 24 (see FIGS. 2a and 2b). The EMPLOYEE table 14 shown in FIG. 2a is an entity table. Each row in table 14 represents an entity (i.e., an employee) and the columns in table 14 represent attributes of the entity (i.e., an employee_id 16, name 18, job title 20, and salary 22).
The MANAGED_BY table 24 shown in FIG. 2b is a relation corresponding to a relationship between workers and managers. That is, each row in table 24 represents a relationship between two employees, one a manager of the other. The columns in table 24 include the ids 16 of the relevant employees and a manager_id 26.
In addition, because an entity is any thing or object, an attribute value can also be considered as an entity. For example, a name serving as an attribute value of an entity EMPLOYEE may also be considered an entity. Thus, an entity-attribute association can be considered as a relationship between two entities, a primary entity and the attribute entity. This fundamental relationship is referred to as a HAS_A relationship, which is built into the relational data model. That is, the HAS_A relationship is directly represented by the relationship between a table and its columns. Other relationships, such as the MANAGED_BY relationship shown in FIG. 2b, must be represented by additional tables.
Further, a large number of operations may be performed on relations. The operations receive one or more relations (i.e., tables) as an input and produce a relation as an output. The operations are not all independent of each other. That is, some operations can be implemented using other operations. Six fundamental operators in the relational algebra include: 1) Cartesian product, 2) selection, 3) projection, 4) union, 5) intersection, and 6) rename. The Cartesian product operator has been discussed with reference to FIG. 1b. A description of the other five operators will now be given with reference to FIGS. 2c-2f. 
The selection operator receives a table (i.e., a relation) and a row selection condition as an input and outputs a table containing only the rows that match the selection condition. For example, the command “SELECT rows with SALARY>=$100,000 in relation EMPLOYEE” returns a table 28 shown in FIG. 2c. Note the table 28 in FIG. 2c does not have a name. The rename operator (discussed below) allows a table to be named. However, in some instances the table produced by an operator is a temporary result to be used only as input to another operator. In these instances there is no need for the table to have a name.
Another result of a selection operation is shown in FIG. 2d, in which the command “SELECT rows with TITLE=Programmer in relation EMPLOYEE” is executed. As shown, the resulting table 30 includes only the rows with the title “Programmer.”
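The selection operator can be sketched in Python as a row filter. The EMPLOYEE rows below are hypothetical, since the contents of FIG. 2a are not reproduced in the text, and the function name `select` is an assumption made for illustration.

```python
# Hypothetical EMPLOYEE rows; the actual rows of FIG. 2a are assumed, not quoted.
EMPLOYEE = [
    {"EMPLOYEE_ID": 1, "NAME": "Ada", "TITLE": "Programmer", "SALARY": 120_000},
    {"EMPLOYEE_ID": 2, "NAME": "Ben", "TITLE": "Programmer", "SALARY": 80_000},
    {"EMPLOYEE_ID": 3, "NAME": "Cy", "TITLE": "Manager", "SALARY": 150_000},
]

def select(relation, condition):
    """Return a new table holding only the rows that satisfy the condition."""
    return [row for row in relation if condition(row)]

# "SELECT rows with SALARY>=$100,000 in relation EMPLOYEE"
high_paid = select(EMPLOYEE, lambda row: row["SALARY"] >= 100_000)

# "SELECT rows with TITLE=Programmer in relation EMPLOYEE"
programmers = select(EMPLOYEE, lambda row: row["TITLE"] == "Programmer")

print([row["NAME"] for row in high_paid])
print([row["NAME"] for row in programmers])
```

Both results keep every column of EMPLOYEE; only the set of rows changes.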
The projection operator is similar to the selection operator, except it works on columns. That is, the projection operator receives a table and a column selection condition, typically a list of column names, as an input and outputs a table including only the selected columns. In addition, because two rows may differ only in a column not selected by the projection operation, the resulting table may include duplicate rows. In this instance, only one of the duplicate rows is retained, and the others are discarded. FIG. 2e illustrates a result of the projection operation, in which the command “PROJECT columns named NAME in relation EMPLOYEE” is executed. As shown, the projection operation produces a table 32 including all of the employees' names.
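The duplicate handling in projection can be sketched as follows. The EMPLOYEE rows and the function name `project` are hypothetical; projecting on TITLE rather than NAME is chosen here only because it makes the duplicate removal visible.

```python
# Hypothetical EMPLOYEE rows; the actual rows of FIG. 2a are assumed, not quoted.
EMPLOYEE = [
    {"EMPLOYEE_ID": 1, "NAME": "Ada", "TITLE": "Programmer", "SALARY": 120_000},
    {"EMPLOYEE_ID": 2, "NAME": "Ben", "TITLE": "Programmer", "SALARY": 80_000},
    {"EMPLOYEE_ID": 3, "NAME": "Cy", "TITLE": "Manager", "SALARY": 150_000},
]

def project(relation, columns):
    """Keep only the named columns and discard rows that become duplicates."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[column] for column in columns)
        if key not in seen:  # retain only the first of any duplicate rows
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

names = project(EMPLOYEE, ["NAME"])    # one row per employee name
titles = project(EMPLOYEE, ["TITLE"])  # two Programmer rows collapse into one

print(names)
print(titles)
```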
The union operator receives two tables as an input and outputs a table including all the rows in either of the input tables. In addition, the union operator can only be used on tables which both have the same relation type (column headings). For example, FIG. 2f illustrates a resultant table 34 from a union operator of the tables shown in FIGS. 2c and 2d. The table 34 is produced by executing the command “UNION relation Table 6 with relation Table 7.” The references to Tables 6 and 7 respectively refer to the tables shown in FIGS. 2c and 2d. 
The intersection operator receives two tables as an input and outputs a table containing all rows that were the same in both tables. Similar to the union operator, the intersection operator can only be used on tables which both have the same relation type. For example, FIG. 2g illustrates a resultant table 36 from an intersection operation of the tables shown in FIGS. 2c and 2d, in which the command “INTERSECT relation Table 6 with relation Table 7” is executed.
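The union and intersection operators, including the requirement that both inputs share a relation type, can be sketched as set operations. The `Relation` class, the function names, and the sample rows are all hypothetical and are not taken from the figures.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    columns: tuple   # the relation type (column headings)
    rows: frozenset  # row tuples; a set cannot hold duplicate rows

def _require_same_type(a, b):
    if a.columns != b.columns:
        raise ValueError("both relations must have the same relation type")

def union(a, b):
    """All rows that appear in either input table."""
    _require_same_type(a, b)
    return Relation(a.columns, a.rows | b.rows)

def intersect(a, b):
    """Only the rows that appear in both input tables."""
    _require_same_type(a, b)
    return Relation(a.columns, a.rows & b.rows)

COLUMNS = ("EMPLOYEE_ID", "NAME", "TITLE", "SALARY")
# Hypothetical results of the two selection commands described earlier.
high_paid = Relation(COLUMNS, frozenset({
    (1, "Ada", "Programmer", 120_000),
    (3, "Cy", "Manager", 150_000),
}))
programmers = Relation(COLUMNS, frozenset({
    (1, "Ada", "Programmer", 120_000),
    (2, "Ben", "Programmer", 80_000),
}))

either = union(high_paid, programmers)
both = intersect(high_paid, programmers)
print(sorted(either.rows))
print(sorted(both.rows))
```

Representing rows as a set makes duplicate elimination automatic, which is one reasonable design choice among several.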
The above-noted operators all produce nameless tables. However, a table must have a name if it is to be later referred to. The rename operator may be executed to perform this function.
The set of operators described above is a primitive set of operators. That is, the set is a minimal set of operations from which other more convenient operations can be built. Practical relational database systems implement a number of other operators, which for simplicity purposes are not described herein.
A database for a particular application is designed by choosing a set of relation types that represent the entities and relationships in the application. This collection of relation types is called the database schema. The details of the mathematics of the relation model place a number of constraints on the relation types in the database schema. A database schema that satisfies these constraints is said to be in normal form and the process of reshaping a candidate database schema design to meet the requirements of the normal form is called normalization. The net effect of normalization is typically to scatter the attributes of an entity across many different tables.
The constraints of the normal form are organized into various stages, such as first normal form, second normal form, etc. The first normal form requires each column in a table to contain atomic data. That is, the domain associated with the column must be some predefined, preferably fixed size type value such as an integer. This is because the relational operations deal only with the table structure and cannot deal with any internal structure associated with the data within a given cell in the table.
The most infamous type of non-atomic data is the array. Frequently, the most natural interpretation of the application entity is that it has an attribute which is a variable length collection. For instance, an attribute for an employee might be “skills,” a variable length array of skill keywords. However, this attribute would constitute a non-atomic attribute and hence is forbidden. Typically, the atomic attribute requirement forces the creation of additional tables, such as an EMPLOYEE_SKILLS table, which would cross-reference employee entities to skill entities. In many applications this is an entirely acceptable approach. However, in several instances (discussed below) this type of processing is unacceptable.
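The effect of the atomic attribute requirement can be sketched by splitting a variable length “skills” attribute into a separate cross-reference table. The employee rows, skill keywords, and the helper `skills_of` are hypothetical.

```python
# Natural but non-atomic form: SKILLS is a variable length collection.
employees_nested = [
    {"EMPLOYEE_ID": 1, "NAME": "Ada", "SKILLS": ["python", "sql"]},
    {"EMPLOYEE_ID": 2, "NAME": "Ben", "SKILLS": ["sql"]},
    {"EMPLOYEE_ID": 3, "NAME": "Cy", "SKILLS": []},
]

# First normal form: EMPLOYEE keeps only atomic columns...
EMPLOYEE = [
    {"EMPLOYEE_ID": e["EMPLOYEE_ID"], "NAME": e["NAME"]} for e in employees_nested
]

# ...and EMPLOYEE_SKILLS cross-references employees to skills, one pair per row.
EMPLOYEE_SKILLS = [
    (e["EMPLOYEE_ID"], skill) for e in employees_nested for skill in e["SKILLS"]
]

def skills_of(employee_id):
    """Reassemble the collection by scanning the cross-reference table."""
    return [skill for (eid, skill) in EMPLOYEE_SKILLS if eid == employee_id]

print(EMPLOYEE_SKILLS)
print(skills_of(1))
```

Retrieving what was originally a single attribute now requires a lookup across a second table, which is the cost the text refers to.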
The relational data model was a radical departure from previous data management approaches because it is a mathematical model. Previous ad hoc approaches had mostly focused on how data was to be stored and described how to access the data in terms of how it was stored. This limited the types of queries that could be made and generated massive software maintenance problems whenever the data storage was reorganized.
The relational data model instead described data in terms of abstract mathematical objects and operations. The mathematical abstraction separated how data was accessed from how it was actually stored. Furthermore, the mathematics ensured that the relational algebra was a complete set of query operators. That is, any query within the universe of possible queries defined by the model could be generated by a suitable combination of the fundamental relational algebra operators.
The mathematical abstraction and completeness of the relational algebra meant that sophisticated query processors could be implemented as independent subsystems, without knowledge of the application. This arguably created the database management system as a commercial product and unquestionably revolutionized the database industry.
In spite of the overwhelming success of the relational data model, not all application areas are well served by the model. A first application which is not well suited for the relational model is an application which deals with spatial data. There are a wide variety of applications using data that is spatial or geometric in nature. For example, computer aided design and manufacturing (CAD/CAM) and geographic information systems (GIS) are two well known commercially important examples.
A main focus of systems that deal with spatial data is the need to represent spatial decomposition. For example, in design data, the decomposition into systems, subsystems, and parts is a spatial decomposition. Similarly, in geographical data, the decomposition into states, counties, and cities is a spatial decomposition. Furthermore, these applications frequently exhibit multiple, concurrent decompositions. For instance, geographic systems must represent both physical boundaries and political boundaries.
At the finest level of decomposition, spatial data includes collections of geometric primitives and the topological relationships between the primitives. Geometric primitives include simple geometric shapes like points, lines and polygons, as well as a wide and constantly growing number of mathematically more sophisticated primitives, such as non-uniform-rational-B-splines (NURBS). The topological relationships describe how these geometric patches are connected to form complex structures.
It has long been understood that the relational model is a poor choice for representing spatial data. There are at least two fundamental issues. First, it is difficult to represent the decomposition relationships, especially the topological relationships, in a natural and efficient way. For instance, a polygon has a collection of edges (i.e., a HAS_A relationship) which is naturally represented as an attribute of the polygon entity. However, the first normal form prohibits such variable length collections as attributes. On the other hand, representing the topological relationships in separate relationship tables requires complex, possibly recursive, and frequently inefficient queries to retrieve all the parts of a geometric primitive. Second, the operations of the relational algebra are not well suited to natural spatial queries, such as nearness queries and region queries.
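The recursive retrieval described above can be illustrated with a small sketch using the SQLite engine from the Python standard library. The HAS_PART table, the part names, and the use of a recursive common table expression are assumptions made for illustration and are not described in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical relationship table: each row says "parent HAS_A child".
conn.execute("CREATE TABLE HAS_PART (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO HAS_PART VALUES (?, ?)",
    [
        ("polygon_1", "edge_a"), ("polygon_1", "edge_b"), ("polygon_1", "edge_c"),
        ("edge_a", "point_1"), ("edge_a", "point_2"),
        ("edge_b", "point_2"), ("edge_b", "point_3"),
        ("edge_c", "point_3"), ("edge_c", "point_1"),
    ],
)

# Collecting every part of the polygon means following HAS_PART repeatedly.
query = """
WITH RECURSIVE parts(name) AS (
    SELECT child FROM HAS_PART WHERE parent = ?
    UNION
    SELECT HAS_PART.child FROM HAS_PART JOIN parts ON HAS_PART.parent = parts.name
)
SELECT name FROM parts ORDER BY name
"""
all_parts = [row[0] for row in conn.execute(query, ("polygon_1",))]
print(all_parts)
```

Even for a single triangle, the query must walk the relationship table level by level, whereas a polygon entity holding its edges as an attribute would return them directly.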
A second application which is not well suited for the relational model is object-oriented programming systems. Object-oriented languages, such as Smalltalk, C++ and Java, facilitate the definition of programmer-defined entity types called classes. Individual entities of these entity types are called objects. Complex entities and entity types are composed primarily using two relationships. First, the HAS_A relationship is used to compose simpler objects into more complex objects. That is, objects have parts which are other objects. Second, the IS_A relationship is used to combine entity types into more complex types.
The IS_A relationship, or inheritance as it is called in the object-oriented paradigm, is a powerful new technique introduced by the object-oriented paradigm. The IS_A relationship is a relationship between entity types, rather than just individual entities. If an entity type MANAGER is specified to inherit type EMPLOYEE, then the MANAGER type is a special type of EMPLOYEE (i.e., an IS_A relationship). Every MANAGER entity has all the attributes every EMPLOYEE entity has, plus any attributes that are specified in type MANAGER. This programming mechanism greatly facilitates the construction of complex software applications by making it much less labor intensive and less error prone to model the natural inheritance relationships found in applications.
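The MANAGER and EMPLOYEE example can be sketched with Python classes. The attribute `reports` on Manager and the sample values are hypothetical; the source states only that MANAGER may specify additional attributes.

```python
from dataclasses import dataclass, fields

@dataclass
class Employee:
    employee_id: int
    name: str
    title: str
    salary: int

@dataclass
class Manager(Employee):  # Manager IS_A Employee
    reports: int = 0      # hypothetical attribute specified only in Manager

boss = Manager(employee_id=3, name="Cy", title="Manager", salary=150_000, reports=2)

# The IS_A relationship holds between the types, not just between instances.
print(issubclass(Manager, Employee))
print(isinstance(boss, Employee))

# Every Employee attribute is also a Manager attribute.
employee_attrs = {f.name for f in fields(Employee)}
manager_attrs = {f.name for f in fields(Manager)}
print(manager_attrs - employee_attrs)
```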
In execution, an object-oriented application is a complex network of objects related by the HAS_A and IS_A relationships. The natural notion of data storage for such a system is the notion of object persistence. That is, it should be easy to store an object and all the objects it refers to in a database, thus making the object persist after the program that created it has finished execution. Similarly, it should be easy to retrieve the object when execution resumes.
Attempts to use the relational model to store object-oriented data suffer from one of the same difficulties described above for spatial data: complex, recursive HAS_A relationships are difficult to implement in the relational model. A more severe problem is that the IS_A relationship cannot be implemented directly in the relational model at all. In the context of a relational database, the IS_A relationship is a relationship between relation types. As discussed above, a relation type is not a relation, but is a set of attributes. Thus, the relation types as such cannot be represented or operated on within the model.
A third application area for which the relational model is not well suited, and an increasingly commercially important one, is numerical simulation or scientific computing. Simulation software is aimed at predicting the outcome of complex physical, biological, financial, or other processes by building mathematical models and numerically solving the resulting equations. Defense, petroleum exploration, and medical imaging have been the classical applications for scientific computing. However, as the price of numerical computation has dropped, it is increasingly cost effective to use simulation in a wide range of applications. For example, the manufacturing industry is replacing the conventional design-build-test-redesign product development cycle with a design-simulate-redesign cycle. Similarly, financial trading is directed by market simulations and major metropolitan TV stations produce their own weather simulations, complete with computer generated animations.
Simulations combine features of spatial data and object-oriented data. The results of the simulation usually represent the dependence of some property on space or time. For example, the result may represent the dependence of mechanical stress on position within the product, or a stock price on time, or a temperature on location. Thus, simulation data usually contains embedded spatial data representing the shape of the product, the interesting intervals of time, or the geography of the region of interest. In addition, the space and time dependent properties computed are usually complex mathematical types with important IS_A relationships between them.
In addition to sharing these features with spatial data and object-oriented data, simulation data has another essential feature: the data sets tend to be very large. The amount of data that must be processed in a simulation is directly proportional to the desired accuracy. The quest for accuracy always requires that the simulations be run at or over the limits of the computational resource.