The earliest business-oriented data processing applications consisted of records of information content collected into flat files based on like record structures. Each department within an enterprise attempted to independently computerize their data-intensive functions (e.g., invoice processing, customer billing). Aspects of the enterprise's data were repeated or replicated at will, as these flat files of information were specifically engineered for each aspect of the separate software applications. There was a massive level of data redundancy in the flat files, which made information very difficult to update completely and often resulted in software applications giving wrong answers or taking inappropriate actions. Flat file computing also meant that there was no enterprise view of data across the spectrum of departmental information. Each software application used a unique physical data storage design and basic device access methods supplied by the hardware manufacturer to implement physical data management within the application. The flat file approach to data processing was extremely error-prone and costly, and provided no strategic view of the enterprise.
This situation led to a universal quest to separate physical data management from the functional software applications and create independent, reusable applications whose sole purpose would be physical database management. These applications became known as Data Base Management Systems or DBMSs.
Because enterprise data often has rich structural complexity that is almost impossible to capture in flat files (i.e., all structural complexity is flattened into two dimensions and hence the term ‘flat’), the first attempts at independent DBMSs used smaller records of largely non-redundant information with either hierarchical or networked ‘pointers’. To choose among these DBMSs it was necessary to decide whether the data was inherently hierarchical or networked. Both approaches proved to have limited scalability. For these and other reasons, hierarchical and network DBMSs met with limited commercial success.
In 1970, E. F. (Ted) Codd of IBM published a relational model of data storage and retrieval for large shared data banks (CODD, E. F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13, 6 (June, 1970), pp. 377-387, hereafter referred to as [Cod70]). Ted Codd's model for organizing database records with limited redundancy is a set-theoretical (mathematical) model that treats data contents the same whether its underlying structures are hierarchical or network-oriented. Though elegant and universal, Ted Codd's approach was not largely adopted through the early 1980s because mainframe computers of that day were incapable of supporting the input/output intensity required by a relational DBMS (RDBMS). As hardware capability increased through the 1980s, most industry practitioners began to line up behind Codd's RDBMS as the best means to manage large shared data banks.
While there are hundreds of thousands of business-oriented software applications today that still operate with flat files and hierarchical or network-based DBMSs, these methods of managing enterprise databases are considered outdated. Information update is now the province of on-line transaction processor (OLTP) sites that are implemented by RDBMSs. Large-scale complex query is increasingly being handled by specialized applications called data warehouses (DWH) that operate on vestiges of Codd's relational model. Ted Codd's relational model today stands astride the world of enterprise computing.
Despite near-universal acceptance as a database management mechanism, there are huge problems with the relational model. The most significant is the lack of relationship definition. Relational models contain almost no clues as to a) how they were created; b) what the various relationships among the data actually are; or c) how the relationships are reflected in the model. Most practitioners and experts who work with the relational model are destabilized by its lack of semantic context. As a result, many experts, including Ted Codd himself (CODD, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Transactions on Database Systems 4, 4 (December, 1979), pp. 397-434, hereafter referred to as [Cod79]), have attempted to create semantic data models (HULL, R., KING, R., Semantic Database Modeling: Survey, Applications, and Research Issues, ACM Computing Surveys, Vol. 19. No. 3, (September, 1987) pp. 201-260, hereafter referred to as [Hull87]). For example, a semantic data model posed by Peter Chen in 1976 (CHEN, P. P., The Entity-Relationship Model—Toward a Unified View of Data, ACM Transactions on Database Systems 1, 1 (March, 1976), pp. 9-36, hereafter referred to as [Che76]) is almost universally accepted as the means to determine the semantic context of a relational model.
There have been many serious attempts to implement semantic data models (SDMs) as DBMSs. None have reached the commercial marketplace, partly because they are more difficult to implement just the required optimizers are difficult propositions), and partly because they are typically more difficult to use in real business settings. Also, the more semantic context a DBMS captures, the more difficult it is to make corrections and changes. Though such changes may have less impact than when the semantics are managed by the functional application, efficient support for semantics in a DBMS is very difficult. Getting a value-based or record-based DBMS to work takes significantly less resources. Another notion as to why semantic data models have not done better in the commercial marketplace is that semantic context does not do much to improve performance (and may actually retard performance). Given that RDBMSs are challenged for performance, it is probably not surprising that more semantic context has not found its way into value-based DBMSs.
With rare exceptions (e.g., Sentences from Lazy Software Ltd), semantic data models are not used as DBMSs. Unlike humans, a computer has no need to understand the semantic context of data. A computer needs only for the DBMS schema definition to be efficient for maintenance and execution. If semantic context helped a computer to be more efficient, there would be a reason to include semantic information in the execution system. There is no reason to believe that semantic context would improve the performance of the relational model. For example, there are no RDBMSs that also implement Peter Chen's E-R semantic data model [Che76]. There are numerous semantic data models and specialized implementation techniques, but there is only one widely accepted means of implementing enterprise databases—an RDBMS running on the relational model.
Background Art—Relational Model:
Ted Codd's relational model (originally [Cod70] but defined for current context in [Cod79]) is the foundation for relational DBMSs, which are today the prevalent implementation for enterprise systems. In a relational representation, which is shown in Drawing 1., “Relational Model”, all data is folded into two-dimensional relations (tables) consisting of attributes (columns) and tuples (rows), such that some combination of the attribute values will uniquely identify each tuple (primary key or PK). Such a 2D table is called a relation (hence, the term relational. The remaining attributes (columns) in a relation (i.e., those that are not part of the PK) are called the non-key attributes or NKs of the relation. Within each relation, integrity is ensured when the NK attributes are dependent upon the whole set of the designated PK attributes and nothing but the whole set of the PK attributes (normalized, normal form, third normal form and 3NF). Creating lookup relations that define the range of acceptable values for an attribute ensures attribute-level integrity.
Relations (tables) are connected to one another only by shared combinations of identifying attributes. PK attributes migrate from relation to relation, where they are known as foreign keys (FKs). FK migration begins with the primary relations, which are the relations with only one attribute in the PK and no FKs. Relations with multiple attributes in the PK are called associative relations. Information is typically extracted from a relational model by projecting and/or selecting, then joining the resulting relations according to the FKs and, finally, filtering the results table based on any attribute value limitations specified in the transaction. Structured Query Language (SQL) can specify the data manipulation involved in projecting, selecting, joining and filtering. When an enterprise database structure has been captured in its correct (canonical or normalized) relational format, there is no redundancy in the stored NK attributes.
The best definition of the relational model can be found in [Cod79], Section 2. “The Relational Model”, which defines succinctly that which is generally accepted as the relational model. The remainder of the paper ([Cod79]) proposes an extended relational model to capture more of the meaning of the data. Codd's extended relational model sought to move the relational model toward a more semantic data model with at least four “personalities”. The extended relational model was not adopted by industry.
Advantages of the Relational Model:
The relational model has a highly evolved structure that facilitates data retrieval and a ero data redundancy goal for NK attributes that ensures data integrity. The relational model also separates logical data representation from the physical implementation. The data repository can be plied with a relational calculus implemented by a highly evolved and broadly standardized SQL, for example, as specified in [SQL92].
Relational Model Advantage—Application-Oriented Structure:
The process for evolving a relational model causes the data model to exactly mimic the business model. Each relation is created to exactly support some number of end-user views of data that are required by the software application. Once the correct relational tuples are located in the relational model, all of the necessary attribute values (instances) for the particular user view are directly contained within the structure of the tuple. No additional data need be accessed or retrieved (except, sometimes, lookup values that can be stored in memory).
Relational Model Advantage—Limited Data Redundancy:
The normal (canonical) form of the relational model ensures that no tuple of attributes is stored more than once. This has the advantage of reducing the amount of data to be stored and ensuring that updates to non-key attributes will be performed without anomalies. Each question posed to an RDBMS should only be answered in one way (when multiple replicated tuples are involved in a data model there exists the possibility that the data replicas will be out-of-synch and, thus, provide wrong answers to questions). The elimination of data redundancy from the flat file model was one of the driving features of the relational model.
Relational Model Advantage—Relational Calculus and SQL:
The relational model has a highly evolved relational calculus for manipulating the elements of the model and a broadly standardized SQL for plying the relational data structures.
Problems Involved in the Relational Model:
The principal problem involved with the relational model is that its fundamental data structure is both highly evolved and application-dependent. In simplest terms, the “row×column” structure of a relation lacks symmetry. The attributes that are bound into the various relations, the designations of the PK attributes and the nature and structure of the interconnecting FKs are as unique to a given enterprise's operation as fingerprints are to a person. Relational application software, written in SQL, relies completely on the structure of the relations and business rules.
Each aspect of the relational model maps to many different views/interfaces in the application and each view/interface in the software application interfaces with many different aspects of the relational model. This many-to-many relationship between the application software and the representation of the data has the effect of binding together the application and data so tightly that a minor change in either can have major consequences on the other. Because the relational model's application-dependent data structures (ADDS) are unique, complex and fragile (resistant to change), the ADDS feature of the relational model causes many severe problems in the software applications.
Relational Model Problem Involved—ADDS vs. Relational COTS:
ADDS makes the distribution and maintenance of commercial off-the-shelf (COTS) application software a very difficult business proposition. Because each application is inherently unique (by virtue of the ADDS uniqueness), it is difficult to achieve the economy of scale necessary to sustain a business distributing COTS applications. If the data model is made more flexible (complex) to accommodate a range of variation in the underlying business model (such as varying operational practices within an industry), the artifacts of this flexibility will cause the application software's complexity to increase, proportionally, to handle the increased operational flexibility (because SQL is tightly bonded to the data structure). Although any given implementation only uses a small portion of the ADDS-induced flexibility, every implementation must be able to fully identify and understand the flexibility. The relational equation—flexibility implies unnecessary complexity—is inescapable with respect to any given implementation of the COTS software.
Relational applications that are built on such flexible relational model data structures often suffer operational difficulties and failures that are not caused by any aspect of the particular enterprise's business model, but rather are caused by unneeded and unused artifacts of flexibility (included in the COTS application to accommodate competitors' business models).
In addition, identifying, understanding and ignoring the unused artifacts of induced flexibility in the application and the data model necessarily consumes the most talented human IT resources of the enterprise, as well as other institutional resources such as data storage, processing, training and documentation. As the COTS software vendor's customer base continues to grow, the need for flexibility and attendant complexity increases until the existing customer base begins to reject the COTS application. This often results in the failure of the COTS application vendor.
It is fair to state that the ADDS inherent in the relational model are the root cause of many if not most commercial failures of COTS application software vendors. [i.e., a) commercial viability requires vast flexibility, b) because of ADDS, vast flexibility implies unnecessary complexity (with respect to any given customer), and c) ever increasing and unnecessary complexity causes customers to stop using the application.]
Relational Model Problem Involved—Long Frozen Baselines:
For a given relational software application, the complete structure of the relations must be known/defined before the coding of the application software can begin. Thus, because of ADDS, long frozen requirements baselines are necessary so that the relational modelers can ascertain and document the specifics of the enterprise's data structure (it is not uncommon for enterprises with complex data structures to require frozen requirements baselines measured in years). The complexity is so extreme that it is often necessary to use semantic data models to first describe the data model in abstract terms. Because the underlying data model of the enterprise does change during the period of the frozen baseline (sometimes considerably), relational application software typically lags behind the true enterprise model, often to a substantial degree.
Relational Model Problem Involved—Resistance to Change:
Because of the ADDS and the tight bond with SQL, relational software applications are highly resistant to even the smallest changes in the structure of the relations. The most sensitive problems in this area are involved with changes in the single PK attributes of the primary relations. Any given PK for a primary relation can migrate to become FKs in a substantial number of relations within the enterprise's relational model. Any changes in such PK attributes can induce changes throughout a relational model, as well as the associated application software. This implies that minor changes can be very costly to make, and necessary changes are often avoided with costly workarounds (because the cost to correct the problem in the application is even greater). As a direct result of ADDS and FK migration, relational applications typically have years of unapplied application change requests in backlog. This is why COTS enterprise applications based on the relational model typically have a low degree of fit to the customer's needs.
Relational Model Problem Involved—Structural Degradation:
On the typical enterprise application, RDBMSs are faced with a physical reorganization problem relative to the basic tabular organization of information. The problem, which is attendant to the ADDS, is that relational information is stored in a highly-organized form which should be at least clustered if not physically contiguous on the storage medium. However, a table is only represented in its pristine structural form on the storage medium when it is first loaded (or re-loaded). When rows are deleted and inserted, this clustered physical arrangement on disk cannot be maintained. Over time, the RDBMS will gradually lose much of its throughput efficiency because of the relational model's reliance on structured data storage. The enterprise may simply accept this loss of efficiency or correct it by periodically reorganizing the tables on the physical storage medium.
The problem of structural degradation is inherent in relational data management. Reliance on complex physical data structures is the problem. All complex physical data storage structures degrade over time. When the DBMS relies on the physical structures for organization, processing efficiency degrades in proportion to the erosion in the physical structures.
Relational Model Problem Involved—Need to De-Normalize:
Normalization is considered a key theoretical strength of the relational model. In practice, it is one of the great weaknesses of the model. It is a rare implementation of the relational model that can afford the inefficiency involved in full normalization. The join operation is extremely inefficient and the greater the degree of normalization the more joins must be performed. De-normalizing the structures of the relational model provides for operational efficiency but leads to update anomalies and significant redundant data that then must be plied to support operations. This is especially true of the requirement to keep the data replicates updated in synch with the copies of record. All of this complexity must be added to the application. Complex queries require so many joins, that databases on the relational model must often be de-normalized further just to support complex query. When the update processors (OLTP) cannot support the added redundancy (de-normalization) that is required for complex query, it is frequently necessary to deploy separate DWH sites just to support the complex query.
Relational Model Problem Involved—Elemental Security:
The security requirements of enterprise applications are difficult to predict. Customers often require that security classifications (for access to data) be at the row or column (attribute) level. In some cases, customers may require security at the ‘cell’ level of a table (row×column). There are no practical ways to implement security labels within the cells of two-dimensional table structures (certainly none that are efficient). This is so because security is actually a different, third dimension to the two-dimensional table. A certain ‘row-column’ combination (two-dimensions) might require no security, or one or more security labels. If the security dimension is collapsed (folded) onto the two-dimensional table, as is required by the relational model, the result is not good.
This is an example of one area where induced flexibility makes the relational model untenable. To make applications broadly useful, the underlying data model must allow for cell-level security. Yet, cell-level security may only be invoked in a few instances. The rigid relational table structure would have to be built to handle security labels on all instances (cells). This approach is enormously inefficient. One alternate possibility is to create adjunct table structures (category relations) with only the row-column values that require the security labels. In a secure mode, this doubles the number of reads that are required to process transactions.
The more dimensions of complexity that exist in the application (such as security) the more difficult it is to implement the application in a two-dimensional relational depiction.
Relational Model Problem Involved—Inability to Interface:
Because of ADDS, the ability to read the information in a relational repository is effectively limited to the application for which the repository was erected. This is ironic because Codd's original intent for the relational model [Cod70] was to facilitate access to large shared data banks. To share a relational database that was erected for/by a different application and use its data effectively, a software application must learn the complicated ADDS of the ‘owning’ software application. Over time, as each application changes its business model and its underlying ADDS, all other applications that must share the application's data must be similarly modified to keep pace. For this reason, making even one minor change in a tapestry of interconnecting relational applications can be like throwing a rock into a still pond. The waves of change must propagate throughout the interconnecting applications. This has a profound effect on the structure and composition of relational applications. For example, most relational software applications do not eliminate unneeded attributes from their relations. It is more cost-effective to ignore that which is no longer necessary than it is to make such minor changes and suffer the wide-ranging effects upon the structure and composition of all interfacing applications. Over time, the persistence of unneeded data structure and attribution in a relational implementation can cause additional problems in complexity, storage space, and operations.
Relational Model Problem Involved—Inability to Model Complex Interrelationships:
The relational model has no real mechanism for modeling complex interrelationships among data. The most basic relationships are encapsulated in rigid sets of attributes. All other relationships are simply “embodied” within the FKs that are repeated among the relations. Therefore, the relational model does not define in any semantic manner the relationships among the relations. In 1976, Peter Chen [Che76] defined one of the first semantic data models (the Entity-Relationship or E-R model) to extend the relational model to address this problem. To define and explain the relationships among PKs and FKs in the relational model, the E-R model adds the notion of named “business rules” (among entities) with cardinality (e.g., 1:1, 1:M, 1:0,1M, etc.). The relational model and the E-R models are so generally intermixed in the real world that most people, and indeed most experts in the field, view them as one and the same data model. In [Cod79], Ted Codd both corrected this notion and attempted, himself, to extend the relational model to a semantic data model to encompass the definition and classification of complex relationships among the relations.
In short, the relational model presents a means of depicting data (an end-state description) and an add-insert-delete and query calculus. The relational model is not a means of understanding, analyzing, defining, designing, describing or improving the complex interrelationships that exist among the data. To do that, generally requires a semantic data model, of which there are many to choose from (including Chen's E-R model [Che76]).
Relational Model Problem Involved—OLTP vs. DWH:
When large and very large databases are implemented with the relational model, the contention between query and update transactions can cause severe performance difficulties. Mission critical enterprise applications generally require that updates be applied in real time. This is known as On-Line Transaction Processing (OLTP). However, enterprises also need to query the database, and many of the required queries are elaborate and/or complex. To support such queries on large databases requires secondary indices on each of the columns that are mentioned in the SQL “select” and “where” clauses. Because these secondary indices must be maintained during updates, they add substantial overhead to the update transactions.
Given that joining tables can be time-consuming and resource-intensive, complex queries also benefit from reducing the number of physical tables that must be accessed. A popular approach is to pre-join (or de-normalize) the tables to reduce the number of joins. Though most OLTP sites do de-normalize the databases somewhat, to improve performance, they typically do not do nearly enough de-normalization to support complex queries. Moreover, as the amount of de-normalization and secondary indices increases, OLTP response times get appreciably worse. Because of this contention, complex query support on large and very large databases has been increasingly implemented on separate computing suites known as data warehouses.
OLTP installations are optimized to best support real-time update, while still supporting enough queries to meet the tactical objectives of day-to-day enterprise operations. DWH implementations are optimized to support unrestricted complex query on large and very large databases, and as infrequent and minimally disruptive of an update process as is consistent with strategic assessment and decision support. OLTP supports the least amount of columnar inversion of the database tables (secondary indices) that is minimally consistent with tactical needs. DWHs support the maximum amount of columnar inversion of the database tables that reasonably contributes to efficiency in supporting complex queries. The current direction of the OLTP and DWH markets is to drive these two computing paradigms farther and farther apart, not closer together.
There would be a sizable advantage to conducting OLTP and DWH on the same copy of the database, provided that the solution was cost-effective. It would, for example, simplify training, lower maintenance costs, improve the currency and accuracy of data, and put strategic support within reach of many enterprises that today cannot afford both an OLTP and a DWH solution. It is simply that the DBMS industry views the separation as a necessary evil.
Background Art—Semantic Data Model:
As mentioned above, the relational model provides no mechanism for modeling the complex interrelationships either within or among its relations. Like the hierarchical and network data models that came before it, the relational model is termed a value-based or record-based model. Semantic models were first introduced as tools to understand, analyze, define, design, describe and improve data interrelationships. Complex data schemas could be modeled in a semantic tool and then translated into the relational model for implementation on an RDBMS.
The first published semantic data model, the Semantic Binary Data Model (SBDM) appeared in 1974, (ABRIAL, J. R., Data Semantics, Data Base Management, North Holland, Amsterdam, (1974) pp. 1-59, hereafter referred to as [Abr74]). In 1976, Peter Chen published the most commercially successful semantic data model of all, the E-Rmodel [Che76], to address the complete lack of semantic context in the relational model. So successful was the E-R data model at extending the relational model, that most database practitioners today regard the relational model and the E-R model as one and the same model. They are not. The relational model is implemented by an RDBMS while the E-R model is implemented in an entity-relationship diagram. The “business rules” introduced by the E-R model are not supported, per se, by an RDBMS. The success of the E-R model (compared to the more robust semantic models developed by, among others, Codd himself [Cod79]) is rooted in its simplicity.
One grouping of semantic data models that is related to this invention including a Knowledge Operating System or KnOS is the one that describes data as two constructs: entity sets and binary relations. These semantic binary data models(e.g., [Abr74], BRACCHI, G., PAOLINI, P., and PELAGATTI, G., Binary Logical Associations in Data Modeling, Modeling in Data Base Management Systems. North Holland, Amerstam, (1976) pp. 125-148, hereafter referred to as [BPP76], DEHENEFFE, C., HENNEBERT, H., and PAULUS, W., Relational model for a Data Base, Proceedings of the IFIP Congress, (1974) pp. 1022-1025, hereafter referred to as [DHP74], HAINAUT, J. L., and LECHARLIER, B., An Extensible Semantic Model of Database and its Data Language, Proceedings of the IFIP Congress, (1974) pp. 1026-1030, hereafter referred to as [HaL74], and SENKO, M. E., Information Systems: Records, Relations, Sets, Entities, and Things, Information Systems 1, 1, (1975) pp. 3-13, hereafter referred to as [Sen75]) treat each binary relation as an inverse pair of possibly multi-valued functions (see Drawing 6, “Semantic Binary Data Model”). The SBDMs and the present invention have the same core construct (sets of binary relations in inverse pairs of possibly multi-valued functions), but the present invention is not a semantic data model, per se, in that it does not seek to define relationships. Instead, the present invention treats the binary relations as basic, unnamed “associations”, in much the same way that the relational model implements the net results of the E-R model.
In short, people crave semantic context; computers and DBMSs do not. DBMSs benefit only from efficiency and maintainability. If the inclusion of semantic context makes a DBMS less efficient or the database/application less maintainable, it is a weak candidate for addition to a DBMS. Unfortunately, semantic data models do both (less efficient and less maintainable). The more semantic context a DBMS captures, the more changes are required when one discovers either that mistakes were made or that the enterprise model has changed. If this were not true, all of the commercial RDBMSs would have absorbed at least the E-R semantic data model [Che76] into their basic framework.
Advantages of the Semantic Data Model:
One of the better treatments of semantic models was produced by Richard Hull and Roger King in 1987, “Semantic Database Modeling: Survey, Applications, and Research Issues” [Hull87]. Semantic data modeling techniques provide three principal advantages over record-based or value-based models such as the relational model.
Increased separation of conceptual and physical components. In the relational model, the access paths available to end-users tend to mimic the logical structure of the database schema directly (by comparing identifiers in order to traverse from one relation to another). The attributes of semantic models are used as direct conceptual pointers. Decreased semantic overloading of relationship types. The relational model only has two constructs for recording relationships among objects: a) by binding attributes within relations and b) by using the same values within two or more relations. It is for this latter reason, that the relational model is often called “value based”.
Availability of convenient abstraction mechanisms. Semantic models provide a variety of convenient mechanisms for viewing and accessing the schema at different levels of abstraction.
Peter Chen's Entity-Relationship (E-R) model [Che76] is the most widely used and widely recognized semantic data model. The E-R model is the standard semantic data model for understanding, analyzing, defining, designing, describing and improving a relational model.
Problems Involved in the Semantic Data Model:
Semantic data models have been widely used in schema design but have not experienced much commercial success as a means of implementing computing solutions. There are several reasons for this.
Most of the semantic data models are not implementation models (they are used for schema design). For example, the E-R model is not an implementation model.
Many semantic data models are used by the object-oriented (OO) behavioral computing paradigm. OO has suffered from a lack of success because many enterprise data models do not sustain a broad-based crisp boundary condition (the core requisite of OO).
Many semantic data models have a narrow focus of interest, which is not well suited to commercialization as an implementation model.
Many semantic data models are too complex to gain wide-scale usage. Codd's own semantic extension to the relational model [Cod79] gained no following because of its theoretical complexity. The very simple E-R model is the standard semantic data model for defining relational models.
The simplicity inherent in the relational model permits the development of powerful, non-procedural query languages such as ANSI SQL. Many semantic data models can be mapped to a relational model for implementation, thereby taking advantage of the general computing features of relational modeling. As a result, many semantic data models are implemented on the relational model by an RDBMS.
There is no body of expert opinion today to suggest that semantic data models will ever replace record-based data models (relational, network, hierarchical) for implementation. But, it is likely that semantic data models will always have a place in the field of schema analysis and design.