1. Field of the Invention
Database accessing that supports identifying relations amongst individual data elements (as distinct from the efficient accessing of discrete, individual data elements) has grown in power and utility. Businesses are able to obtain valuable new business insights by using methods for accessing and viewing data that support combinations, re-combinations, or analyses of both existing data elements and structures, combinations, or relations of said data elements. Several major corporations (e.g. Oracle Corporation) have shown that a relational database (“RDB”) and a relational database management system (“RDBMS”) that enable more flexible database accessing are valuable.
This invention primarily implements a methodology for uniform handling of data elements, structures, and relations denoted in and forming a relational database by the relational database management system or by users thereof without requiring explicit and hardware-dependent memory management, though it also handles the relations manipulated by and in a relational database or by users thereof so as to optimize query processing, table management, transaction handling, and distributed or remote database maintenance.
2. Description of the Related Art
A Relational Database Management System (‘RDBMS’) is a software system for creating, maintaining, and using a Relational Database (‘RDB’). An RDB is a means for representing data elements and operations on said data elements via the relational model (or some variant on the relational model such as the commonly available SQL packages), where the RDB as a whole serves as a logical model for the sub-portion of the real world instantiated in the RDB. The RDBMS includes, among other elements, both a System Catalog that contains the definitions of the logical model as represented in the physical memory, and the respective denotations thereof which serve as symbolic abstractions for the relations and constraints comprising the RDB; and a Query Language Processing Engine for executing relational request(s) wherein said requests contain certain allowed processor operations. The allowed processor operations include logical operations (e.g. ‘AND’, ‘OR’, ‘NOT’) and relational operations (e.g., join, product, difference, divide, intersection, restriction, projection, aggregation, union, grouping, and partitioning); they may also include mathematical operations, including both direct processor function calls and mathematical algorithms (e.g. ‘PLUS’, ‘SUM’, ‘AVERAGE’); and allowed character, text, and graphical operations (e.g. ‘NAME’, ‘CHART’) provided for within the RDBMS for data input, manipulation, and output. The System Catalog and its contents are accessible to, and are often modifiable by, the Query Language Processing Engine. System Catalogs are implemented in various forms, as is well known to those familiar with the art. For example, the System Catalog may be human-readable, compiled or otherwise embedded in programmatic code, encrypted, stored as relations, may be static or active, and so on. Either or both of the System Catalog and the Query Language Processing Engine may be implemented internal to the RDB, external to the RDB, or in some combination of internal and external implementation.
There are numerous functionally equivalent symbolic abstractions, well known to those familiar with the art, that can be used for expressing and manipulating the semantics of sets including, for example, those for set theory, predicate logic, relational algebra, and relational calculus. A Set is a collection of data elements, representable by and satisfying a logical predicate (often referred to as a ‘membership function’ or ‘membership criteria’), wherein each data element belonging to a set shares at least one property that is common to its set's members, yet uniquely distinguishes them from any other data element not belonging to that set; and the logical predicate satisfied by each member describes the necessary and sufficient properties for belonging to that set. An abstract symbolic expression such as a logical predicate which either fully or partially defines a set's members is referred to here as a Membership Abstraction. The logical predicate contains one or more variable terms (‘predicate variables’), each of which may take values pertaining a property of the set; and may contain one or more constant terms as well. Every element of a set is distinguished by some property so that a particular element occurs at most once in any particular set; every element is unique. The Relational Database (‘RDB’) is a database wherein the data is organized into rows (known formally as ‘tuples’) which are further grouped into Sets known as Relations, each said Relation having (either implicitly or explicitly) a distinguishing property or properties grouping a Set's elements together and distinguishing them from non-members; and the elements of the Set being the rows of the Relation. The standard instantiation of a Relation is a table. The single-variable terms of the logical predicate pertaining to the Set and which the Relation represents each refer to a shared property of the Set and are represented by a column (also known as an attribute) of the Relation; the number of predicate variables in the logical predicate is the number of columns in the Relation which represents the Set. The values which a particular predicate variable may take within a variable term of the logical predicate are the permissible values of the Relation's column; that is, each column is defined as taking the values of a particular ‘domain’ (a set of values), and the value of a particular column in a particular row being exactly one such value. Multi-variable terms in the logical predicate contain only variables that are each individually represented by some column of the Relation. The logical predicate must evaluate to ‘True’ on substitution of each predicate variable therein with the corresponding values in the columns of any particular row of a Relation. Relations typically have a time-varying membership; at any given time only some subset of rows belong to the corresponding Set of all those that might permissibly belong given solely the terms of the logical predicate whose Truth or Falsity depend only on recorded values of data elements. In order to capture the time-varying aspect of Set membership, the logical predicate may be considered as being augmented with a special constant term called an ‘assertion predicate’ by which a suitably authorized user may assert that a particular permissible member either does or does not belong to the Set. A relational insertion operation thus corresponds to identifying the set of zero or more potential member rows that satisfy some logical condition or conditions and setting the value of the assertion predicate to ‘True’ for these rows; a relational deletion operation corresponds to identifying the set of zero or more member rows that satisfy some logical condition or conditions and setting the value of the assertion predicate to ‘False’ for these rows. In practice, no RDBMS implementation of insertion and deletion operations have been manifestations of relational insertion or relational deletion as defined above; often the RDBMS implements row by row modifications (including deletion, insertion, or update) of the Relation; and the RDBMS offers no explicit support for the assertion predicate.
The uniqueness of the rows in the Set pertaining to the Relation is determined entirely by the values in those rows; two rows in a particular relation are not unique if the values of corresponding columns are identical for every column value. Each Relation is denoted by the RDBMS in a form that serves as a symbolic abstraction that can be manipulated via relational logic. In practice, most current RDBMS implementations permit access and manipulation of ‘tables’ (the standard instantiation of relations). Some tables are not strictly Relations inasmuch as they permit duplicate rows, rows that contain undefined property values (often designated with special markers called ‘nulls’), rows with dissimilar semantics, default values, and so on. The processing of requests involving such tables is (1) less uniform than that for Relations, (2) not prescribed by the relational model, (3) may result in anomalous results not explicitly predicted by the relational model, and (4) unique to the particular RDBMS implementation.
A Relation is commonly known to and represented within an RDB as a table having rows and columns, and is a particular type of Set whose members are both rows and satisfy both (1) the logical predicate defining potential membership in the Set and referencing no other sets, and (2) the assertion predicate, a predicate asserting that those members belong to said Relation (i.e., are actual, rather than just potential, members of the Set). A Relation Predicate is the logical predicate corresponding to a Relation and describes the necessary properties for a row to belong to the Relation. All rows having said necessary properties could, but need not be, members of the Relation; while rows with the necessary properties are potential members of the Relation, if and only if these potential members have also been asserted to be members of the Relation by some suitably authorized user of the RDBMS.
For example, an ‘Employees’ Relation might have columns for Employee Number (ENUM), Employee Name (ENAME), Employee Salary (ESAL), and Employee Department Number (EDEPT). The ‘Employees’ relation will have a Relation Predicate Emp(x) that stands for the logical definition of the Relation; e.g. ‘Emp(x)’ means that:‘x is an Employee AND x has been assigned Employee Number ENUM AND x has Employee Name ENAME and x earns Employee Salary ESAL and x works in Department Number EDEPT’. The actual members of the ‘Employees’ Relation are those rows that have been entered into the RDB, and therefore both have the properties specified by Emp(x) and have been asserted to belong to the Relation ‘Employees’. (Note that x is a symbol representing an arbitrary entity commonly referred to as an ‘employee.’)
In practice, the Relations in an RDB are most often defined implicitly, with row membership in a Relation being specified on a combination of user assertion and satisfaction of Constraints. Most RDBMS's use Constraints to manage their data. (Date & McGoveran, “How to Avoid Database Redundancy”, Database Programming & Design, Vol. 7 No. 7, July 1994, p. 46, 48.) A Constraint is a condition that the RDBMS tests against for a truth value; it is also a means for ensuring the RDB's integrity, as a Constraint is used to ‘constrain’ the RDB's data and Relations to those permissible (according to the designers and builders) and according the proper interpretation of the RDB's meaning. Constraints are defined, classified (e.g., domain, column, row, relation, or multi-relation), enforced, maintained, and accessible to the RDBMS. Each Constraint may be expressed as a logical predicate or its equivalent, and so denoted within the System Catalog as a symbolic abstraction. Domain constraints are used to determine whether a data element belongs to a given domain. (E.g. is the value represented a character? a number? A computer distinguishes between the numeral ‘1’ and the number ‘1’, between the letter ‘x’, a potential but undetermined set member ‘x’, and a variable ‘x’.) A column constraint (also known as an attribute constraint) requires data elements within a particular column to belong to a specified domain (i.e. valid entries are those possessing a specific attribute; e.g. American salary values are in ‘dollars’). A particular relation constraint limits membership in a particular Relation (all members of this Relation satisfy the conditions of the constraint). And multi-relation constraints, also referred to sometimes as ‘database’ constraints, are conditions which must be satisfied by multiple relations. A referential integrity constraint is a particular type of database constraint. Within a transaction, RDBMS programs may check to see whether domain, column, and relation constraints are satisfied after each individual, subordinate operation, but must check multi-relation constraints after all operations on the referenced tables are completed (since a failure after an intermediate operation might be corrected by a subsequent operation on one of the referenced relations). In practice, the combination of explicitly defined constraints known to and enforceable by the RDBMS is incomplete in that it does not completely define the membership of the Relation, requiring a combination of extreme care on the part of the user and external filtering of attempted updates using, for example, application programs. In practice, errors due to incomplete or inaccurate implementation of constraints are common.
RDB designers and users could refer to the logical description of a Relation within the RDB by using a Relation Predicate. (Date & McGoveran, “Updating Joins and Other Views”, Relational Database Writings 1991-1994, Part II, Chapter 6, pp. 267-284.) A Relation Predicate is a portion of the logical predicate for the Set which the Relation represents, including all terms of that logical predicate excepting the assertion predicate. A Relation Predicate properly expresses the correct (as asserted by the RDBMS's user) interpretation of a relation; i.e. it is the expression of the ‘meaning’ of the Relation. By extension, it is the expression of the ‘meaning’ of a table in that RDB insofar as the meaning of that table may be made unambiguous. The Relation Predicate will join together the logical and relational predicates that constrain the relation's data, and allow the user to understand them. For example, a one-row, three column Table ‘Date’, with values ‘01’, ‘01’ and ‘02’ uses three domain constraints (numeral, numeral, numeral), three column constraints (month, day, year), and one relation constraint (dates in the current century), to enable a proper interpretation of these values as “Jan. 1, 2002”. At least that would be the interpretation until the year 2100, when the default meaning could reasonably become “Jan. 1, 2102”. The Relation Predicate for ‘Date’ can be expressed as ‘E(x, y, z), x is a member of Months, y is a member of Days, z is a member of Years’. ‘Months’ and ‘Days’ and ‘Years’ are domains having logical predicates that are further defined, e.g., x is a member of domain Numerals & ‘1<=x<=12’; y is a member of domain Numerals & ‘1<=y<=31’; z is a member of domain Numerals & ‘1999<z<2100’. The Relation Predicate for ‘Date’ might also include a set of conjuncts properly constraining the value of ‘Days’ according to the value of ‘Months’, e.g. if x=1 then y<=31’ & ‘if x=2 then y<=29’, and so on. Furthermore, the Relation Predicate for ‘Date’ might constrain the value of ‘Days’ according to the values of ‘Years’ and ‘Months’ so as to account for leap years, e.g. ‘if x=2 & z modulo 4=0 then y<=28’. The logical conjunction of these constraints define the Relation ‘Dates’ and any data contained therein. In practice, no RDBMS implements an algorithm for creating or capturing Relation Predicates, extensions to the System Catalog to store Relations Predicates, or means to use Relation Predicates for any purpose.
In broad terms, an RDB is a logic-based model of truths asserted about the real world, and the RDBMS is the means whereby that model, and its logic, is manipulated and maintained within the computer's physical reality (and limitations). These truths include discrete, atomic, data elements and combinations established by the RDBMS's designers, builders, and even users. The value of an RDB derives from its capabilities for logic-based recombination and manipulation using the ‘relational model’ and working with and through Relations; that value is significantly and negatively affected by anomalous or non-uniform or unpredictable behavior, and especially as regards updates or other operations on relations.
Current RDB's distinguish between Base Relations and Derived Relations. A Base Relation is one where the RDBMS maintains a direct corollary between the physical organization of the computer's memory and the logical organization of a Set's elements. A Derived Relation is a representation of a Set whose members are logically derived from, and represent a combination from, those members of other Sets that further satisfy the logical predicate that both details the necessary and minimal properties of the derived Set; it will also have (either implicitly or explicitly) both a logical and relation predicate that distinguishes those elements from others which lack those necessary and minimal properties, assertion of belonging to the Derived Set, or both. In practice, a Derived Relation is defined by relational and logical operations on other Relations, any of which may themselves be Derived Relations. A Derived Relation may also consist of data elements who are stored in physically-separated portions of the computer's memory. Derived Relations may be any of several types, e.g., Views (defined below), materialized views, ‘snapshots’, replicas, and query results. Derived Relations are particularly valuable because the assertion of belonging can arise implicitly though the computer's logical recombination and analysis of Base Relations, rather than depending entirely on human input.
There are many ways to combine the rows and columns of Base Relations. Also, a Derived Relation may be defined or created via a relational expression that references any combination of Base Relations, other Derived Relations, or both Base and Derived Relations. In such combinations, each of the referenced relations in the combined relational expression is known as a Source Relation for the combination Derived Relation; the Derived Relation is sometimes referred to as the Target Relation; and the Derived Relation is Dependent upon its Source Relations. Most users, however, deal not with the Base Relations as such, but work from and with their limited, often query-driven, report-driven, or software application-driven view into a RDB.
A View is an named relational and logical expression representing data that is made visible to the user in a form that is usually different from the form of the Source Relations and convenient to a particular use or uses, i.e. it is the user's ‘view into the relational database's contents’. A View has a Relational Predicate (and thereby expression in the relational calculus, relational algebra, and predicate calculus). A View can be understood as a ‘virtual relation’, because the data belonging to a View need not be explicitly stored in the RDB as a distinct table; in fact, a View may represent one or more relational operations on a single relation or on a plurality of relations. The data belonging to a View is derived from data belonging to one or more other relations when the View is manipulated by name in relational expressions, and is transient in the sense that it does not exist if the data belonging to those other relations does not exist.
Views are one expression of a Derived Relation, as stated above. Views differ from other types of Derived Relations in that Views are named virtual relations with a storage-persistent definition (at least until the View is explicitly destroyed or ‘dropped’) and so may be manipulated by authorized users (other than the creator of the View) through reference to that name in relational expressions and at arbitrary times. A Materialized View is a type of View; the data as seen through the View is made storage-persistent and modified only when the Source Relations are modified.
Most RDBMS implementations explicitly maintain and track Dependencies (whether (1) between relations or (2) between groups of columns of a relation), with these Dependencies defined, denoted as symbolic abstractions, and accessible to the RDBMS. In practice, this is usually done for relations as referential integrity Constraints, or View Dependencies, but not between non-view derived relations and their source relations.
For example, the ‘Employees’ Relation (as defined above) and a ‘Departments’ Relation (consisting of Department Number DNUM, Department Name DNAME, and Department Manager's Employee Number MNUM) might be Base Relations. These two relations may be considered to be a Base Set. From the Base Set individual relations can be combined via relational operations to form one or several Derived Relations. A Derived Relation called ‘Managers’ might be defined as consisting of columns Department Manager's Employee Number MNUM, Department Manager's Name ENAME, and the Department Name DNAME of the department managed by the manager. ‘Managers’ is the result of performing a relational join of the ‘Employees’ and ‘Departments’ Base Relations, with the additional Constraint that ‘MNUM=ENUM’. ‘Managers’ is said to have a ‘dependency’ on both ‘Employees’ and ‘Departments’.
‘Managers’ might, for example, be a View. As a named expression, its definition can be stored in memory and can be reused by referencing ‘Managers’ even though the actual rows of ‘Managers’ are created only at execution time, and are based on the then-current rows in the Base Relations (‘Employees’ and ‘Departments’). Alternatively, the definition of ‘Managers’ might be an internal Derived Relation representing a sub-step to a query asking to see all ‘VicePresidents’ wherein the latter are defined as those whose employees are themselves all ‘Managers’; while ‘VicePresidents’ is displayed to the user, the interim Derived Relation of ‘Managers’ may well not be. (Currently, most RDBMS programs do not provide a way to name the Derived Relations that result from runtime query execution).
If the only relations which users of a RDBMS (or computer programs) can access are Derived Relations, then these Derived Relations, either directly or indirectly, form the linkage between the physical location and structure in the computer memory and the descriptive (as expressed, for example, by the conceptual or logical schemas) location and structure in the RDB, handled by the RDBMS. In practice, an RDBMS most often predetermines a significant portion of physical location and structure in the computer memory of Base Relations. If all operations (including access and update) that are valid for Base Relations are likewise valid for Derived Relations, the linkage attains maximum flexibility; it then permits modification of the set of relational expressions which define the set of Derived Relations in such a way as to leave the rows and columns of each of those Derived Relations unchanged, despite structural reorganization of the set of Source Relations (even when those Source Relations happen to be Base Relations) so long as the information necessary to the creation of those Derived Relations is preserved. This property is known as Data Independence and it is intended to be a key value to relational (as opposed to other) databases. It is also, however, badly limited when Base and Derived Relations are not handled in a uniform manner as, for example, when some Derived Relations cannot be updated in the same manner as Base Relations.
RDBMS programs have four fundamental functions that are used to manage all data modification operations on relations; these are respectively Insert, Delete, Update, and Retrieval. The first three of these are used independently. The Insert operation allows new data to be entered into a particular relation. The Delete operation allows existing data to be removed from a particular relation. And the Update operation changes one or more data elements within a particular relation. The fourth function, Retrieval, is used to locate, manipulate, and produce the data in the RDB and may be used either independently or in combination with one of the other three. Other processing (logical, relational, arithmetic, or transformational) may be used to further facilitate changing data, its presentation to the user, or the nature of the RDB. An RDBMS which has Data Independence will allow any of these four functions to take place without the user having to be concerned with the physical storage of the data or with the structure of the RDB. A recognized major goal for all RDBMS designers, users, and creators is increasing Data Independence.
Existing RDBMS programs allow accessing some combinations of derived data in static, report-only views, and allow updating particular combinations of physically stored data; but the current state of the art differentiates between base and derived relations, asserting, believing, or holding that the latter are inherently not updateable. Also, existing RDBMS programs are plagued by unpredictable and non-intuitive failures in updating derived data; these failures can require a ‘rollback’ which, if not performed correctly, can leave the database in an inconsistent state. In practice, the updating of derived data is generally avoided. Additionally, because of this differentiation between base and derived relations, the creation, maintenance, and merging of multiple physical databases, even when logically feasible, is often pragmatically difficult, costly, effortful, infeasible, or just deemed impossible.
Relational databases use data elements and the relationships between them to model a portion of the world. In practice, the data elements are organized at the logical level into relations, and are perceived as such by the user. (Date, An Introduction To Database Systems, 6th Edition, Addison-Wesley, 1995, Ch. 3, p. 52; Addison-Wesley; ISBN0-201-54329-X.) The RDB does not integrate the denotation, expression, and instantiation of a relation such that the model is clearly linked both to the stored tables and the data elements by means accessible to both the user(s) and the RDB or RDBMS. A relation's title (its denotation or referent) is either chosen by the designer or created by the system. Optimally, it should convey some meaning to the user in the manner of a mnemonic. It may have come from an entity-relationship modeling or CASE tool. It may consist of some concatenation of source table titles according to pre-set rules (e.g. the table combining EMPLOYEE and 401 K_PLAN_MEMBERS may be titled EMPLOYEE—401 K_PLAN_MEMBERS). But the RDB and RDBMS currently do not have a direct tie between the relation, its title or denotation, and the logical model, and the denotation is not separably manipulable according to predicate logic as a symbolic abstraction for the relation itself, or as a symbolic abstraction of the manipulation of the data elements and their combination therein. Moreover, constraints, rather than being treated equally as logical predicates are generally referred to simply as constraints, and they may have been defined as relational expressions; they have usually been separately maintained at the users discretion and as SQL “relational” expressions that are used only to preclude updates rather than enable them.
This distinction and lack of functional relationship between denotation (the title), expression (the title as name), and instantiation (the data elements comprising the stored table), prevents effective symbolic abstraction and requires all logic-based manipulation to manage all of the individual data elements, tying the RDB and RDBMS to the computer's ability to manage its physical memory in which those same data elements happen to be stored and represented.
Furthermore, current relational database management systems distinguish between base and derived relations, and base and derived data; that is, between those relations or data explicitly contained in the physically-demarcated memory groupings denoted as the relational database's ‘base tables’, from those contained or expressed by temporary (often query-driven) combinations of the base tables. These temporary combinations are known as the relational database's ‘derived tables’. (Certain derived tables are also commonly referred to in the literature as ‘views’.) This is a self-imposed handicap the field has failed to recognize, due in part to an earlier theoretical error.
This distinction limits an RDBMS's capability to update derived tables (relations or data); limits users' access to derived tables; and can create problems (in the form of difficult, memory- or processor-expensive transactions, or unintended or unpredictable results) for those RDBMS that try to access or update derived tables (some do, some just don't). This distinction also can cause a RDBMS to use extra memory in duplicating base data elements inside multiple tables. Existing methods to manage updates or access to derived tables can create potentially contradictory data sets, creating major problems for the RDBMS and potentially rendering the RDB itself unreliable.
Furthermore, distinguishing between ‘base’ and ‘derived’ tables (and therefore base and derived relations) means that no such RDBMS permits full data independence between a data expression and the memory location corresponding to its physical storage, or uses uniform semantics with all operations, including derived as well as base data expressions. An RDBMS possessing full logical data independence is one in which (1) the descriptive representation of the data in the database can be changed to accommodate additional types of data, supporting new programs that will use that data while still maintaining the existing descriptions for previously-existing programs and users; and, (2) multiple descriptive representations can be provided, each specialized for a particular group of users or programs, each without implying any need to alter existing elements of physical storage subject to the constraint that all representation changes are information preserving. The lack of full logical data independence in turn creates problems with merging relational databases, distributing a relational database over multiple locations, and handling multiple versions of a relational database (either over time or locations separated by message time), which means that users often find new versions of a relational database become non-backward-compatible with the pre-existing version, which defeats one of the principal goals of using a relational database. Furthermore, the lack of uniform semantics for both base and derived relations can cause failures to certain updates, creating extra relational database system maintenance and requiring rollback of transactions.
Few existing RDBMSs provide means to update derived relations; those that do, do so only for an arbitrarily restricted few derived relations (Date & McGoveran, “Updating Union, Intersection, and Difference Views”, Database Programming & Design, Vol. 7 No. 6, p. 46). These means for updating derived relations are very restrictive, are tied to the physical memory usage of the RDB, are inconsistent with those used for base relations, and their use often results in error messages sent to the user of the RDBMS. Users compensate for these restrictions by avoiding the use of derived relations, developing programs to provide update of specific derived relations, or through manual workarounds. For example, IBM's DB2 and Oracle's Oracle 91 RDBMS products do not permit update of any derived relations (specifically Views) when the update's SQL uses the SQL keywords ‘DISTINCT’, ‘GROUP BY’, or ‘ORDER BY’. There are many other restrictions on updating views such as those that are derived via relational aggregation and UNION. Only a subset of those views derived via join operations can be updated by Oracle; DB2 does not support join view updates at all.
No RDBMS products support general update of all non-view derived relations, though some provide partial update support of materialized views, snapshots, or replicas. And, for those which provide some support, that support is extremely restrictive. Despite the need, there are no RDBMS products providing a common and intuitive method by which all relations (base and derived) can be updated (Date & McGoveran, “How old Data Redundancy”, Database Programming & Design, Vol. 7 No. 7, p. 46, July, 1994; Date & McGoveran, “Updating Joins and Other Views”, Database Programming & Design, Vol. 7 No. 8, p. 43, August 1994). Since all RDBMS implementations distinguish between updating base and derived relations, users must learn the particular behavior of the RDBMS for each type of derived relation, and must be aware of and can easily determine whether or not a particular relation that they wish to update is a base relation or a derived relation; and this restriction further violates logical data independence and forms an impediment to physical data independence.
Additionally, treating base relations as stored tables prevents attaining a major goal of physical data independence, that of separating where and how a table is stored from manipulating the logical representation for the table's instantiation. Symbolic abstraction of the logical representation and user requests into relational predicates allows for rapid logical manipulation to be separated from the mechanics of managing the physical memory, which otherwise limit the speed and power of the RDBMS. Currently, an RDBMS at best clumsily handles its own internal representations, lacking means for symbolic abstraction of the model to which it has been designed and built, and which it uses. The lack of such abstraction being available to the RDBMS increases the RDBMS's difficulty in distinguishing between errors caused by logical inconsistencies, data errors, and memory limitations.
As no RDBMS maintains Relation Predicates for the relations or tables in its system catalog, separating out logical and data processing (e.g. for optimization purposes alone) is difficult. Although almost every RDBMS provides support for using constraints in managing and enforcing the consistency of an RDB, no RDBMS uniformly and consistently maintains constraints in its system catalog as Relation Predicates, and makes them accessible to the RDBMS or readily apparent to users. Users, who would benefit from having a uniform method by which to understand the meaning of a table when a particular constraint is applied to that table, are thus liable to misinterpret the data in a table, to access a table with a different meaning than the one intended, or to use a table in a manner inconsistent with its meaning. Each of these may lead to corruption of data when the RDB is subsequently updated, or may cause the user to make incorrect business decisions.
Although SQL uses expressions involving predicates for access and update of relations, no RDBMS provides a uniform and consistent method of accessing or updating relations, in which the semantics or meaning of that access or update is based on and expressible in relational predicates; these might be referred to respectively as an ‘Access Predicate’ and an ‘Update Predicate’. Use of such an ‘Update Predicate’ would also help ensure consistency and ease maintenance for both the RDB and RDBMS, particularly if these were both contained within the scope of, and accessible to, the RDBMS. The operations of the RDBMS would be easier to maintain, optimize, or track if there were means for classifying portions of an ‘Update Predicate’ into one or more relational expressions, each of which either (1) constrains the logical consistency or other effects of the update action, or (2) restricts the data that is to be affected by the update operation, for this classification would help determine how the RDBMS will manage the update.
The continued linkage between physical location in computer memory and descriptive location in the database by the database system, such as found in Iwata, K. et. U.S. Pat. No. 4,514,826, and Matsuda, S. et. al. U.S. Pat. No. 5,247,665, is an approach that, because it is based in whole or in part on information which the RDBMS does not explicitly have access to (an implied structure created and maintained by the administrators, the terms of which are either inaccessible or meaningless to the RDBMS), prevents any RDBMS from attaining either physical data independence, in which the descriptive representation of the data in the database is freed from machine-specific and non-database terms and processes, or logical data independence.
The limited perception that uniqueness properties can be determined for a database was explicitly limited to a 1-tuple condition in Leung, T. et. al. U.S. Pat. No. 5,615,361, because of the separation between a binding explicitly determinable from the database system and that which is actually present in the database's structure. This prevents the user from making changes to the structure, organization, or contents of the database except through indirect database system administration, hinders the database's actual capability to effectively model the information contained within it, and limits the capacity to manage dependent relations or views.
Much of the problem encountered by most RDBMS in handling large databases has been the presence of ‘null’ elements and columns required by any method that does not effectively manage the data to limit unnecessary duplication, due to the inherent limitations of an implicit and non-represented structure. The opportunity for improving database system performance identified in Leung, T. et. al. U.S. Pat. No. 5,590,324 by exploiting column nullability is just a faint harbinger of the improved administrability, performance optimization, and prevention of update failures that can be obtained when logical data independence can be guaranteed. In many cases, support for logical data independence mitigates or removes the need to support column nullability, and therefore lessens and may even eliminate the need for special optimization techniques such as those identified therein when column nullability is supported by the database system.
The apparatus-specific approach in Huber, V. U.S. Pat. No. 4,918,593 for maintaining dependence is explicitly limited to certain derived columns of base tables. It makes neither provision for derived tables nor discusses any generalizable method independent of the specific data dictionary means for maintaining dependence between tables. The present invention makes use of dependence between tables, and need not be maintained via any particular data dictionary means. Huber makes no claim pertaining either to data independence or to a general method for updating relations.
The value of separating logical and physical data structures is evinced in Kingberg, D. et. al. U.S. Pat. No. 5,734,887, which fails in its approach to free itself of the need for explicit tables, for both mapping the logical to physical combinations and the explicit joins between logical entity types and the physical tables and columns under them. It further fails to make the means for such mapping or the representation explicitly accessible to the RDBMS. Kingberg requires the use of a ‘logical data interface’ for access to base relations from application programs without explicitly referencing those relations; the approach does not provide a method for updating derived relations.
Only by using an extra stage of providing a completely separate and independent object model does Kawai, K. U.S. Pat. No. 5,717,924 manage to provide a link between a relational database schema and an object model for the information contained within the database schema. Additionally, the stages of managing and administering any modifications to the database schema are not explicitly described in a fashion that uses the logical structure of the schema, and the constraints and processes contained by the relational database system, to manage the modifications directly.
A different approach to the concept of managing relationships amongst base tables, one that consumes additional memory resources and requires additional programming and data entry, is specified in Olson, M. et. al. U.S. Pat. No. 5,566,333. Olson requires a distinct linker table, does not modify relational database or its contained data, and does not address the problem of updates.
Pitt, J. et. al., U.S. Pat. No. 5,493,671, explicitly duplicates the entirety of any merged data, and deals solely with data type differences by direct conversion according to preset means rather than any methodology contained within an RDBMS.
The desirability of allowing logical access, independent of knowledge of the structure of the physical database, is addressed in Maloney, C. et. al. U.S. Pat. No. 5,701,453. Maloney is limited to table pairings, and the use of explicitly overlapping fields, rather than being generalizable either to logically possible combinations or to any representation explicitly available to the RDBMS.
The value of dynamically displaying and updating data is mentioned in Vanderdrift, R. U.S. Pat. No. 5,455,945; however, in that method the accessible data is limited to the primary or base records, is not derived from any logical representation of the database, and does not use the logical constraints and representations of the database but rather depends upon the creation of explicit management records and memory pointers, and tracing them as necessary, thereby increasing the complexity and memory requirements for the system rather than lessening them through symbolic abstraction. Moreover, the method therein does not provide a method which is consistent over data, relations, and constraints; instead, it distinguishes between a ‘management record’, a function, a filter, and a ‘DD’ (display and organization rules). And the method neither makes the method accessible within and to the RDBMS, nor uniform across data types, nor separate manipulating the data, functions, and records from preliminarily manipulating the logic to determine whether and how the changes are feasible.
The method identified in Horn, G. et. al. U.S. Pat. No. 5,226,158, may assist in determining the validity of a particular constraint; however, it does nothing with such validity or the constraint itself. Nor does the method therein allow for generalization to means for consistently managing base tables, derived tables, and constraints, as well as any particular constraint.
Review of Certain RDBMS Mechanisms
There are many methods in the art by which RDB updates have been implemented. Relational updates are set transformations, as contrasted with row or record modifications. This fact implies that updates are atomic, i.e., an unrecoverable error of any type requires that the entire update be aborted. Typically, updates are applied in the context of a transaction so that atomicity is insured by a transaction manager or some equivalent software component. The usual method by which either relational update or transaction atomicity is insured is to make all updates to a copy of the data, leaving a copy (known as a ‘before image’) unmodified. If an error occurs, the unfinished modifications can be discarded and the RDB restored to its original condition using the before image. If the update completes successfully, the modified copy (known as the ‘after image’) can be used to replace the before image. This technique is often used in a nested fashion so that each update within a transaction has a corresponding before image and after image, as does the entire transaction. Regardless of the particulars of transaction management, the illusion is given that the entire database is transformed from the publicly available version of the data (before image) through a sequence of private after images (each generally hidden from other users) until the transaction completes. If it is successful, the final after image produced becomes the publicly available version of the data. In practice, there may not be a physical after image or before image, but only the appearance of one. Many variations on the method of transaction management exist, but are functionally equivalent to the one described here. See Date, An Introduction to Database Systems, supra, for a more detailed explanation. The after images of tables modified by a transaction are often checked prior to completing the transaction to determine consistency. Such constraint checks may require reading other tables that have not been modified (i.e., have no after image) within the context of the particular transaction.
Methods for processing a request, whether a data retrieval or a data modification, are generally referred to by the term ‘query processing’. The literature pertaining to query processing in an RDBMS is extensive and includes subtopics such as query parsing, internal query representation, optimization, and physical data access methods. A common internal query representation technique is known as a query tree, in which data access methods form the leaves of the tree and successive nodes represent operations on the (possibly intermediate) data. Operations are typically either unary or binary, this being sufficient to represent all relational operators. Every relational request and every predicate formula can be represented by such a query tree as can the definition of every relational view, since a relational view is defined as a named retrieval operation on one or more relations.
A common and well-known technique for processing a retrieval involving a view is to combine the query tree representing the retrieval with the query tree that represents the view definition. In order to use the technique, the RDBMS must maintain dependency information in its System Catalog—that is, information which relates the view to the relations on which its definition depends. Because a view may be defined in terms of relational operations on other views as well as base tables, this dependency information is most naturally stored in the form of a ‘dependency tree’ with leaf nodes representing base tables and nodes above them representing derived tables. Numerous data structures have been used for storing dependency information, many of which are equivalent to dependency trees in the sense that they are capable of storing precisely the same information but differ in the algorithms used to process that information. Some may contain information in addition to dependency information. Dependency trees are often used to process requests involving views, including modification requests. Most implementations provide only limited support for view modification requests. Furthermore, most implementations use dependency information to propagate modification requests as if they pertained to individual rows of the view, or to substitute the defining retrieval in place of each view reference so that the request ultimately attempts to modify only base relations. This well-known direct substitution technique, and its equivalent methods, result in valid modifications only for certain types of views and such RDBMS implementations typically restrict view updates to those for which it is known to be valid.
The meanings of objects in an RDB (domains, columns, rows, base relations, and derived relations) in an RDBMS are most frequently maintained through methods that are distinct from both the maintenance of the RDB (such as the creation of relations and views) and the processing of requests. For example, object naming conventions, separate data dictionaries, “help” systems, and the like may exist that permit the capturing of object definitions, each of which requires manual steps to create and maintain that are distinct from those steps used to create or modify the object. Such definitions are typically human readable, are not used by the RDBMS in processing requests, and over time diverge from an accurate representation of their corresponding operational definitions. All too often, RDB creators and users rely upon object naming to convey meaning, a practice that is unreliable, inefficient, and cannot be used by the Query Language Processing Engine.
Brief Summary of Current Literature in the Field
Research into the problem of updating derived tables has been limited because of a theoretical misapprehension. One of the theoreticians, in 1988, claimed to have proven that updating views was potentially impossible, or at least that any method that claimed to work for all views was subject to an unpredictable failure. Buff (“Why Codd's Rule No. 6 Must Be Reformulated,” ACM SIGMOD Record 17:4, 1988) stated a theoretical proof that a general algorithm for deciding whether or not a view is updateable is undecidable within the predicate logic. This paper has been the dominant and most serious barrier to investigation of the problem of a general algorithm for updating views, let alone arbitrary relations. However, as Buff does not provide a proof of impossibility within the relational algebra, nor show that the relational algebra and the predicate logic are equivalent, he therefore does not address the embodiment of the invention of this application. Also, Buff never considered those limited implementations of the relational algebra which are necessary to reduce the relational model to practice on physical computers; instead, his paper considers solely the pure mathematics for abstract, theoretical algorithms.
One of the co-inventors was previously so persuaded of the non-updateability of views by E. F. Codd (The Relational Model for Database Management Version 2, Addison-Wesley, 1990), in which the author referenced his unpublished algorithm (View Updatability in Relational Databases: Algorithm VU-1, unpublished, 1987) for determining whether or not a view might be theoretically updateable. The referenced algorithm was not, and has not been, reduced to practice, and did not provide any method by which arbitrary views could be updated. Furthermore, Codd does not teach that all views are theoretically updateable, nor does he provide a method by which arbitrary or even specific view updates are to be achieved. Also, the view updates which Codd does describe involve row operations and do not preserve the set semantics of relational operations.
Dayal and Bernstein (“On the Correct Translation of Update Operations on Relational Views”, ACM TODS 7:3, 1982) provided a formal treatment of view updating rules for restriction, projection, and join views only. They did not provide a general method for updating views or arbitrary relations.
Keller (“Algorithms for Translating View Updates to Database Updates for Views Involving Selections, Projections, and Joins”, Proc. 4th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1985) presented criteria for algorithms that would implement a limited class of view updates, and multiple algorithms which satisfy those criteria. A single, general purpose method was not presented (or suggested as even possible), and the semantics of the update operation are not propagated to the base relations.
Nathan Goodman (“View Update is Practical”, InfoDB Vol. 5, No. 2, 1990) proposed that the user, in defining a view, be provided with a means for also specifying view-specific methods of updating. No attempt was made to provide a method by which arbitrary views can be updated; the problem of updating derived relations other than views is not discussed. Goodman did refer to well-known methods of updating a few particular types of views using type-specific methods which he recognized as not generalizable. He also identified types of view which he contended required user-defined and type-specific methods for updating, denying the possibility of a generalized algorithm.
Since the Nathan Goodman article, most of the literature on “view updating methods” refers to the propagation of updates from one or more source relations to a physically stored derived relation, and how to most efficiently manage physical aspects of this operation. This has generally been referred to as the problem of updating or managing ‘materialized views’. It does not address the problem of updating a derived relation and then propagating the appropriate changes to the source relations; therefore, this body of literature does not bear upon this application.
The ANSI (American National Standards Institute) has published a standard for the syntax and some semantics of the SQL query language; this query language is the one which almost all RDBMS products support. The current (and forthcoming) version of the ANSI SQL standard states explicitly that expressions involving updates of views are not legal expressions in the language except in a limited number of specific cases. The semantics described for updating those limited types of views are, in general, inconsistent with the semantics of updating base relations, resulting in a surprising and non-intuitive behavior from the perspective of users. RDBMS products that support SQL have been required by market pressure to support the syntax and semantics defined in the ANSI SQL standard; the ANSI SQL standard has been and continues to be a barrier to developing (let alone implementing) approaches for general view updating.
C. J. Date (An Introduction to Database Systems, 7th Edition, Addison-Wesley, 2000, Ch. 9, p. 297-325) describes separate updating procedures for each of certain types of views, but fails to introduce a general approach to updating all relations, whether base or derived; the possibility of updating certain types of views is explicitly denied. Also, Date provides separate procedures for various types of updates (for example, insert, delete, or modify). The limits on view updatability imposed by the ANSI SQL standard mentioned above are discussed, which may further have seemed to validate a mistaken belief in the non-updateability of views.
There is a need for maintaining and tracking, preferably by a symbolic abstraction such as by means of relation predicates, the relationships or dependencies among a derived relation and its source relations, so when a source relation is changed the derived relation is also updated. Also needed is a means to derive a relation predicate for a derived relation from the combination of relation predicates for its source relations, predicates for constraints on those relations, and the predicates for the relational operations on source relations used to define the derived relation; once derived, it would be further desirable to make the same accessible to the RDBMS and its programmers or even users. Also desirable would be means to decompose a relational expression involving a derived relation into a logical combination of one or more relational expressions, each of which is either a relation predicate of a source relation or a predicate corresponding to a constraint on one or more source relations. Such means should permit successive decomposition of a relational expression, so when the result of one step of decomposition generates one or more relational expressions that themselves involve a derived relation, each of these is further successively decomposed, leading finally to a logical statement whose every element is either a relation predicate of a base relation or a predicate corresponding to a constraint on one or more base relations.
What is needed is a common and uniform method that can (i) provide uniform symbolic abstraction of data, relations, and constraints comprising an RDB managed by an RDBMS, (ii) allow both users and the RDB and RDBMS to use the most effective of either logical manipulation of the symbolic abstractions or manipulation of the same symbolic abstractions' instantiation to reason with and manage data elements and relations, and (iii) provide access to or an update on an arbitrary relational expression as a symbolic abstraction and thence on the physically-embodied data and relations for which the symbolic abstraction stands, whether the data and relations referenced by that expression are views, other types of derived relations or base relations.