1. Field of the Invention
The present invention relates to a technique for logically and physically clustering tuples of data in a relational database by partitioning (declustering) a set of relations into smaller local relations and reclustering the local relations into relational constructs referred to herein as domains. By so clustering the tuples of data into domains including data related to one or more objects, data from the database may be more efficiently cached, locked and copied during access by user application programs.
2. Description of the Prior Art
Before database management systems were developed, user application programs managed data stored in application-specific files. Because data common to different applications was kept in separate files, data consistency between applications was difficult to ensure. Database management systems were thus developed to simplify the job of developing applications and ensuring data consistency. In particular, database management systems have been used for arbitrating the sharing of files among users, ensuring data integrity and recovery when problems occur, distributing data in a network, managing searches through large amounts of data, and other similar functions. However, since early database management systems were used primarily by transaction-oriented users, such systems were designed to support alphanumeric data formatted into record-oriented files. These early systems were thus limited by the direct-mapping characteristics of record-oriented databases.
However, different users of traditional database management systems have recently developed very different needs for their systems. For example, systems for electronic design automation, engineering information management, engineering test and measurement, telecommunications, office automation and hardware and software design have been developed where it is desired to model the types of information generated by such systems through all phases of product design. Traditional relational database management systems were not well suited to such uses. Computer language primitives, such as a word or symbol, of the data model could be used to represent real-world objects and the relationships among them. In other words, data objects were represented as record types and their attributes given in fields within a record. The relationships were then modeled by placing key values in separate but related data records. These keys could then be used at runtime to join the separate data records so as to recover the object relationships. Thus, rather than seeing the world as composed of records, these "object-oriented" database management systems viewed the world as made up of "objects," i.e., entities defined by their functional characteristics.
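The key-based modeling of relationships described in this paragraph can be illustrated with a minimal Python sketch. The record layouts and the names `parts`, `connections` and `join_on_key` are hypothetical and chosen only for illustration; they are not taken from any particular database system.

```python
# Hypothetical flat records: each data object is a record with fields.
parts = [
    {"part_id": 1, "name": "adder"},
    {"part_id": 2, "name": "register"},
]

# Relationships are modeled by placing key values in separate records.
connections = [
    {"from_id": 1, "to_id": 2, "signal": "sum"},
]


def join_on_key(left, right, left_key, right_key):
    """Nested-loop join: pair up records whose key values match."""
    return [
        {**a, **b}
        for a in left
        for b in right
        if a[left_key] == b[right_key]
    ]


# The relationship is recovered at runtime by joining on the key.
linked = join_on_key(connections, parts, "to_id", "part_id")
```

Each element of `linked` merges a connection record with the part record it points to, which is the runtime join step that such a record-oriented representation requires.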
Unfortunately, these "object-oriented" database management systems were also limited by their direct mapping of the data objects into particular record types. Tuples, or records, in the relational database model were typically represented as flat collections of fields including, for example, the name of the data object, its value, and its connections. The collection of fields could not handle structured attributes, such as a component hierarchy, when several different types of data objects and interconnections were to be modeled in the database. The problem with such mapping of real-world relationships of data objects into tuples is that the relational model uses a single primitive for handling both an object and its relationship with other objects. This approach has been shown not to work in all cases. Accordingly, such systems have required a set of conventions to be learned by the programmers who use the database. Also, such systems require the data elements to be joined in the program memory to reconstruct the relationships of the data objects. This causes the relational database to slow down such that it is generally too slow for interactive applications.
Object-oriented database management models have been designed to overcome these problems with traditional record-oriented database management systems. The object-oriented model is based on defining and understanding the relationships between objects. In such systems, the objects pass data back and forth, and to define the relationship, the nature of the data, rather than the actual data, is examined to understand how an object uses it. Preferably, such systems maintain the integrity of object relationships declared by the user. A general description of the function of object-oriented databases can be found in an article by Atwood entitled "The Case For Object-Oriented Databases" IEEE Spectrum, February, 1991, pp. 44-47. The difference between such an object-oriented database system and a traditional relational database system is depicted in FIG. 1.
As shown in FIG. 1, records 100 are used in a relational database system to define an object in a computer's local memory 102. The separate records 100 typically must be translated and linked (104) with a relational database 106 having records 108 stored therein. By contrast, in an object-oriented database system, the data object relationships of the records 110 are preserved as they appear in local memory 112 so that addresses need only be mechanically adjusted, or swizzled (114), to provide full access to the database representations 116 and 118 in the object-oriented database 120. Object-oriented database systems use swizzling (translation) to reduce the speed penalty associated with direct mapping of main memory pointers to secondary memory pointers in the database. However, such object-oriented databases have not yet become widely accepted.
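The swizzling step can be sketched as follows, assuming for illustration that stored objects refer to one another by integer identifiers standing in for secondary memory pointers. The `Node` class and the `swizzle` function are hypothetical names and do not describe any particular object-oriented database.

```python
class Node:
    """Hypothetical stored object whose children are database identifiers."""

    def __init__(self, oid, name, child_oids):
        self.oid = oid
        self.name = name
        self.children = list(child_oids)  # identifiers until swizzled


def swizzle(nodes):
    """Replace identifier references with direct in-memory references."""
    by_oid = {n.oid: n for n in nodes}
    for n in nodes:
        n.children = [by_oid[c] for c in n.children]
    return by_oid


stored = [
    Node(10, "board", [11, 12]),
    Node(11, "chip", []),
    Node(12, "socket", []),
]
objects = swizzle(stored)

# After swizzling, relationships are followed without any lookup or join.
names = [child.name for child in objects[10].children]
```

In this sketch the object structure is left unchanged and only the references are adjusted, in contrast to the translate-and-link step required in the relational case.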
An object-oriented database management system using beneficial features of both relational and object-oriented database management systems has recently been disclosed by Wilkinson et al. in an article entitled "The Iris Architecture and Implementation", IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990, the contents of which are hereby incorporated by reference. As described therein, the Iris system is based on an object and function model where attribute values, relationships and behavior of objects are modeled by functions. In other words, the system architecture efficiently supports the evaluation of functional expressions so that a database management system can be provided which is powerful enough to support the definition of functions and procedures that implement the semantics of the data model. For example, retrievals and updates to the database are written as functional expressions. Users may define new functions which may be implemented as stored tables or derived as computations which may be expressed either as system functional expressions or as functions in a general-purpose programming language such as C. The Iris system allows new operations to be easily prototyped because data model operations can be prototyped as ordinary database functions.
The Iris data model contains three important constructs: objects, types and database functions. Objects in Iris represent entities and concepts input by the user application program. Literal objects such as integers, strings and lists are self-identifying. Surrogate objects, on the other hand, are represented by a system-generated, unique object identifier (OID). Surrogate objects may include system objects such as types and database functions as well as user objects such as individuals and data associated with those individuals. Types have unique names and are used to categorize objects into sets that are capable of participating in a specific set of database functions. The objects are used as arguments to the database functions and may be returned as the results of such database functions.
In Iris, attributes of objects, relationships among objects, and computations on objects may be expressed in terms of database functions. Iris database functions are defined for different types and may have many values. A function is compiled into an interpretable runtree which may be recursive. A type can be characterized by the collection of functions defined on it. In Iris, a new function is declared by specifying its name, the types of its argument and result parameters and, optionally, names for the arguments and results. Generally, the properties of objects are modified in the Iris system by changing the values of functions. Functions with values stored as tables can always be updated. However, functions whose values are computed may or may not be updatable. More details regarding the declaration and implementation of such functions can be found in the aforementioned article by Wilkinson et al.
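The distinction between functions stored as tables and functions derived by computation can be illustrated with a minimal Python sketch. The classes `StoredFunction` and `DerivedFunction`, the function names `reports` and `headcount`, and the integer object identifiers are all hypothetical; the sketch does not reproduce the actual Iris interface, for which the article by Wilkinson et al. should be consulted.

```python
class StoredFunction:
    """Function whose values are held in a table, so it can be updated."""

    def __init__(self, name):
        self.name = name
        self.table = {}  # argument -> set of results (may be many-valued)

    def __call__(self, arg):
        return set(self.table.get(arg, ()))

    def add(self, arg, value):
        self.table.setdefault(arg, set()).add(value)


class DerivedFunction:
    """Function whose values are computed from other functions."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute

    def __call__(self, arg):
        return self.compute(arg)


# Surrogate objects are represented here by integer identifiers (OIDs).
manager, alice, bob = 1, 2, 3

reports = StoredFunction("reports")
reports.add(manager, alice)
reports.add(manager, bob)

# A derived function is evaluated on demand and is not directly updated here.
headcount = DerivedFunction("headcount", lambda oid: len(reports(oid)))
```

Changing a property of an object in this sketch means changing the value of a stored function, for example by calling `reports.add`, after which `headcount` reflects the change the next time it is evaluated.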
The Iris database management system is unique in that although it is object-oriented, the database is driven by an extended relational database engine as shown in FIG. 2. In other words, all object-oriented inquiries to Iris are changed to relational constructs and stored as tables. For example, the relational operators such as the trees describing the relationships between the data objects are converted to relational commands so that the operator of Iris need only call a function to perform the desired manipulation of the data object. These functions may be mapped to a table in the relational database. FIG. 2 illustrates the kernel architecture for an Iris system which provides for implementation of such functions.
As shown in FIG. 2, the Iris kernel is organized as a collection of modules which are accessed by the user through a client application program 200. The top-level module, the Executive 202 (EX), implements the kernel entry points and manages the client-kernel interaction. For each request, the Executive 202 calls the Query Translator 204 (QT) to produce a relational algebra tree for the request. The Executive 202 then passes this tree to the Query Interpreter 206 (QI) which produces the resultant expression. The Object Manager 208 (OM) is a set of system procedures and functions which are implemented as separate functions written in some general-purpose programming language and compiled outside of Iris. On the other hand, the Cache Manager 210 (CM) is an intermediate layer between the Iris kernel and the Storage Manager 212 (SM) of the database. Cache Manager 210 provides prefetching and cache management for data retrieval and data updates between the kernel and the Storage Manager 212 of the database. The Storage Manager 212 provides data sharing, transaction management and access to stored tables.
A copy of the Iris kernel is provided to each user accessing the database. The kernel may execute as a server in a separate process and communicate with the user's application programs via messages. On the other hand, the user and the kernel may be tightly coupled in the same process and communicate via subroutine calls. In either case, the configuration is transparent to the source code of the user. Storage Manager 212 always executes in the same address space as the kernel, while multiple instances of the Storage Manager 212 use a shared memory buffer for caching data, concurrency control and transaction logging. More details regarding the other elements of the Iris kernel can be found in the aforementioned article by Wilkinson et al.
As noted above, one feature of the Iris system is that it caches data between the Iris kernel and Storage Manager 212 using Cache Manager 210. Cache Manager 210 caches tables, not functions, and maintains a tuple cache which may cache tuples from individual tables. A table of the database may have at most one tuple cache, where the tuple cache is accessed via a column of the table and that column is either uniquely-valued or many-valued. If a column is many-valued, the tuple cache ensures that whenever a given value of that column occurs in the cache all tuples of the table with the same column value will also occur in the cache. This guarantees that when a cache hit on a many-valued column occurs, the scan can be entirely satisfied from the cache without having to invoke the Storage Manager 212 to keep information in the cache consistent therewith.
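The guarantee described for a many-valued column can be sketched as follows. The `TupleCache` class, its `scan` method and the in-memory list standing in for a table held by the Storage Manager are assumptions made for illustration and are not the Iris implementation.

```python
class TupleCache:
    """Caches tuples of one table, keyed on a single column.

    Invariant: if a column value is present in the cache, then every
    tuple of the table with that value is present in the cache.
    """

    def __init__(self, table, key_column):
        self.table = table  # stand-in for the stored table
        self.key_column = key_column
        self.cache = {}  # column value -> all tuples with that value
        self.misses = 0  # number of times the stored table was read

    def scan(self, value):
        if value not in self.cache:
            self.misses += 1
            # On a miss, load ALL matching tuples to keep the invariant.
            self.cache[value] = [
                t for t in self.table if t[self.key_column] == value
            ]
        return self.cache[value]


table = [
    {"owner": "a", "item": "x"},
    {"owner": "a", "item": "y"},
    {"owner": "b", "item": "z"},
]
cache = TupleCache(table, "owner")
first = cache.scan("a")   # miss: reads the stored table
second = cache.scan("a")  # hit: satisfied entirely from the cache
```

Because all tuples for a value are loaded together, a later hit on that value never needs to consult the stored table to find further matching tuples.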
The performance of the Iris system is very dependent on its cache. Unfortunately, in the Iris system the input tuples are assigned to tables which are randomly assigned to pages. Accordingly, data clustering of the tuples of data onto the same page cannot be controlled. Efficient caching is thus not possible. The retrieval performance from the cache could be significantly improved if Iris cached function values rather than tables since a function call could be directly evaluated without compiling it into a relational algebra tree. However, updates to stored tables pose a challenge since a single tuple of data in a stored table may contain values from many functions, such as when functions are horizontally clustered together in one table. Thus, the individual function caches would need to be located and updated in an efficient manner. For example, since certain system functions are frequently accessed together, most of the functions for a particular type of object may be horizontally clustered on the same table and the table cached when a user application program requests the stored functions.
Generally, database systems provide caching to allow user application programs to access data without having to access secondary storage. However, with the advent of distributed (client-server based) systems, the problems associated with caching data have compounded in that, for performance reasons, it is desirable that users keep their own data cache. Such caches need to be synchronized with the data in the underlying database, which generally requires the use of locking strategies to allow users to update their caches when appropriate. Prior attempts to solve such problems have been based largely on associating locks with physical storage constructs. Particularly, in the case of relational database systems, locks have been associated with tables which store the extent of a relation, with pages in a table, or with individual tuples in a table. However, each of these approaches has problems in that the granularity of the lock is either too high or too low or it is difficult to know the extent of the data that is actually locked. Also, locks are typically transparent to a user application program, making it difficult to control the locks. Accordingly, it is desirable to develop a caching system for object-oriented data in the database such that the data can be cached efficiently without excess data locks and with the desired granularity of the lock.
It is further desirable that a database caching system be developed which prevents deadlock. Deadlock occurs when two or more user application programs each need to access data that is locked by another, so that none of them can proceed. When deadlocked, the transactions must be aborted and restarted. Typical distributed applications in the area of engineering information management and discrete manufacturing and the like interact with components that are not under control of a transaction manager (e.g., a user editing a diagram), making it difficult to restart transactions. This poses an important problem with regard to granularity of locks, i.e., the unit of data that is to be locked. In particular, if the granularity is too high, concurrent access is reduced and the probability that a transaction is aborted increases because of the higher probability of conflicting locks. On the other hand, when the granularity is too low, the overhead of locking becomes prohibitive. In addition, such applications typically require long transactions, something for which a typical transaction manager is not very well suited. Moreover, the capability of restarting a transaction depends on being able to capture all the side-effects a transaction can have, which becomes increasingly difficult if the transactions are of longer duration and/or involve interaction with parts of the system not under control of the transaction manager.
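The granularity trade-off described in this paragraph can be illustrated with a small Python sketch that associates locks with tables, pages or individual tuples. The page size, the table name `parts` and the helper functions are hypothetical values and names chosen only for illustration.

```python
PAGE_SIZE = 4  # hypothetical number of tuples per page


def lock_unit(table, tuple_index, granularity):
    """Return the unit that must be locked to access one tuple."""
    if granularity == "table":
        return (table,)
    if granularity == "page":
        return (table, tuple_index // PAGE_SIZE)
    if granularity == "tuple":
        return (table, tuple_index)
    raise ValueError(f"unknown granularity: {granularity}")


def conflicts(access_a, access_b, granularity):
    """Two accesses conflict if they need the same lock unit."""
    return lock_unit(*access_a, granularity) == lock_unit(*access_b, granularity)


def locks_needed(accesses, granularity):
    """Count the distinct locks a transaction must acquire."""
    return len({lock_unit(t, i, granularity) for t, i in accesses})


# Two users touch different tuples of the same table.
user_a = ("parts", 1)
user_b = ("parts", 6)

# One long transaction touches eight tuples.
long_txn = [("parts", i) for i in range(8)]
```

With these assumed values, the two users conflict only under table-level locking, while the long transaction needs one lock at table granularity, two at page granularity and eight at tuple granularity, which reflects the tension between concurrency and locking overhead.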
Accordingly, it is desired to develop a technique whereby caching in prior art database management systems may be improved so as to increase concurrency. It is also desirable that functions be applied to cached data such that, for example, a whole relational tree of tuples can be copied. In general, it is desired that a system be developed whereby database information may be clustered to optimize common access patterns so as to improve database performance. However, since clustering in relational database systems is typically done on a per relation basis, clustering of data has not been possible since the different relations are stored in different tables. This problem should also be overcome so that related data may be clustered and used by the user application program. Moreover, it is also preferred that such a technique be extended to allow versioning, which is control over multiple versions of product design data, and to provide configuration control, which is the management of the resources so that any version of the data can be recalled when needed. Preferably, a technique can also be developed for providing configuration management for versioning sets of objects. The present invention has been designed to meet these needs.