The present invention relates generally to information management and, more particularly, efficiently storing and updating property data in a relational database.
Speaking broadly, a property-based storage system is one that allows arbitrary properties to be attached to a document. Properties are named attribute/value pairs. Documents may be any kind of electronic entity, such as files, application objects, etc. Perhaps the most straightforward way to implement a property-based storage system is to store all properties for all documents as separate entities. For example, each property might be stored in its own row in a relational database table. (This approach has sometimes been termed xe2x80x9cunstructured storage,xe2x80x9d or a xe2x80x9cvertical schema,xe2x80x9d in the art.) One advantage of this approach is simplicity. However, this approach may suffer from serious performance penalties in some common workloads.
Another approach is to store added properties separately from an initial set of properties. An external binding interface (such as the Typelib interface in OLE Automation(trademark) from Microsoft Corp.) can be used to associate arbitrary attributes with individual object instances. In this approach, the initial set of properties can be stored in some optimized way, however, accesses to added properties suffer from performance penalties since their storage is not optimized. A complete discussion of using properties for uniform interaction in a document system can be found in, Paul Dourish, W. Keith Edwards, Anthony LaMarca and Michael Salisbury, xe2x80x9cUsing Properties for Uniform Interaction in the Presto Document System,xe2x80x9d Proceedings of the 12th ACM Symposium on User Interface Software and Technology, ACM Press, New York, November 1999, pp. 55-64.
Object Persistence Systems
An object persistence system is a means by which a software developer can cause objects created by an application to remain accessible after the application terminates. One can distinguish between object persistence systems and systems that simply provide object-oriented interfaces to records in existing databases. Examples of the latter include DBTools.h++(trademark) from Rogue Wave Software. Object persistence can be implemented in at least two different ways. First, persistence can be achieved by storing the application objects in an object-oriented database. One advantage of using an object-oriented database maybe that the amount of translation work is small so that the application objects are usually stored in some fairly straightforward manner, e.g., by serialization or virtual memory mapping.
Second, persistence may also be achieved by using object mapping, a layer of software that maps the developer""s objects into records in an underlying, non-object-oriented database. While the amount of translation work is typically higher than that imposed by an object-oriented database, the object mapping approach can take advantage of the scalability and/or feature-richness of the (typically more mature) non-object-oriented database systems.
Object persistence implies that the developer (at least potentially) has control over the object/database mapping, i.e., how the data is broken down into records within the underlying database. This mapping is a fundamental intellectual problem underlying object mapping systems.
Because of the prevalence of object-oriented programming, many object mapping systems have been developed. These systems allow the user to specify an object model, from which a database schema (typically relational) is generated. The user is usually given the opportunity to change this schema to reflect their needs (e.g., using denormalization to improve query performance). Examples of commercial object mapping systems include: Java Blend(trademark) from Sun Microsystems, JDX(trademark) from Software Tree, and Extreme Object Persistence Service(trademark) from Secant Technologies. However, such systems fix the mapping when the database is built; unlike the present invention, additional properties cannot be (easily) added later.
Repository Systems
A repository system stores application metadata as persistent objects. For example, the repository might expose the schema of a bank""s transaction-processing database in a manner that is understood by the bank""s software development tools, thereby eliminating some manual work on the part of the bank""s application programmers. Repository systems typically implement versioning because they support processes such as software engineering and database schema design that undergo many stages of evolution. Since they must support many different classes as well as versions of these classes, they must address the problem of storing large databases consisting of many, relatively small collections of objects in which the objects in each collection have properties different from those in other collections. This problem resembles (in scale) the problem of managing objects with arbitrary properties. Therefore, systems such as Microsoft Repository(trademark) from Microsoft Corp. provide mechanisms by which the user can optimize the storage of sets of properties in tables. However, unlike the present invention, Microsoft Repository stores all of a single class""s properties in a single table (and then combines multiple classes into the same table if the user believes they will be used together).
The ability to set arbitrary properties on documents has been supported in document management systems. The storage of properties as individual rows in a relational database system has also been done before. Also, the storage of partially structured data in tables whose rows contain columns corresponding to property values, and partly in tables whose rows contain individual property values is known. There is a need for a flexible and dynamic system for mapping from a document management system to a relational database.
A system, method and article of manufacture are provided for fast mapping from a document management system to a relational database. A database is provided having a plurality of tables relating to a plurality of property groups. Each property group in the database has one or more properties associated therewith. In applying a property group to a document, the application or programmer applying the group is entering into a contract with the document management system that all of these properties will exist on the document and that the types of the values will be at least (in the object-oriented sense of a type relationship) the types mentioned. In the database, each property group is mapped to one or more tables which store the values of properties in that group. When a document having one or more properties is provided, the properties of the document are then mapped to those tables in the database that include one or more property groups that has been determined to apply to the document.
In an aspect of the present invention, each property group may have a set of columns in a table in which the respective properties are stored. Possible states of where a property group is mapped to one or more tables include: the property group residing by itself in a single table with one or more columns for each property, the property groups being split across multiple tables with one or more columns in each table, or a property group residing in a single table shared by one or more other property groups.
In an embodiment of the present invention, when the document management system receives an additional property group (having a set of properties associated therewith) to be added to the database, a determination may be made as to whether the additional property group has any properties in common with the preexisting property groups of the database. If it is determined that the additional property group has no properties in common with the preexisting property groups of the database, then a new table maybe created in the database relating to the additional property group.
If, on the other hand, it is determined that the additional property group has at least one property in common with at least one of the preexisting property groups of the database, then one or more of the preexisting tables in the database may be modified to accommodate the additional property group. In one aspect of the present invention, the modification can involve adding columns for the additional property group to each preexisting table that is related to a property group in the database having at least one common property with the new property group. If the additional property group has been mapped to more than one table in the database because of sharing common properties with two or more preexisting property groups, then all of the modified tables for the additional property group may then be merged into a single table in the database.
In one embodiment of the present invention, partial loading can be performed upon receiving a query for retrieving a document having one or more particular properties associated therewith. A determination may then be made as to which of the property groups includes the one or more queried properties in their set of properties. All of the properties of the document that are part of the property groups determined to include the one or more queried properties in their set of properties may then be retrieved from the database. If the one or more properties of the query are determined not to belong to any of the property groups of the database, then all of the properties of the document that are not mapped to any property group of the database may be retrieved from the database.