1. Field of the Invention
Providing a method to facilitate system integration and application/solution development for heterogeneous information systems is valuable. It is also valuable to have a re-usable tool to generate application-specific programming interfaces (APIs) and utilities for loading and accessing heterogeneous information.
This invention relates to an improved method of handling heterogeneous information.
Except for limited cases, it is almost impossible to design a generic database that is suitable for all digital library applications. Thus, a replicable digital library solution would not be able to offer a generic "library", and specific data loading and access software has to be developed for/by each customer.
This invention is directed to a re-usable tool which generates application-specific software for each digital library application. This should significantly reduce costs.
2. Description of Related Art
System integration and application development are major undertakings for building heterogeneous information systems such as digital libraries. A digital library application typically handles a large amount of both structured information (e.g., bibliographic data, catalog data, structured documents, business data) and unstructured information (e.g., image, text, audio, video). To leverage off-the-shelf technologies, each form of data is usually managed by a separate, specialized resource manager. For example, a database management system (DBMS), such as DB2™, may be used to manage structured data; an object repository system, such as ADSM™, may be used to manage image and text; a stream-data server, such as TigerShark™, may be used to manage audio and video.
To manage these data properly for a digital library application, a customized data model is frequently required, involving application-specific tables, attributes, structures, relationships, constraints, semantics, and optimization. In many cases, a digital library application is an extension of a customer's existing database and production application. In other cases, it is a component of the customer's overall information technology vision. Thus the data management requirements can be much broader than those of the digital library application alone. For these reasons, the data model requirements are often different even between two similar digital library applications within the same industry.
In the publishing industry, for example, a publisher typically designs its own proprietary database to maintain its bibliography and content data for producing new, electronic products. There are also reported cases in which different organizations within a large enterprise require different metadata on the same data. Therefore, it is not possible to pre-design a fixed database that can support all digital library applications, except for the case where a relatively simple and generic model, such as that of VisualInfo™, is sufficient.
Without a common data model, software vendors/developers are not able to produce re-usable software, namely applications, middleware, tools, or utilities, that access a large amount of information efficiently. Although it is sometimes possible for an application to dynamically "discover" the data model from a "bootstrap" model, the performance of such an approach would not be acceptable and the restrictions would be severe. Furthermore, for a DBMS that supports query compilation, e.g., DB2™, a target database is needed for software compilation and it must be distributed together with compiled software.
Even if a common data model were possible, the model would mask the underlying resource managers, thereby preventing full utilization of the resource manager capabilities, for instance, the version support for objects and the retention management capability in ADSM™. In fact, the common data model would "freeze" the data management technologies, preventing further exploitation of new capabilities in the future. In theory, the higher-level data model can be extended when an underlying resource manager is enhanced. This is not practical because of the multitude of resource managers, and in fact it is not always possible because the higher-level model would not be able to reflect all lower-level capabilities. For this reason, many application developers and system integrators prefer using the application programming interfaces (APIs) of the resource managers directly, especially standardized APIs such as SQL.
Moreover, an essential operation for a digital library (and for many other heterogeneous information systems) is to load information into the library. Typically performed by authorized workers, this operation is frequently high-volume, batch-oriented and performance-sensitive. It usually requires proper coordination among the separate operations against the underlying resource managers in order to avoid inconsistencies. Such coordination is similar to the data synchronization required for distributed data processing, in which case techniques such as "two-phase commit" are well-known. However, most resource managers used by a digital library do not have a two-phase-commit capability.
On the other hand, a rigorously synchronized operation that is required for on-line transaction processing (OLTP) is not necessarily appropriate for digital libraries. For example, to protect against failure during batch updates (e.g., loading data), a restart capability relying on redundancy available outside the digital library system (e.g., content source files) can be equally effective but much more efficient than a conventional transaction-rollback followed by a rollforward using a complete transaction log.
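The restart approach described above can be illustrated with a minimal sketch. It assumes the content source is an ordered list of (key, value) records that stays available outside the library, that loading a record twice is harmless, and that a small checkpoint file records the next index to load. The names `load_batch`, `store`, `checkpoint_path`, and `fail_at` are hypothetical and chosen only for illustration; a plain dictionary stands in for the resource managers.

```python
import json
import os
import tempfile


def load_batch(source_records, store, checkpoint_path, fail_at=None):
    """Load records into `store`, resuming after the last checkpointed index.

    No transaction log is kept: on restart, the remaining records are simply
    re-read from the source, which is assumed to survive outside the library.
    Returns the number of records loaded by this call.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(source_records)):
        if fail_at is not None and i == fail_at:
            # Hypothetical hook used only to simulate a crash mid-batch.
            raise RuntimeError("simulated failure")
        key, value = source_records[i]
        store[key] = value  # idempotent: reloading a key overwrites it
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)

    # Batch finished: the checkpoint is no longer needed.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return len(source_records) - start


if __name__ == "__main__":
    records = [("doc1", "a"), ("doc2", "b"), ("doc3", "c")]
    library = {}
    ckpt = os.path.join(tempfile.mkdtemp(), "load.ckpt")
    try:
        load_batch(records, library, ckpt, fail_at=2)
    except RuntimeError:
        pass
    # The second call picks up where the first one stopped.
    load_batch(records, library, ckpt)
    print(sorted(library))
```

In this sketch the only recovery state is a single index, which is why a restart of this kind can be cheaper than rolling back and then rolling forward from a complete transaction log.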
Asynchronous operations are not only acceptable but also frequently preferred. The following are a few motivations:
1. The DB2 (Version 2) Load Utility, which does not allow record-level synchronization, is much more efficient than individual insertion of records.
2. Full-text indexing of text objects is usually much more efficient if performed in batch (asynchronous with object insertion) than performed individually (synchronized with insertion).
3. Synchronous indexing of text objects also leads to long DBMS transactions which degrade DBMS performance due to locking.
4. Recoverable deletion (required to support transaction rollback) of a large object can be very expensive unless the resource manager provides an efficient support. Most object repositories, such as ADSM™, do not. On the other hand, non-recoverable deletion is acceptable for many digital library applications.
5. For ADSM™, retention management can be used more efficiently and effectively to delete old "versions" of objects than to delete them individually and explicitly.
To support asynchronous, but coordinated, operations, a multi-state consistency model is usually a better transaction model for a unit of work than the binary model ("all done" or "all not done"), which is appropriate for OLTP. On the other hand, the "nested transaction" model that is suitable for engineering design and other long-duration applications is not sufficient for digital libraries, since there is often no pre-determined ordering of the coordinated operations, and furthermore, parallelism is preferred when possible.
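One possible reading of a multi-state consistency model is sketched below. It is an illustration under stated assumptions, not a definition taken from this description: each coordinated operation of a unit of work carries its own state, operations may complete in any order, and the overall status is derived from the set of per-operation states instead of being a single done/not-done flag. The class names, step names, and status labels are all hypothetical.

```python
from enum import Enum


class StepState(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


class UnitOfWork:
    """Tracks independent operations against separate resource managers."""

    def __init__(self, steps):
        if not steps:
            raise ValueError("a unit of work needs at least one step")
        self.steps = {name: StepState.PENDING for name in steps}

    def mark(self, step, state):
        # Steps may be marked in any order; no ordering is enforced.
        if step not in self.steps:
            raise KeyError(step)
        self.steps[step] = state

    def status(self):
        """Derive an overall status from the individual step states."""
        states = set(self.steps.values())
        if StepState.FAILED in states:
            return "needs-repair"
        if states == {StepState.DONE}:
            return "consistent"
        if states == {StepState.PENDING}:
            return "not-started"
        return "partially-loaded"


if __name__ == "__main__":
    uow = UnitOfWork(["catalog_row", "object_store", "text_index"])
    uow.mark("object_store", StepState.DONE)  # object stored first
    uow.mark("catalog_row", StepState.DONE)   # catalog row inserted later
    print(uow.status())                       # text index still pending
    uow.mark("text_index", StepState.DONE)    # batch indexing catches up
    print(uow.status())
```

The intermediate "partially-loaded" status is what distinguishes this from the binary model: an application can decide, for instance, whether an object that is stored but not yet indexed may already be visible to users.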
Besides asynchronous operations, many digital library applications actually have special consistency requirements (e.g., whether "orphan" objects are allowed) and operational requirements (e.g., whether inserting an already existing object constitutes an error, and how to handle such a condition). Fitting all these requirements into a fixed paradigm of transaction and constraint, if possible at all, would require many artificial work-arounds for resource managers. Furthermore, data loading is an integral part of the content creation/capture/import process, which undoubtedly varies with each application because of the diverse content sources and creation/capture tools. While some applications load data from files, others prefer data loading from a buffer (e.g., after performing image enhancement, watermarking, compression, or encryption). Still others need to import removable media (e.g., CD-ROMs) with ready-to-use contents that are either too costly to copy (namely, load into the digital library storage) or cannot be legally copied due to copyright constraints.
Because of these many dependencies on the application, custom software is usually needed for accessing digital library data. However, transaction management requires systems skills that many application developers (who typically focus on information capture and distribution) are reluctant to invest in. Moreover, the developers also need working knowledge to handle any unique feature or constraint a resource manager may have. For example, ADSM™ requires a transaction COMMIT after every deletion or after a certain number of insertions. This requires special treatment to maintain a coordinated transaction and to accomplish a rollback.
To simplify application development, a common approach is for a system or middleware developer to provide an API that hides systems logic and subsystem interfaces. Lacking a common data model and common transaction semantics, it is difficult to define an API that is suitable for many applications. Although in principle an API can continually grow to become "more complete", this is not feasible since there is an unlimited number of cases to consider, and in the meantime the API becomes increasingly expensive to maintain, harder to use, and a bigger compatibility burden down the road.
Without a way to produce re-usable software to access (load, update, retrieve, delete) data stored in a digital library, except for the limited case where a generic data and transaction model is sufficient, custom software has to be developed for each application to coordinate resource managers. This process is expensive and time-consuming and it requires some systems skills.