In a complex technological society, there is an ever-increasing need to store and retrieve information "globally", i.e., so as to allow access and use by a number of different societal entities. Because information is often accumulated in various, geographically dispersed sites, however, it is extremely difficult to synchronize and coordinate the information stored in those sites. Therefore, traditional information processing systems rarely make information available on a global basis without first centralizing it.
In response to the growing demand for a more efficient means to globally share information that is geographically dispersed, distributed computing systems have become a much more attractive means of information processing. A "distributed computing system" can be defined as a collection of multiple autonomous processors, usually with data stored in associated databases, typically located in geographically remote sites, that are interconnected by data communications links.
Because of technological advances in communications and microelectronics, as well as a decline in hardware costs, distributed computing systems have experienced prolific growth in the last decade. Distributed computing systems are now being utilized in complex system design and application-oriented issues, including such well known examples as automated teller machine networks, airline reservation systems, and on-line validation of credit card transactions.
While substantial research has recently been devoted to distributed computing systems, much work remains to facilitate efficient data storage and retrieval, especially in view of the problems resulting from storage in geographically dispersed databases. One of the most difficult problems associated with distributed computing systems is the heterogeneous nature of the multiple processors or databases.
The autonomous processors in a distributed computing system can be homogeneous or heterogeneous. "Homogeneous" processors or databases are of the same kind, with the same data structures, and utilize the same data communications protocols. On the other hand, "heterogeneous" processors or databases are of different kinds, with different data models or structures, and they generally do not share information. Therefore, problems associated with intercommunication between heterogeneous processors and databases are much more complex and difficult than with homogeneous processors or databases.
Because the application programs for various known data processing systems are usually developed to meet the specific needs of different groups of users and without regard to compatibility with other data processing systems, most existing database systems are heterogeneous. Therefore, there is a general lack of coordination between heterogeneous databases, often leading to the duplication of data as well as a lack of data consistency among the files of different users.
Application of Distributed Database to Health Care Industry
A good example of a heterogeneous database environment is found in the health care industry. The health care industry comprises a wide variety of interrelated organizations, such as hospitals, insurance companies, health maintenance organizations (HMO's), testing labs, utilization review firms, and insurance payors and administrators. Many hospitals and hospital-management companies manage and run their own data processing systems, which often do not communicate between systems within the same company, let alone with systems of unrelated organizations. Some health care organizations run the same type of computer system at different geographical sites, and are therefore homogeneous in this respect, but cannot communicate the data between different sites. This lack of distributed homogeneity results in isolated homogeneous "islands" of information.
Even among organizations with homogeneous systems, it is possible that different entities (e.g. different hospitals that treat the same patient at different times) will store different information about the same person, or may store the same information using a different key identifier. Such occurrences introduce a degree of heterogeneity into a generally homogeneous computing environment.
A heterogeneous database environment is even more problematic than a homogeneous environment, and also does not provide optimum information processing to the health care community. For example, a given person may be a patient at more than one hospital during a given period of time. The identity of that person is a global fact--it is the same person that visits the hospital, although at different times, perhaps, and with different maladies. Because information about that person is entered and stored in more than one data processing system, there is duplication of information, and there is the undesirable possibility that inconsistent data will be accumulated about a particular person.
It is desirable that remotely located autonomous databases interact with others to share information. For instance, an insurance company may want to access patient records found in a hospital database. The hospital may likewise want to access information found in the insurance company database, such as whether and to what extent a particular patient has insurance coverage. Furthermore, it may be useful, or even life-critical, for information acquired at one hospital to be provided to the other, for example, the fact that a person is known to be allergic to certain medications.
Currently, such information is typically exchanged manually, e.g., a hospital administrator may telephone a patient's insurance carrier to determine information relating to the patient's insurance benefits. Alternatively, an insurance company may maintain a dedicated terminal at the hospital for remote access to the insurance company's database. Both of these methods, however, lack the level of automation necessary to support the global exchange of medical information across multiple heterogeneous databases. In addition, these known methods are error-prone since they do not provide a system for tracking the success or failure of information processing, such as adding and updating information in the global system.
As alluded to above, under conventional systems of information processing in the health care industry, it is often the case that information about a given patient is stored in multiple locations. For example, both a hospital and an insurance company may have the same information stored in their computer databases for the same patient, such as his or her address, telephone number, birthdate and other demographic information. The duplicative storage of information in autonomous databases is inefficient because it requires the expenditure of extra resources (in terms of human effort) to enter the information twice. There is also an increased probability of error because of the potential for inconsistent updating of information. Thus the risk that a user may access and rely on old information is greater, a situation that could be particularly dangerous or even life-threatening in the health care industry.
Accordingly, there is a need for methods and systems that provide for the global exchange of medical information within a health care community or other similar environment. The systems should provide a seamless interface between a plurality of remotely located, heterogeneous databases and a corresponding homogeneous data model so as to allow the retrieval and storage of information on a global basis. Such a system should also provide a mechanism for monitoring the status of the data within the system to ensure that users have access to the most current information available on the network.
Prior Art--The Galaxy Distributed Operating System
One approach to certain problems with distributed database systems is that taken in the Galaxy distributed operating system, which employs an object-oriented database model. See Sinha et al., "The Architectural Overview of the Galaxy Distributed Operating System", READINGS IN DISTRIBUTED COMPUTING SYSTEMS, IEEE Computer Society Press (edited by Casavant & Singhal, 1994), p. 327. (Object-oriented programming methodologies are discussed further hereinbelow.) In the Galaxy system, a mapping table, also called an "ID table", is utilized for object locating; each entry in the mapping table consists of locating information for an object. An ID table entry (IDTE) contains information about the type of the object, an access control list for the object, the locations of the object's replicas, and the locations where copies of the IDTE itself exist (called a copy list). The replica list allows an object-locating operation to return all locations of the desired object. The Galaxy system uses the copy list to link together all IDTE's for the same object so that any modification can be made consistently to all copies. Given an object's ID, the Galaxy system can determine that object's physical locations by looking up the given ID in the ID table and extracting the physical locations of its replicas. However, choosing the method of maintaining the mapping table has proven to be a difficult task.
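The IDTE structure described above can be sketched as follows. This is an illustrative reconstruction in Python; the class, field, and function names are hypothetical and are not taken from the Galaxy system itself.

```python
from dataclasses import dataclass

# Hypothetical sketch of a Galaxy-style ID table entry (IDTE), based on the
# description above; all names are illustrative, not Galaxy's own.
@dataclass
class IDTableEntry:
    object_type: str            # type of the object
    access_control_list: list   # principals permitted to access the object
    replica_locations: list     # nodes holding replicas of the object
    copy_list: list             # nodes holding copies of this IDTE

# A node's ID table maps an object ID to its IDTE.
id_table: dict = {}

def locate_object(object_id: str) -> list:
    """Return all known physical locations of the object's replicas."""
    entry = id_table.get(object_id)
    return entry.replica_locations if entry else []

id_table["obj-42"] = IDTableEntry(
    object_type="patient-record",
    access_control_list=["hospital-a"],
    replica_locations=["node-1", "node-3"],
    copy_list=["node-1", "node-2"],
)
print(locate_object("obj-42"))  # ['node-1', 'node-3']
```

An object-locating operation is thus a single table lookup, which is why, as noted above, the difficult design question is not the lookup itself but how the table is maintained across nodes.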
Specifically, object-locating mechanisms that were proposed and rejected in the Galaxy system include broadcasting, hint cache and broadcasting, chaining, a centralized server (in which the entire ID table is kept on a single node), and replication (in which the entire ID table is replicated on all nodes). The Galaxy system designers apparently decided that all of these mechanisms suffer from one or more common limitations of poor reliability, poor scalability, and poor efficiency. The Galaxy system therefore uses a mechanism unlike any of those mentioned above: it keeps on a particular node only the locating information for those objects that have some possibility of being accessed from that node.
It is clear from the literature that the Galaxy operating system is optimized for a homogeneous data model, since the architecture chosen for maintaining ID's in a directory located at each particular node can only operate for objects that can be accessed from the concerned node, which implies homogeneity.
Object-oriented database models incur other difficulties as a result of global object identity and object sharing. Global object identity is believed by those skilled in the art to be expensive because of the lack of a global virtual address space. The article by Ozsu and Valduriez, "Distributed Data Management--Unsolved Problems and New Issues", READINGS IN DISTRIBUTED COMPUTING SYSTEMS, supra, mentions this on page 531. The literature suggests that managing distributed shared objects is difficult, since inconsistencies can occur when a shared object is moved and updated at another site. Solutions suggested in the literature include the use of an indirection table or the avoidance of such problems at compile time.
According to the literature, with the indirection table approach, distributed "garbage collection" purportedly remains an open problem. However, the notion of garbage collection implies that a given object exists only in a single instance, and that if outmoded (as when updated), the object is transformed into a different conceptual entity with a different object ID. Thus, one would discard an outdated object in favor of a newer, updated object, necessarily having a different object ID. But this approach suffers from the disadvantage of excessive use of object identifiers. Excessive use of object identifiers is inefficient and costly, and is conceptually flawed, since there is generally no need to discard an object merely because its data is outdated. It would be more efficient from a resources standpoint to maintain an object's identifier for as long as the need for any information whatsoever about the object exists.
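The alternative suggested above, keeping one stable identifier per object for its entire lifetime, can be illustrated with a minimal indirection-table sketch. All names below are hypothetical; the point is that moving or updating an object changes only the table entry, never the object's ID, so no identifiers are discarded and no garbage is created.

```python
# Minimal sketch (hypothetical names) of an indirection table that preserves
# a stable object ID: relocation rewrites the mapping, not the identifier.
class IndirectionTable:
    def __init__(self):
        self._locations = {}  # object ID -> current site

    def register(self, object_id, site):
        self._locations[object_id] = site

    def move(self, object_id, new_site):
        # Only the mapping changes; every holder of the ID remains valid.
        self._locations[object_id] = new_site

    def resolve(self, object_id):
        """Return the object's current site, or None if the ID is unknown."""
        return self._locations.get(object_id)

table = IndirectionTable()
table.register("patient-007", "hospital-a")
table.move("patient-007", "hospital-b")
print(table.resolve("patient-007"))  # hospital-b
```

Because references held elsewhere store only the stable ID, an update or relocation at one site never invalidates identifiers held at other sites.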
It is therefore apparent that the literature teaches away from the use of a centralized server for purposes of object management. It is also apparent that there is a need for cross-platform, distributed database systems and methods that operate to transform data from heterogeneous data structures into a homogeneous data structure efficiently.
Other Approaches to Distributed Database Systems
Certain prior approaches to data storage and retrieval in distributed systems are concerned with optimization of read latency and data availability associated with objects resident on the distributed system. Some approaches concentrate on how to distribute updates or transaction activity to all of the copies of an object resident in the network. Usually, many copies are kept to enhance data availability should certain network nodes or communication links fail. The more copies that are made, the more likely it is that a given object will be available for reading, and the lower the latency required to obtain data associated with the object. Significant effort is thus required to keep the copies of objects consistent across the distributed locations.
One approach to maintenance of data in a distributed system is found in the SYBASE REPLICATION SERVER, a software database product made by Sybase, Inc., Emeryville, Calif. This system envisions maintaining multiple copies of a set of data on multiple servers, perhaps at multiple sites. Each copy at a remote site "subscribes" to a subset of data maintained at another site. The replication service keeps the multiple copies updated by replicating transactions initiated at a particular site directed against tables or data of interest, copying those transactions, and forwarding the copy of the transactions to remote destinations that apply these transactions to local copies of the data maintained at the remote sites. Thus, this system is essentially a "transaction store and forward" system based on subscription information.
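The "transaction store and forward" scheme described above might be sketched as follows. This is a simplified illustration only; the classes and methods are hypothetical and are not part of the SYBASE REPLICATION SERVER product.

```python
# Hypothetical sketch of subscription-based transaction replication: a site
# subscribes to a table of interest, and each transaction against that table
# is copied and forwarded to every subscriber, which applies it locally.
class RemoteSite:
    def __init__(self):
        self.tables = {}  # table name -> list of applied transactions

    def apply(self, table_name, transaction):
        self.tables.setdefault(table_name, []).append(transaction)

class ReplicationService:
    def __init__(self):
        self._subscriptions = {}  # table name -> subscriber sites

    def subscribe(self, site, table_name):
        self._subscriptions.setdefault(table_name, []).append(site)

    def replicate(self, table_name, transaction):
        # Copy the transaction and forward it to each subscribing site.
        for site in self._subscriptions.get(table_name, []):
            site.apply(table_name, dict(transaction))

service = ReplicationService()
branch = RemoteSite()
service.subscribe(branch, "patients")
service.replicate("patients", {"op": "update", "key": 7, "phone": "555-0100"})
print(branch.tables["patients"])  # [{'op': 'update', 'key': 7, 'phone': '555-0100'}]
```

Note that what travels between sites is the transaction itself, not the resulting data; each remote site re-applies the forwarded transaction against its own local copy.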
Yet another approach to heterogeneous distributed database systems is the Thor object-oriented database management system being developed at Massachusetts Institute of Technology (MIT). See Liskov, et al., "Distributed Object Management in Thor", Programming Methodology Group Memo 77, to appear in "Distributed Object Management" by Ozsu, et al. (June 1993). (Some of the "object oriented" terms used here are defined later in this application.) The Thor system is a distributed system in which objects are stored at server nodes. Object repositories are provided for storing and managing persistent objects, and, ostensibly, indexes to find objects. Users interact with the distributed system at a front end computer system such as a terminal or personal computer (PC) that communicates with an object repository.
It appears that the Thor system requires significant processing overhead for object maintenance. The Thor authors specifically rejected use of a name service, and thus rejected use of logical references or pointers. Instead, Thor uses physical references or pointers to optimize read operations. Any object that is referred to by another related object appears to have some type of physical reference or link to other objects that have need of the data in the first object. If this is the case, then moving an object requires that all physical links or references from the referencing objects be updated so that such referencing objects can always "find" the object containing the data of interest. Although such a scheme may conceptually eliminate object replication and optimize read operations, it is complicated to maintain all of the physical pointers or links between related objects. Moreover, it appears that there is no client ownership of objects--all servers are equal and any client can update any object it can find, which creates complications of data security at distributed sites.
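The maintenance burden of physical references described above can be illustrated with a short sketch. All names here are hypothetical and are not drawn from the Thor system; the sketch simply shows that relocating an object forces every referencing object's stale physical pointer to be rewritten, whereas with logical references only a name-service entry would change.

```python
# Hypothetical sketch: objects hold *physical* references (object ID -> node),
# so moving an object requires updating every referrer's stale pointer.
class Obj:
    def __init__(self, oid, node):
        self.oid = oid
        self.node = node  # physical location of this object
        self.refs = {}    # referenced object ID -> node where it lives

def move(target, new_node, all_objects):
    """Relocate `target` and fix up every physical reference to it.
    Returns the number of referencing objects that had to be updated."""
    target.node = new_node
    updated = 0
    for obj in all_objects:
        if target.oid in obj.refs:
            obj.refs[target.oid] = new_node  # rewrite the stale pointer
            updated += 1
    return updated

a = Obj("a", "node-1")
b = Obj("b", "node-1")
c = Obj("c", "node-2")
b.refs["a"] = "node-1"
c.refs["a"] = "node-1"
print(move(a, "node-3", [a, b, c]))  # 2
print(c.refs["a"])                   # node-3
```

The fix-up cost grows with the number of referrers, which is the complication noted above: reads are fast, but relocation requires touching every object that holds a physical link to the moved object.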