Field of Invention
The present invention relates generally to the field of data integration and data exchange. More specifically, the present invention is related to a system and method of integrating time-aware data from multiple sources.
Discussion of Related Art
With the amount and variety of data available, such as curated databases, enterprise data, and publicly available data over the Internet, it is rare for information about an entity to be completely contained and managed by a single data source. There is often great value in combining data from multiple sources, or in combining various versions of data reported by the same source over time, to produce a more complete understanding. For example, patients typically visit multiple medical professionals/facilities over the course of their lifetime, and often even simultaneously. While it is important for each medical facility to maintain medical history records for its patients, there is even greater value for both the patient and the medical professionals to have access to an integrated profile derived from the history kept by each institution. Likewise, a potential employer would find value in combining a job applicant's resume with other data, such as public profile data or even previous versions of a resume.
These examples illustrate that the time aspects of data can be critically important. It is important to know, for example, if two different drugs with adverse interaction have been prescribed to a patient in the same time period. Likewise, if different sources report that a job applicant has held multiple positions within the same time period, it would be useful for a hiring manager to know the order in which the titles were held in order to infer if the applicant was promoted, demoted, or perhaps provided an embellished resume.
Several challenges arise when integrating time-aware data, which refers to data that contain implicit time-specific information, such as the date of a prescription, or explicit time information, such as the version number of an instance. First, the time aspect associated with the data is often imprecise. A facility may report that the patient was treated for a condition on a specific date. From this information, one can infer that the patient must have had the condition on the day he/she was seen, but one cannot say whether the patient still has the condition, or for how long before or after the visit he/she had it. When this information is combined with information from other visits to the same or other clinics, it becomes possible to incrementally build an increasingly accurate medical history for the patient.
Second, as in traditional data integration, inconsistencies may arise with respect to certain specified constraints when data from multiple sources are combined. An added complexity arises from the need to handle certain constraints across time (see the paper by C. S. Jensen et al., “Extending existing dependency theory to temporal databases,” IEEE Trans. Knowl. Data Eng., 8(4): 563-582, 1996). For example, while it may be true that an employee may only receive one salary package from an employer at a time, it is possible for the employee to simultaneously receive multiple salary packages if he/she is employed by multiple companies at the same time. As another example, reports filed with the U.S. Securities and Exchange Commission (SEC) or corporate press releases may report that an executive held a particular title on a given day, but they do not provide information about when that title was first held, or even whether it is still held after the report or press release is made public. Another data source (or even the same data source at a different point in time) may report that the executive was employed by the company at a date later than the date the first source reported his or her title. Both reports give imprecise information. What can be inferred about the employment history of the executive? Should it be assumed that he/she had been employed by the company as of the (earlier) date associated with his or her title, or should that value be disregarded in favor of the (later) date reported by the second source?
When integrating information about the same entity from multiple sources over time, the challenge is to maintain time consistency of the facts that are known about the entity, given that such facts are learned from different sources at different times, and the time associated with them may be imprecise. Ideally, the integration process should respect schema constraints and functional dependencies across time, and possess idempotent, commutative and associative properties to ensure a time-consistent profile of the entity, regardless of the order in which the facts are learned.
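The algebraic properties named above can be illustrated with a deliberately simple merge operation. The following Python sketch is not the integration method of the invention; it only shows what order independence looks like in practice. The function name merge, the sample dates, and the assumption that a fact held continuously between any two observations of it are all hypothetical choices made for illustration.

```python
from datetime import date
from functools import reduce
from itertools import permutations

def merge(a, b):
    """Combine two (earliest, latest) observation spans into the smallest
    span covering both. Assumes (hypothetically) that the fact held
    continuously between any two observations of it."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# Hypothetical point-in-time reports that an executive was employed.
reports = [
    (date(2010, 3, 1), date(2010, 3, 1)),
    (date(2010, 9, 15), date(2010, 9, 15)),
    (date(2010, 6, 30), date(2010, 6, 30)),
]

# Fold the reports together in every possible arrival order.
results = {reduce(merge, order) for order in permutations(reports)}

# Idempotence: merging a report with itself changes nothing.
same = merge(reports[0], reports[0])
```

Because min and max are each idempotent, commutative, and associative, every arrival order of the reports yields the same span, so results contains a single value. A real integration operator would also have to respect schema constraints and functional dependencies across time, which this sketch does not attempt.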
Current techniques do not provide such a guarantee. A standard bi-temporal database, for example, could be used to track when facts are learned, but it does not guarantee that the most current understanding of the facts will be the same regardless of the order in which updates occur. Consider the following example:

        UPDATE STOCKHOLDINGS FOR PORTION OF BUSINESS_TIME
        FROM '08/23/2010' TO CURRENT DATE
        SET SHARES=141
        WHERE NAME='Freddy Gold'

        UPDATE STOCKHOLDINGS FOR PORTION OF BUSINESS_TIME
        FROM '08/20/2010' TO CURRENT DATE
        SET SHARES=396043
        WHERE NAME='Freddy Gold'
If the updates are executed in this order, the database will record that the current understanding is that Freddy Gold has held 396043 shares of stock since 8/20, because the second update overwrites the entire period covered by the first. If the order of the statements is reversed, however, the database will record that the current understanding is that Freddy Gold has 141 shares of stock, and that this has been true since 8/23. While both facts may have been true at different points in time, it is unclear how many shares of stock Freddy has today. Is the second update a correction to the first, or just a fact that arrived out of order? Such subtleties and challenges associated with the problem of consistently integrating time-aware data are explored with a concrete example next.
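The order dependence of the two updates can be reproduced with a small simulation. The Python sketch below is a simplified model, not a description of any particular database engine: it assumes that each update simply overwrites the value for every day in its period (last writer wins), it stores one value per day rather than period rows, and it uses a hypothetical fixed date, TODAY, in place of CURRENT DATE.

```python
from datetime import date, timedelta

def apply_update(timeline, start, end, shares):
    """Return a copy of timeline with shares written for every day in
    [start, end). Models a last-writer-wins update of a time period."""
    updated = dict(timeline)
    day = start
    while day < end:
        updated[day] = shares
        day += timedelta(days=1)
    return updated

def run(updates):
    """Apply the updates in the given order, starting from an empty timeline."""
    timeline = {}
    for start, end, shares in updates:
        timeline = apply_update(timeline, start, end, shares)
    return timeline

TODAY = date(2010, 8, 30)  # hypothetical stand-in for CURRENT DATE

update_141 = (date(2010, 8, 23), TODAY, 141)
update_396043 = (date(2010, 8, 20), TODAY, 396043)

stated_order = run([update_141, update_396043])
reversed_order = run([update_396043, update_141])

# The same two facts produce different "current" values.
latest = TODAY - timedelta(days=1)
print(stated_order[latest], reversed_order[latest])
```

The two runs agree on the days before 8/23 but disagree on every day from 8/23 onward, which is exactly the ambiguity described above: the stored state reflects arrival order rather than anything intrinsic to the facts.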