1. Field of Invention
The present invention relates generally to the field of data integration and data exchange. More specifically, the present invention is related to a system and method of integrating time-aware data from multiple sources.
2. Discussion of Related Art
With the amount and variety of data available, such as curated databases, enterprise data, and publicly available data over the Internet, it is rare for information about an entity to be completely contained and managed by a single data source. There is often great value in combining data from multiple sources, or in combining various versions of data reported by the same source over time, to produce a more complete understanding. For example, patients typically visit multiple medical professionals/facilities over the course of their lifetime, and often even simultaneously. While it is important for each medical facility to maintain medical history records for its patients, there is even greater value for both the patient and the medical professionals to have access to an integrated profile derived from the history kept by each institution. Likewise, a potential employer would find value in combining a job applicant's resume with other data, such as public profile data or even previous versions of a resume.
These examples illustrate that the time aspects of data can be critically important. It is important to know, for example, if two different drugs with an adverse interaction have been prescribed to a patient in the same time period. Likewise, if different sources report that a job applicant has held multiple positions within the same time period, it would be useful for a hiring manager to know the order in which the titles were held in order to infer whether the applicant was promoted, demoted, or perhaps provided an embellished resume.
Several challenges arise when integrating time-aware data, which refers to data that contain explicit time-specific information, such as the date of a prescription, or implicit time information, such as the version number of an instance. First, the time aspect associated with the data is often imprecise. A facility may report that the patient was treated for a condition on a specific date. From this information, one can infer that the patient must have had the condition on the day he/she was seen, but one cannot say if the patient still has the condition, or for how long prior to or after the visit he/she had the condition. When combined with information from other visits to the same or other clinics, it is possible to incrementally create a more and more accurate medical history for the patient.
Second, as in traditional data integration, inconsistencies may arise with respect to certain specified constraints when data from multiple sources are combined. An added complexity arises from the need to handle certain constraints across time (see the paper to C. S. Jensen et al., “Extending existing dependency theory to temporal databases,” IEEE Trans. Knowl. Data Eng., 8(4): 563-582, 1996). For example, while it may be true that an employee may only receive one salary package from an employer at a time, it is possible for the employee to simultaneously receive multiple salary packages if he/she is employed by multiple companies at the same time. As another example, reports filed with the U.S. Securities and Exchange Commission (SEC) or corporate press releases may report that an executive held a particular title on a given day, but they do not provide information about when that title was first held, or even if it is still held after the report or press release is made public. Another data source (or even the same data source at a different point in time) may report that the executive was employed by the company at a date later than the date the first source reported his or her title. Both reports give imprecise information. What can be inferred about the employment history of the executive? Should it be assumed that he/she had been employed by the company as of the (earlier) date associated with his or her title, or should that value be disregarded in favor of the (later) date reported by the second source?
When integrating information about the same entity from multiple sources over time, the challenge is to maintain time consistency of the facts that are known about the entity, given that such facts are learned from different sources at different times, and the time associated with them may be imprecise. Ideally, the integration process should respect schema constraints and functional dependencies across time, and possess idempotent, commutative and associative properties to ensure a time-consistent profile of the entity, regardless of the order in which the facts are learned.
Current techniques do not provide such a guarantee. A standard bi-temporal database, for example, could be used to track when facts are learned, but it does not guarantee that the most current understanding of the facts will be the same, regardless of the order in which updates occur. Consider the following example:
UPDATE STOCKHOLDINGS FOR PORTION OF BUSINESS_TIME
FROM '08/23/2010' TO CURRENT DATE
SET SHARES = 141
WHERE NAME = 'Freddy Gold'

UPDATE STOCKHOLDINGS FOR PORTION OF BUSINESS_TIME
FROM '08/20/2010' TO CURRENT DATE
SET SHARES = 396043
WHERE NAME = 'Freddy Gold'
If the updates are executed in this order, the database will record that the current understanding is that Freddy Gold has had 396043 shares of stock since August 20, because the second update overwrites the entire period covered by the first. If the order of the statements is reversed, however, the database will record that the current understanding is that Freddy Gold has 141 shares of stock, and that this has been true since August 23. While both facts may have been true at different points in time, it is unclear how many shares of stock Freddy has today. Is the second update a correction to the first, or just a fact that arrived out of order? Such subtleties and challenges associated with the problem of consistently integrating time-aware data are explored with a concrete example next.
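The order dependence can be reproduced with a minimal sketch in Python. It is not how a temporal database stores rows: the day-by-day dictionary, the helper names, and the value of CURRENT_DATE (assumed here to be August 31, 2010) are all hypothetical choices made only for illustration.

```python
from datetime import date, timedelta

# Assumption for this sketch only: "CURRENT DATE" is August 31, 2010.
CURRENT_DATE = date(2010, 8, 31)
ONE_DAY = timedelta(days=1)


def update_portion(timeline, start, shares):
    """Mimic UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO CURRENT DATE.

    The timeline maps each day to the number of shares believed to be held
    on that day; the update overwrites every day in [start, CURRENT_DATE).
    """
    day = start
    while day < CURRENT_DATE:
        timeline[day] = shares
        day += ONE_DAY


def current_understanding(timeline):
    """Return (shares, since): the latest value and the first day of its run."""
    since = max(timeline)
    shares = timeline[since]
    while timeline.get(since - ONE_DAY) == shares:
        since -= ONE_DAY
    return shares, since


def run(updates):
    """Apply the updates in the given order and report the final state."""
    timeline = {}
    for start, shares in updates:
        update_portion(timeline, start, shares)
    return current_understanding(timeline)


first = (date(2010, 8, 23), 141)
second = (date(2010, 8, 20), 396043)

in_order = run([first, second])
reversed_order = run([second, first])
print(in_order)
print(reversed_order)
```

The two runs apply exactly the same pair of updates, yet they end with different values for the shares held today, which is the behavior the integration process is meant to avoid.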
Motivating Example:
FIG. 1 shows a simplified form of a real example where information about Freddy Gold is integrated from data extracted from several sources, including different reports filed with the SEC (Forms 10K and Forms 3/4/5) that are available via the EDGAR database (see SEC website regarding The EDGAR Public Dissemination Service), different versions of resumes, corporate websites, and news articles available electronically. For simplicity, it is assumed that each row shown on the left of FIG. 1 represents a separate filing or a version, even though in general, a filing or version may contain many rows of data.
For example, “SEC filings” in FIG. 1 show 7 facts taken from 7 reports filed with the SEC, each of which indicates the number of shares of a particular stock (OLP and BRT) held by Freddy Gold during the second half of 2010. The first row is a report that is filed on July 1 and indicates Freddy owned 396043 OLP shares on July 1. Though the date associated with the filing only records the day on which the fact was known to be true, it is reasonable to assume that the data in the filing are true until new information is received, such as from the report shown on the second row that indicates Freddy owned 13415 OLP shares on August 25.
At the same time, data extracted from different versions of corporate websites and news articles contain partial information about Freddy's employment history, and different versions of Freddy's resume give partial information about Freddy's education and employment history. How can the given information be best reconciled to compose a time-consistent profile so that one could understand his job history or how many shares of OLP he owned, for example, on August 24? This question is addressed next.
A first examination of the SEC reports suggests that it is reasonable to assume that Freddy had 141 OLP shares on August 24, since the third report indicates that this was the case since August 23. However, the 4th and 5th reports, filed at the later date of August 30, indicate that Freddy had 1322179 shares of OLP on August 20 and that this number only changed on August 26 to 396043 shares. So, did Freddy own 141 shares or 1322179 shares on August 24? Since the 4th and 5th filings were reported at a later date (i.e., they are more recent information that ‘corrects’ the earlier information), it would seem reasonable to assume that Freddy had 1322179 shares on August 24. If the same logic is applied to the 6th and 7th filings about his stock holdings in BRT, then Freddy must have owned 1820 shares of BRT on July 14. Alternatively, if the reports simply arrived out of order, then Freddy owned 141 shares of OLP on August 24 and 0 shares of BRT on July 14.
The discussion above raises subtleties that may arise when interpreting and integrating time-specific information under a constraint that is implicit in this example: Freddy can hold only one quantity of shares of a specific stock at any point in time. Hence, when conflicts arise (i.e., when there are at least two different possible numbers of shares of a stock held by Freddy at some point in time), one needs to resolve the conflict and decide the “right” number of shares under Freddy's integrated profile. One possible interpretation is shown on the right of FIG. 1. As shall be explained later, other interpretations of Freddy's stock holdings are possible depending on how the given dates are interpreted.
This example points out the need for an extensible framework to support different policies for integrating time-aware data. Regardless of the strategy used to resolve conflicting information, the integrated outcome (modulo syntactic representation of time) should be agnostic to the order in which data sources are integrated.
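One way to picture an order-agnostic outcome is the following Python sketch. It is an illustration under stated assumptions, not the method of the invention: the integrated profile is modeled as a plain set of (asof date, shares) facts taken from the 3rd, 4th, and 5th OLP reports discussed above, merging is set union, and shares_on is a hypothetical policy corresponding to the "arrived out of order" interpretation, in which a fact holds from its asof date until the next asof date.

```python
from datetime import date
from functools import reduce
from itertools import permutations

# (asof date, OLP shares) from the 3rd, 4th and 5th reports discussed above.
# Each report is modeled as a one-fact set; reported dates are ignored here.
REPORTS = [
    frozenset({(date(2010, 8, 23), 141)}),
    frozenset({(date(2010, 8, 20), 1322179)}),
    frozenset({(date(2010, 8, 26), 396043)}),
]


def merge(a, b):
    """Integrate two profiles by set union.

    Union is idempotent, commutative and associative, so the merged
    profile does not depend on the order in which reports arrive.
    """
    return a | b


def shares_on(profile, day):
    """Hypothetical policy: a fact holds from its asof date until the next one.

    This sketch does not handle two facts with the same asof date and
    different values; a real policy would have to resolve that conflict.
    """
    known = [(asof, shares) for asof, shares in profile if asof <= day]
    return max(known)[1] if known else None


# Integrate the reports in every possible order and query August 24.
answers = {
    shares_on(reduce(merge, order, frozenset()), date(2010, 8, 24))
    for order in permutations(REPORTS)
}
print(answers)
```

Because the policy is a pure function of the merged set, every arrival order gives the same answer for August 24. A different policy, for example one that prefers facts with a later reported date, could be substituted for shares_on without affecting this order independence.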
The discussion below describes known prior art techniques for data integration and data exchange.
Data Integration and Data Exchange
Even though tremendous progress on data integration and data exchange has been made in the past few decades, prior techniques and systems for data integration and data exchange are largely agnostic to time, and hence, they cannot be immediately applied to satisfactorily build an integrated archive over time. In fact, assuming that all extracted data are placed in a format ready for integration, the state-of-the-art data integration and data exchange systems still cannot be used to automatically derive a consistent understanding of Freddy Gold's longitudinal profile, such as what is shown on the right of FIG. 1 and FIG. 2B. It would require non-trivial extensions and in particular, the use of ad hoc functions to create a temporally consistent view of the data sources under known constraints. Except for the paper to H. Zhu et al., “Effective data integration in the presence of temporal semantic conflicts,” Intl. Symp. on Temporal Representation and Reasoning, TIME, pp. 109-114, 2004, which provides a discussion on three types of temporal heterogeneity that need to be resolved when integrating data across time, the problem of integrating and exchanging data across time has not been systematically and thoroughly addressed in prior work in this area.
What is needed is a systematic extension of a data exchange system that can be used to integrate and exchange data across time. A data exchange specification is a triple (S, T, Σ), where S is a source schema, T is a target schema, and Σ is a set of schema mappings, which are high-level declarative specifications of the relationship between instances of two schemas. Given a source instance I of S, the goal of data exchange is to materialize a target instance J of T so that I and J together satisfy Σ. The generic architecture of a data exchange system consists of a module that takes the specification and compiles it into executable code. The executable code can then be applied to I to obtain J (e.g., see the paper to L. Popa et al., “Translating Web Data,” VLDB, pp. 598-609, 2002). The target instance can also be obtained by applying the chase procedure on I with respect to the specification. A fundamental assumption that is often implicit in the data exchange framework is that the target instance is created as a union of facts that are obtained from the result of the data exchange. After the exchange, all target facts are unioned to obtain J, where under set union, identical facts are fused into one. When conflicting facts arise in the presence of functional dependencies in the target (which are modeled as target equality generating dependencies), the data exchange will fail and no target instance will be materialized. Users are often left to deal with the inconsistencies manually or apply data cleaning techniques to resolve inconsistencies. There are no known techniques for resolving inconsistencies in data across time. In fact, ad hoc functions are typically added to manage inconsistencies with respect to time during data integration.
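The failure mode described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the schema mapping is taken to be an identity copy of source facts into the target, the target functional dependency is (name, stock) → shares, and the names exchange and ExchangeFailure are hypothetical.

```python
class ExchangeFailure(Exception):
    """Raised when the target functional dependency (an egd) is violated."""


def exchange(source_instances, key_len):
    """Union all target facts under a key -> value functional dependency.

    Each fact is a tuple whose first key_len fields form the key.
    Identical facts are fused into one by the union; two facts that agree
    on the key but differ on the remaining fields make the exchange fail.
    """
    target = {}
    for instance in source_instances:
        for fact in instance:
            key, value = fact[:key_len], fact[key_len:]
            if target.setdefault(key, value) != value:
                raise ExchangeFailure(
                    f"conflict on {key}: {target[key]} vs {value}"
                )
    return {key + value for key, value in target.items()}


# Two sources reporting the same time-agnostic fact: fused into one.
agreeing = exchange(
    [{("Freddy Gold", "OLP", 141)}, {("Freddy Gold", "OLP", 141)}],
    key_len=2,
)
print(agreeing)

# Two sources reporting different share counts: no target instance results.
try:
    exchange(
        [{("Freddy Gold", "OLP", 141)}, {("Freddy Gold", "OLP", 396043)}],
        key_len=2,
    )
    failed = False
except ExchangeFailure as err:
    failed = True
    print(err)
```

Because the facts carry no time information, the two share counts cannot both be kept, and the sketch simply fails, leaving the user to reconcile them by hand.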
(Bi-)Temporal Databases
There is a large body of work on bi-temporal databases. Chapter 14 in the book by J. Chomicki et al., Temporal Databases, Foundations of Artificial Intelligence, Elsevier, 2005 and the book C. S. Jensen et al. Eds, Temporal Database Entries for the Springer Encyclopedia of Database Systems, Springer, 2009 provide a comprehensive overview of related work and concepts in this area. Techniques in bi-temporal databases cannot be immediately applied to integrate and exchange data across time. First, bi-temporal databases have only two specific notions of time, namely valid-time and transaction-time (which are also known as application-time and system-time, respectively). Valid-time denotes the time at which a tuple is valid in the real world, while transaction-time denotes the time at which updates are entered into the database and hence, it can only increase as updates are entered. However, the order of integration, whether according to asof or reported time, may not respect transaction-time semantics. The work of M. Roth and W-C Tan in the paper, “Data integration and data exchange: It's really about time,” CIDR, 2013 provides a detailed example and discussion on why bi-temporal databases cannot be applied. Second, the valid-transaction-time semantics is not always the “right” semantics. In fact, none of the integrated archives shown in FIG. 2B corresponds to the result that one would obtain with valid-transaction-time semantics. Different applications may require different semantics to integrate data across time. The “correct” semantics may depend only on the application at hand and this running example points out the need to provide an extensible framework that goes beyond valid-transaction-time semantics so that alternative semantics can be adopted as needed. In principle, additional attributes can be added to a relation to capture application-specific time-related information that may exist in the data (such as asof and reported time).
However, such additions will necessitate the use of (ad hoc) triggers, user-defined functions, or stored procedures to manipulate time in the way that is desired. Finally, except for the paper to F. Currim et al., “TX schema: Support for data- and schema-versioned xml documents,” Technical Report TR-91, TimeCenter at Aalborg University, September 2009 and the paper to H. J. Moon et al., “Managing and querying transaction-time databases under schema evolution,” PVLDB, 1(1): 882-895, 2008, most implementations of bi-temporal databases are relational. The works identified above of F. Currim et al. and H. J. Moon et al. (the latter of which stores relational data in XML) follow bi-temporal valid-transaction-time semantics, and significant logic would need to be added to allow time to be manipulated in alternative ways.
Archiving, Versioning, and Annotation Systems
Different techniques for archiving data exist, going back to multi-version control systems (see the paper to P. A. Bernstein et al., “Concurrency control in distributed database systems,” ACM Comput. Surv., 13(2): 185-221, June 1981) with certain ACID guarantees, diff-based version management systems (e.g., see the paper to A. Marian et al., “Change-centric management of versions in an xml warehouse,” VLDB, pp. 581-590, 2001), or reference-based approaches (e.g., see the paper to S-Y Chien et al., “Efficient management of multiversion documents by object referencing,” VLDB, pp. 291-300, 2001) for hierarchical data, to techniques that compact versions based on key constraints (see the papers to P. Buneman et al., “The database wiki project: A general purpose platform for data curation and collaboration,” SIGMOD Record, 40(3): 15-20, 2011, and “Archiving scientific data,” ACM TODS, 29: 2-42, 2004, and the paper to I. Koltsidas et al., “Sorting hierarchical data in external memory for archiving,” PVLDB, 1(1): 1205-1216, 2008). Archiving can be construed as a form of data integration across versions of data. Techniques based on key constraints have the advantage over version or reference-based approaches because they explicitly track the evolution of entities over time. However, all the systems above apply only to a single dimension of time (i.e., versions of data) and cannot be immediately generalized to manage multiple dimensions of time. Time-specific information can be regarded as a type of annotation and the “additive” commutative monoid of a provenance semiring can be applied to obtain a union of such annotated data sources (see the paper to T. J. Green et al., “Provenance semirings,” PODS, pp. 31-40, 2007 and the paper to E. V. Kostylev et al., “Combining dependent annotations for relational algebra,” ICDT, pp. 196-207, 2012).
However, a mechanism for understanding how conflicts can be resolved when combining annotations is still required to ensure that constraints in the target schema are satisfied.
Data Conflict Resolution
Data conflict resolution for integration is a well-studied area (see the paper to J. Bleiholder et al., “Data fusion,” ACM Comput. Surv., 41(1): 1-41, 2009 and the paper to X. L. Dong et al., “Data fusion—resolving data conflicts for integration,” PVLDB, 2(2): 1654-1655, 2009). However, existing techniques for data conflict resolution are agnostic to time.
Complex Event Processing, Streams, and Uncertain Data
Complex event processing and data streams form another area of related research (see the paper to R. S. Barga et al., “Consistent streaming through time: A vision for event stream processing,” CIDR, pp. 363-374, 2007). The goal of such systems is to make decisions based on continuously streaming data that may arrive in order or out-of-order (see the paper to M. Liu et al., “Sequence pattern query processing over out-of-order event streams,” ICDE, pp. 784-795, 2009), and for which the time element associated with data values may be known with certainty or may be imprecise (see the paper to H. Zhang et al., “Recognizing patterns in streams with imprecise timestamps,” Proc VLDB Endow., 3(1-2): 244-255, September 2010). Data integration scenarios, in contrast, introduce requirements to model constraints of time-aware data, and to enable specification of application-specific policies to resolve violations as part of the integration process to produce a consistent integrated result.
Embodiments of the present invention are an improvement over prior art systems and methods.