1. Field of Invention
The present invention relates generally to the field of online analytic processing of data. More specifically, the present invention is related to aggregation queries over dynamically discovered entities, that appear with different guises in a database, without explicitly resolving the correct values for the attributes of the entities.
2. Discussion of Prior Art
Modern business intelligence routinely makes extensive use of customer and transactional data obtained from databases stored in data warehouses. Such business intelligence may typically be obtained by posing an analytical query to one or more associated relational databases. Execution of the analytical query may require data processing involving extensive tabular calculations, time series analyses, and retrieval of various aggregated data. These data are typically organized into tables of columns and rows, or tuples, with many tables comprising multiple tuples and multiple columns. Such data processing is generally referred to as online analytical processing (OLAP) and may include software designed to access and process raw data for use by business data analysts.
The business intelligence thus requires the integration of a plethora of different sources. Dimensional information such as customers, suppliers, and products from different sources very often overlap and appear with different guises in different systems. Even operational systems within an enterprise are usually isolated without any data integrity. The situation may be exacerbated by business acquisitions or mergers and partnerships which bring together disparate data sources, and by the increasing use of internal application data with outside data, such as may be generated by retailers and syndicated data feeds.
Entity Discovery is the process of determining if multiple instances are actually the same entity (like customer, supplier or product). Such a process can be automated and usually works by examining and combining similarities between the attributes of an entity. For example, consider two customer instances <John, Simpson, 555-123-4568, San Jose> and <Johnny, Simpson, 555-123-4567, San Francisco>. The high similarities between the first name, last name and phone can be an indicator that the two customers are one entity.
Entity Resolution is the process of determining the correct values for the attributes of an entity. For the example above, the correct values can be <John, Simpson, 555-123-4567, San Francisco>. Typically the entity resolution is performed statically on a data-warehouse and is very time-consuming. However such an approach is seldom exact. Additional corrections may be required as more instances become available rendering the static resolution obsolete.
The phenomenon where the same entity appears with different attributes values in different systems is called “Data Inconsistency” and the corresponding databases are called “Unresolved Databases”. Data inconsistency can have deleterious effects on business intelligence applications if it is not treated correctly. Performing aggregation directly on unresolved databases typically generates erroneous results. To illustrate the present state of the art, reference is made to Tables 1 and 2 that provide an example of sales activity with sales data integrated from two different operational systems.
TABLE 1CUSTOMER DATAOperationalSystemNameCustIDCityState1JohnC1San FranciscoCA2JohnnyC2San JoseCA1JackC3San FranciscoCA1WilliamC4San FranciscoCA2BillC5San JoseCA
In Table 1, customers (i.e., Instances) identified by CustID labels C1, C3 and C4 may originate from a first operational system and customers identified by CustID labels C2 and C5 may originate from a second operational system. Transactional data corresponding to the customers listed in Table 1 are given in Table 2:
TABLE 2TRANSACTION FACT DATAOperationalSystemTransIDCustIDSalesAmt1Tr1C1$202Tr2C2$501Tr3C3$301Tr4C4$902Tr5C5$40
An analytical query may be placed to determine, for example, a reply to the inquiry, “What is the sum of sales grouped by city and state?” For the data presented in Tables 1 and 2, unresolved aggregation results may be given by Table 3:
TABLE 3UNRESOLVED AGGREGATION RESULTSCityStateSalesAmtSan FranciscoCA$140San JoseCA $90
It can be appreciated by one skilled in the art that it is very common for a particular customer to appear under different guises in different data sources. An entity discovery process can be used to provide an entity mapping, such as shown in Table 4:
TABLE 4ENTITY MAPPINGEntityIDCustIDe1C1e1C2e2C3e3C4e3C5
By inspection, it can be seen that customer C1 from the first data source is most probably the same customer as customer C2 from the second data source. Accordingly, an EntityID label e1 may be used to refer to both customers C1 and C2. Similarly, customer C4 from the first data source is most probably the same customer as customer C5 from the second data source. An EntityID label e3 may be used to refer to both the customers C4 and C5. In the present state of the art, a warehouse administrator may use this information to manually resolve the data by choosing the correct dimensional data for such entities, as presented in Table 5.
TABLE 5POSSIBLE RESOLVED CUSTOMER DATAEntityIDCityStatee1San FranciscoCAe2San FranciscoCAe3San FranciscoCA
An example of a resolved aggregation from Tables 3 and 4, resulting from such an entity discovery and resolution process and performed in accordance with the present state of the art, may provide the result given in Table 6:
TABLE 6POSSIBLE RESOLVED AGGREGATION RESULTCityStateSalesSan FranciscoCA$230
It can be appreciated that the actual dimensional information of an Instance, such as a customer, may be unknown. Although different techniques can be applied in order to identify the same, or equivalent, customers, the problem of assigning the “correct” dimensional information (e.g., address, age, income range) to the corresponding entity is a dynamic task. That is, as more data become available, the Entity Discovery and Resolution continues to change so that the resolved aggregation results change. Accordingly, the result in Table 6 is most likely incorrect as the conventional process producing the result failed to take into account the step of establishing the identity of different Instances as the same customer.
The inconsistencies of the data can disappear at higher hierarchical levels for domains like time, location, customer age or income range. For example, in Tables 5 and 6 it can be seen that the aggregation result for the State CA is correct since the inconsistency at the state level for the given customer data disappears at the state level.
From the above, it is clear that there is a need for a principled and correct way of performing aggregations over inconsistent data.