1. Technical Field
Present invention embodiments relate to determining relationships between data by analyzing application behavior, and more specifically, to accurately identifying data relationships by analyzing run time behavior of applications.
2. Discussion of the Related Art
In order to improve the discovery of data relationships, various automated data discovery tools, such as IBM INFOSPHERE Discovery, have been developed. Automating the discovery of data relationships within and across heterogeneous systems allows a user to create a complete 360-degree view of various data assets. Automating this process may also reduce analysis time by up to 90%, improve accuracy, provide higher levels of visibility into potential data problems, and provide insight into business objects and transformation rules, which may speed time to value for critical initiatives, among other advantages. However, many current approaches to automatically discovering data relationships require extensive amounts of time and computing resources, especially when a vast data set is being analyzed, and require significant user input in order to refine the accuracy of detected relationships.
More specifically, the work done by automated data discovery tools in examining data and detecting relationships is generally referred to as the discovery process and includes two phases: analysis and mapping. The analysis phase of the discovery process involves identifying data types, discovering the relationships within each data set (source and target), and using the identified data types and discovered relationships to discover the relationships between the source and target data sets. However, as relationships are discovered, most analysis tools require an analyst to review the discovered results to ensure their accuracy. Typically, the analyst determines which results would benefit from refinement or re-discovery and selects the most appropriate options to use in subsequent actions. Usually, the best results are obtained when an analyst iteratively reviews the discovered results and approves only the most accurate relationships before proceeding.
In order to perform the aforementioned tasks, critical information must be obtained from subject matter experts (“SMEs”) to verify and validate the relationships discovered by the analysis tool. Specifically, since using data analysis tools to automatically derive relationships gives rise to a great number of false positives, SME validation is required to approve and filter the results. However, in many cases the required SMEs might not exist, rendering validation impossible or at least impractical. For example, consider a corporate system that includes a number of data tables, including an employee table (with details relating to an employee's role in the corporation), an employee detail table (with details relating to the employee's personal information), a department table (with details relating to the employee's department), a manager table (with details relating to the employee's manager), and a store table (with details relating to the specific store at which the employee works). In such a situation, an automatic discovery tool may incorrectly associate the employee's social security number with an employee and/or manager number, incorrectly associate the store number with the manager's and employee's zip codes, and incorrectly associate different phone numbers with each other simply because the form of the data is similar or because the data matches some other methodology followed by the analysis tool. Without the requisite SMEs, these incorrect associations cannot be detected by the analysis tool and, thus, an analyst would be required to remove these incorrect relationships in order to ensure accurate discovery.
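The kind of false positive described above can be illustrated with a minimal sketch. It assumes a purely hypothetical discovery heuristic that associates two columns whenever their values share the same character format; the column names, sample values, and the functions `format_signature`, `column_signature`, and `format_based_candidates` are illustrative assumptions and do not represent the method of any particular discovery tool.

```python
import re


def format_signature(value):
    """Reduce a value to its shape: digits become '9', letters become 'A'."""
    signature = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", signature)


def column_signature(values):
    """Collect the set of shapes observed in a column."""
    return frozenset(format_signature(v) for v in values)


def format_based_candidates(columns):
    """Pair up columns whose values share exactly the same set of shapes."""
    names = sorted(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if column_signature(columns[first]) == column_signature(columns[second]):
                pairs.append((first, second))
    return pairs


# Hypothetical sample data; none of these columns is actually related.
columns = {
    "employee.zip_code": ["10001", "94105", "60601"],
    "store.store_number": ["00412", "00977", "01350"],
    "employee_detail.home_phone": ["555-0101", "555-0102"],
    "manager.office_phone": ["555-0199", "555-0142"],
}

candidates = format_based_candidates(columns)
for first, second in candidates:
    print(f"candidate relationship: {first} <-> {second}")
```

Under these assumptions, the sketch pairs the zip code column with the store number column and the two phone number columns with each other, even though no real relationship exists. Rejecting such pairs requires knowledge of what the columns mean, which is the role the SME plays.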
Moreover, an automatic data analysis tool typically must compare each and every piece or segment of data in one data set with every piece or segment of data in another data set in order to determine where or if relationships exist. Thus, analyzing large data sets may take significantly more time due to the intensive nature of the work that must be performed. For example, if a data tool were used to determine any relationships between two two-column tables, each row from the first table may be compared to each row from the second table, similar to a database cross join, to derive potential relationships between the two tables. Thus, comparing even a two column, five row table with a two column, six row table may require at least 30 operations just to begin to determine relationships in the data. As the complexity of the analysis increases (e.g., by increasing the number of columns, tables, and/or records to be analyzed), resource consumption associated with the analysis increases exponentially. Consequently, using actual physical data analysis to derive potential relationships between various tables defined at the application level can be extremely time consuming and resource intensive.
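The cost of the cross-join style comparison in the example above can be made concrete with a short sketch. The table contents, the function name `count_row_comparisons`, and the rule that a shared value marks a candidate relationship are assumptions chosen for illustration only.

```python
from itertools import product


def count_row_comparisons(source_rows, target_rows):
    """Compare every source row with every target row, as a cross join would."""
    comparisons = 0
    candidate_matches = []
    for source_row, target_row in product(source_rows, target_rows):
        comparisons += 1
        # Assumed rule: any value shared by both rows marks a candidate relationship.
        shared_values = set(source_row) & set(target_row)
        if shared_values:
            candidate_matches.append((source_row, target_row, shared_values))
    return comparisons, candidate_matches


# Hypothetical tables: two columns each, five rows and six rows respectively.
source = [(i, f"name{i}") for i in range(1, 6)]
target = [(i, f"dept{i}") for i in range(1, 7)]

comparisons, matches = count_row_comparisons(source, target)
print(comparisons)  # 5 rows x 6 rows = 30 row-pair comparisons
```

The number of row-pair comparisons is the product of the two row counts, so two tables of 10,000 rows each would already require 100,000,000 comparisons under this approach, before any additional columns or tables are considered.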
The mapping phase of the discovery process uses the information discovered as a result of the analysis phase to discover the joins, bindings, and transformations that correctly derive the target data from the source data. Again, many data analysis tools may automatically complete this phase; however, in order to achieve the highest possible accuracy, an analyst reviews and analyzes the results and runs additional discovery and refinement steps (e.g., filters or aggregation) to obtain the most accurate and complete transformations.