1. Technical Field
Present invention embodiments relate to data integration, and more specifically, to design analysis of data integration jobs.
2. Discussion of the Related Art
Data integration is a complex activity that affects every part of an organization. Today, organizations face a wide range of information-related challenges: varied and often unknown data quality problems, disputes over the meaning and context of information, managing multiple complex transformations, leveraging existing integration processes rather than duplicating effort, ever-increasing quantities of data, shrinking processing windows, and the growing need for monitoring and security to ensure compliance with national and international law.
Current data integration platforms lack analytical tools that give end users information across an extract-transform-load (ETL) system to address a range of issues, including the following examples:
Issue 1: After upgrading the product from an earlier release to the latest version, users may wish to know whether any item listed in the release note and/or the technical note has an impact on the upgrade.
Issue 2: After installing a software patch, alternatively referred to herein as a fix pack, users may note a change in behavior due to some defect fixes. Accordingly, the users may want to know whether this change impacts the current process environment and, if so, how extensive the impact may be.
Issue 3: After installing the product on a new server with a newer C++ compiler, and determining, for example, that columns containing floating point values do not return expected values, the user may want to know how many ETL jobs are affected by this change.
Issue 4: When a large number of ETL jobs are present in the system, e.g., over 20,000 jobs, the user may want to know how many of those jobs would be affected if an environment variable were removed or updated.
Issue 5: In an environment with numerous ETL developers building different ETL applications for different lines of business, the user may want to know how to enforce coding standards, naming conventions, and common design patterns.
Issue 6: When a design defect is known to exist in a job template used by many developers for different needs, the user may want to know which jobs derived from the template, if any, contain the defect.
Currently, to address issues 1-4, the user would have to run all the tests, analyze the test results, and review each failed test. This approach works when only a few jobs fail, but is impractical when a large number of jobs fail. The approach does not work at all if a job executes successfully but produces incorrect data. This is especially problematic if the incorrect data are consumed by other jobs. When this happens, the user has to trace back job by job and stage by stage to find the root cause.
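The job-by-job trace described above can be sketched as a breadth-first walk over job dependencies. This is an illustrative sketch only, not a feature of any data integration platform: the `UPSTREAM` map, the job names, and the `trace_upstream` function are all hypothetical, and the sketch assumes the user has already recorded which jobs consume the output of which other jobs.

```python
from collections import deque

# Hypothetical dependency map: each job -> the jobs whose output it consumes.
UPSTREAM = {
    "load_warehouse": ["join_orders"],
    "join_orders": ["extract_orders", "extract_customers"],
    "extract_orders": [],
    "extract_customers": [],
}


def trace_upstream(job, upstream):
    """Return every job that feeds `job`, directly or indirectly,
    in breadth-first order (nearest producers first)."""
    seen = set()
    order = []
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Jobs to inspect when "load_warehouse" produces incorrect data.
print(trace_upstream("load_warehouse", UPSTREAM))
```

Even with such a map in hand, each job returned still has to be inspected stage by stage, which is why the manual approach scales poorly as the number of jobs grows.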
A common solution adopted by many users for issue 5 is to pay for consulting services to review job designs, share best practices, construct project structures, and develop job templates.
For issue 6, some users keep notes on which template is used by which job design. Some users rely on source control techniques to track the history of a job so that the root from which a particular job evolved can be located. Some users add annotations to the job design, and can correlate one job to another only by reading and analyzing those annotations. Again, these approaches are tedious and inefficient, and become almost impractical when a large number of jobs are involved.