The total volume of data available in a business today is huge and increasing daily. Businesses need to integrate data from sources to target applications through the use of software developed for this purpose. The process for developing such software involves the use of software development tools along with manual entry of requirements into spreadsheets. The software is often referred to as ETL (Extract, Transform, and Load) software, while the processes performed by the software are known as data integration processes.
Spreadsheets containing the requirements, once completed, are then given to an ETL developer for the design and development of maps, graphs, and source code. However, this current process of documenting requirements from source systems in spreadsheets and then mapping these requirements into ETL software, also called herein a data integration package, is time-consuming and prone to error.
For example, it takes a considerable amount of time to copy source metadata from source systems into a spreadsheet. The same source information must then be re-keyed into an ETL development tool. Source metadata captured in a spreadsheet is largely non-reusable unless a highly manual review and maintenance process is instituted.
Source-to-target mappings with transformation requirements contain valuable navigational metadata that can be used for data lineage analysis. Capturing this information in a spreadsheet does not provide a clean, automated method of retaining it for such analysis.
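By way of illustration only, the following minimal sketch shows how source-to-target mappings, if held as structured records rather than spreadsheet cells, could support a simple lineage query. The record layout, the column names, and the function upstream_sources are hypothetical and are not taken from any tool described herein; the sketch also assumes the mappings contain no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One source-to-target mapping row (hypothetical layout)."""
    source: str      # fully qualified source column
    target: str      # fully qualified target column
    transform: str   # free-text transformation requirement


# Hypothetical example mappings; names are illustrative only.
MAPPINGS = [
    Mapping("crm.customer.first_name", "stage.customer.full_name",
            "concatenate with last_name"),
    Mapping("crm.customer.last_name", "stage.customer.full_name",
            "concatenate with first_name"),
    Mapping("stage.customer.full_name", "dw.dim_customer.customer_name",
            "trim and upper-case"),
]


def upstream_sources(target, mappings):
    """Return the set of original source columns that feed `target`.

    A column with no mapping pointing at it is treated as an original
    source. Assumes the mapping graph is acyclic.
    """
    direct = [m.source for m in mappings if m.target == target]
    if not direct:
        return {target}
    result = set()
    for src in direct:
        result |= upstream_sources(src, mappings)
    return result
```

Calling upstream_sources("dw.dim_customer.customer_name", MAPPINGS) walks back through the staging column to the two CRM name columns. A spreadsheet holding the same rows would need manual inspection to answer the same question.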
Despite best efforts, manual data entry often results in incorrect entries. For example, incorrectly documenting an INT (integer) data type as a VARCHAR in a spreadsheet requires an ETL software developer to take time to analyze and correct the error.
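As a hedged illustration of this kind of error, the sketch below compares data types as documented by hand against data types as recorded in the source system. Both dictionaries and the function find_type_mismatches are hypothetical stand-ins; in practice the catalog entries would be read from the source system's metadata rather than typed in.

```python
# Hypothetical source-system metadata: column name -> actual data type.
SOURCE_CATALOG = {
    "orders.order_id": "INT",
    "orders.customer_name": "VARCHAR",
}

# Hypothetical hand-entered spreadsheet rows: column name -> documented type.
DOCUMENTED = {
    "orders.order_id": "VARCHAR",      # keyed incorrectly; should be INT
    "orders.customer_name": "VARCHAR",
}


def find_type_mismatches(documented, catalog):
    """Return (column, documented_type, actual_type) for each disagreement.

    A column missing from the catalog is reported with actual_type None.
    Type names are compared case-insensitively.
    """
    mismatches = []
    for column, doc_type in sorted(documented.items()):
        actual = catalog.get(column)
        if actual is None:
            mismatches.append((column, doc_type, None))
        elif actual.upper() != doc_type.upper():
            mismatches.append((column, doc_type, actual))
    return mismatches
```

With the sample data above, the check reports only orders.order_id, documented as VARCHAR but defined as INT in the source, which is the sort of discrepancy a developer would otherwise have to discover during construction.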
Data analysts who perform the source-to-target mappings manually tend to capture source, transform, and target requirements at different levels of completeness. Without a standard approach to the requirements and design of the data integration process, development staff will misinterpret the coding requirements found in the spreadsheet source-to-target mapping documents, resulting in coding errors and lost time.
These and other shortcomings of current methods result because a typical data integration software development project requires the steps of analysis, design, and development (also referred to in the art as construction). Current tools for data integration are intended for use in the construction step only and do not address the analysis and design steps.
Improved methods are therefore needed to achieve better results in the art of data integration software development.
The traditional solution to integrating multiple data sources is to create a new data warehouse and copy the data from the original diverse data sources to the warehouse. This solution does not adapt readily to dynamic changes in the data sources. Brichta et al., for example, in U.S. Pat. No. 5,884,310 describe such a solution in which a common database server has a load engine. A plurality of source systems, each having an extraction engine and a transformation engine, have source databases that store data in disparate formats and file structures. The common database server load engine loads the disparate data, which has been extracted and transformed, into a common database, after which it may be provided to one or more target client systems.
Fagin et al. in U.S. Patent Application US 2004/0199905 describe a system and method for translating data from a source schema to a target schema. User inputs define a set of correspondences between the source schema and the target schema using an interpretation process of semantic translation.
Hamala et al., in U.S. Pat. No. 5,345,586 describe solving the problem of integrating multiple data sources by using a global data directory which maps the location of data, specific data entity attributes, and data source parameters.
Chen et al., in U.S. Patent Application US 2003/0149586 describe processing information for root cause analysis, including structured and unstructured data. The unstructured data is converted into a second structured format. The two sets of structured data are collected and stored in memory.
Gupta et al., in U.S. Pat. No. 6,513,059 and Mitchell in U.S. Patent Application US 2003/0195853 both mention data integration but do not address the aforementioned shortcomings of present data integration methods.
All of the above U.S. Patents and Patent Applications by Brichta, Fagin, Gupta, Mitchell, Hamala, and Chen shall be incorporated herein by reference in their entirety for any purpose.
It is believed that improved methods for performing data integration software development would constitute a significant advancement in the art.