The present invention relates generally to data management systems and in particular to ETL (extract, transform, load) tools for managing multiple data sources.
Running a successful business requires answers to questions such as how well the business is functioning compared to others, where are the trouble spots in the organization, where can costs be reduced, how are various components of the business performing relative to each other, and so on. The answers to these questions (and others) may lie buried within the data collected by the organization. Techniques for answering such questions typically include culling together the data and producing various reports to provide useful summaries and to reveal trends. Statistical tools and other analytical techniques can be applied to the data to give meaning to what might otherwise appear to be unrelated information. An important aspect of data management is data mining. This is a sophisticated technique which uses intelligent software such as expert systems to detect patterns in the data that other techniques may not detect.
In any large enterprise, however, data is likely to be accumulated in all areas of the enterprise and usually in many incompatible formats. To effectively analyze the data, it is important that there be some degree of uniformity in the data. A good data warehouse or data mart with “clean” data is likely to produce much better results than “raw” data copied from various sources.
A solution usually employed is to extract the data from its various sources, transform it into a consistent format, and then load it into a single database. The database is commonly referred to as a data warehouse or a data mart. Many systems exist that provide this service. SAS and Informatica are two examples that can handle very large amounts of data.
The software tools for doing this are called ETL (extract, transform, load) tools. An ETL tool basically pulls the data from various sources, manipulates it, and prepares it for loading into a data warehouse. The extraction process accesses data through a process of record selection from one or more source databases. The extraction process writes the records to an operational data store for further processing. Once the data is “extracted,” it is “transformed” during a transformation process. Some transformation processes include: filtering, where only certain records are kept based on some criteria, e.g., records with certain data values or ranges; summarizing, where two or more records might be combined into a summary record; merging, where two or more “source” records might be merged into a “target” output record; transposing, where information in a record might be converted from one form to another; derivations, data in a record may be combined by mathematical computations to produce a different set of data. Typically, the transformation includes a “cleansing” operation on the transformed data. This process ensures that the data is consistent, of a known recognized value, and otherwise is consistent with a desired data model. The final process in an ETL tool is the loading of the transformed data into the data warehouse or data mart.
The popularization of the Internet has spurned the globalization of businesses. As the infrastructure of the internet continues to develop it becomes easier and more convenient to distribute parts of an enterprise anywhere in the world and to connect the components of the enterprise via the internet. However, movement of large amounts of data over a network can be slow due to heavy network traffic. Data transfers in a local area network (LAN) can be slow since the LAN is used by everyone in the local area. Similarly, wide area networks (WAN) can exhibit slowness, again due to heavy usage, but also because of the switching that is needed to route the data to its destination. This can adversely affect the efficiency of ETL tools, since typically large amounts of data must be gathered from many sources. The problem is amplified if these large amounts of data have to be moved from locations all over the globe.
Given the importance of data warehousing that ETL tools provide, there is a need to improve the performance of ETL tools in today's increasingly distributed data environment.