Extract, Transform and Load (ETL) generally refers to a process in database usage that extracts data from a source system, transforms the data into a proper format for querying and/or analysis purposes, and loads the data into the final target system. Data integration and transformation software applications (e.g., SAP Data Services) employ ETL tools to enable users to develop and execute workflows that retrieve data from predefined data sources (e.g., applications, Web services, flat files, databases, etc.), transform that data, and then output the results back to the same or different data sources (i.e., targets).
When using an ETL tool to extract or load data files, a user typically associates the location parameters of the data files with file format objects. Whenever a file location parameter changes, however, the user needs to find and modify all the file format objects that are affected. This is potentially labor-intensive, particularly if a large number of file formats share the same data file location (e.g., one common directory C:\DataFile\).
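The maintenance burden can be sketched as follows. This is a hypothetical illustration, not any ETL product's actual object model: each `FileFormat` object embeds its own copy of the directory, so a single directory move must be propagated to every affected object.

```python
from dataclasses import dataclass

@dataclass
class FileFormat:
    """Hypothetical file format object with its own embedded location."""
    name: str
    root_directory: str  # location parameter duplicated per format
    file_name: str

formats = [
    FileFormat("orders", r"C:\DataFile", "orders.csv"),
    FileFormat("customers", r"C:\DataFile", "customers.csv"),
    FileFormat("invoices", r"C:\DataFile", "invoices.csv"),
]

def move_data_files(formats, old_dir, new_dir):
    """Find and update every file format still pointing at the old directory."""
    changed = 0
    for fmt in formats:
        if fmt.root_directory == old_dir:
            fmt.root_directory = new_dir
            changed += 1
    return changed
```

Because the location is duplicated in each object, moving the common directory requires touching all three formats; a missed one silently keeps reading from the old path.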
Another challenge may be posed by the different file directory syntaxes imposed by different operating systems. For example, the UNIX file directory syntax (e.g., /usr/User1/datafile) differs from the WINDOWS file directory syntax (e.g., C:\Users\User1\datafile). If users want to process data files for the same file format, but on different operating systems, they need to create duplicate file format objects for the different operating systems (e.g., one for the WINDOWS system and another for the UNIX system).
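The syntax difference can be shown with Python's standard `pathlib`, whose `PurePosixPath` and `PureWindowsPath` classes parse either convention on any host. The paths below are the examples from the text; the point is that the logical file is the same even though the directory syntax differs, which is why per-OS duplicate file format objects are pure overhead.

```python
from pathlib import PurePosixPath, PureWindowsPath

unix_path = PurePosixPath("/usr/User1/datafile")
windows_path = PureWindowsPath(r"C:\Users\User1\datafile")

# Different directory syntax, same logical file name.
print(unix_path.name, windows_path.name)  # datafile datafile
print(unix_path.parent)                    # /usr/User1
print(windows_path.parent)                 # C:\Users\User1
```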
Yet another challenge relates to the use of variables in specifying file locations and/or names. When users design a file format in an ETL dataflow, they may not know the exact file location and file name. In such cases, they typically define the file location parameters as variables, which are defined or filled in during runtime of the ETL job. If the ETL job is published as a Web service, an end user could substitute the variables with, for example, the name of an unauthorized location on a system and extract or corrupt the file.
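One common mitigation for this risk is to validate the substituted value against an allow-list of directories before any file access. The sketch below is illustrative only; `ALLOWED_ROOTS` and `resolve_data_file` are hypothetical names, not part of any ETL product's API.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list of directories the job may touch.
ALLOWED_ROOTS = [PurePosixPath("/data/inbound"), PurePosixPath("/data/outbound")]

def resolve_data_file(substituted_path: str) -> PurePosixPath:
    """Check a runtime-substituted path before it is used for file access."""
    candidate = PurePosixPath(substituted_path)
    # Reject relative paths and '..' segments that could escape a root.
    if not candidate.is_absolute() or ".." in candidate.parts:
        raise PermissionError(f"rejected path: {substituted_path}")
    # The path must live under one of the allowed roots.
    if not any(root in candidate.parents for root in ALLOWED_ROOTS):
        raise PermissionError(f"path outside allowed roots: {substituted_path}")
    return candidate
```

A substituted value such as `/etc/passwd` is rejected before the job ever opens the file, while paths under the allowed roots pass through unchanged.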
A further challenge relates to the configuration of file location parameters for remote systems or directories. When an ETL developer designs a file format to extract or load data files, he or she may not know at that time whether the data file location is on a remote system or a local system. If the file location is on a remote system, the ETL developer typically does not know which file access protocol (e.g., file transfer protocol or FTP, secure file transfer protocol or SFTP, secure copy protocol or SCP, FTP over secure sockets layer or FTPS, etc.) will be available in the remote system to access the file.
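One way to defer that decision is a dispatch table keyed by protocol name, so the protocol becomes a deployment-time parameter rather than a design-time one. This is a minimal sketch of the dispatch pattern only; a real implementation would call an actual FTP/SFTP/SCP/FTPS client instead of returning a description string.

```python
def open_remote(protocol: str, host: str, path: str) -> str:
    """Dispatch on the protocol name chosen at deployment time.

    Stand-in handlers: each returns a description of the access that
    would be performed rather than opening a real connection."""
    handlers = {
        "ftp":  lambda: f"FTP {host}:{path}",
        "sftp": lambda: f"SFTP {host}:{path}",
        "scp":  lambda: f"SCP {host}:{path}",
        "ftps": lambda: f"FTPS {host}:{path}",
    }
    try:
        return handlers[protocol.lower()]()
    except KeyError:
        raise ValueError(f"unsupported protocol: {protocol}")
```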
When an ETL job is deployed to a production system, the job may access a remote system using a secure file access protocol to download or upload files for ETL processing. Suppose the secure file access protocol daemon (e.g., the SFTP daemon for the SFTP protocol) in the remote system is down for some technical reason. In this scenario, when the ETL job runs, it will fail because it cannot access the remote file. The ETL administrator may intervene to provide a temporary solution by manually copying files from the remote system to a local system by some other means (e.g., access through a shared file mount). In order to run the job while the remote server's file access daemon is down, the ETL administrator also needs to change the ETL job definition to disable access of the file from the remote system and instead read the file from the local system. Changing the job definition to accommodate such a temporary solution is generally inefficient: it requires approval from many stakeholders, which is difficult to obtain in a timely manner.
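The manual workaround described above amounts to a fallback: try the remote protocol first and, if the daemon is unreachable, read the manually staged local copy instead. The sketch below is hypothetical; `fetch_remote` and `fetch_local` are stand-ins for a real SFTP download and a local file read, and here the remote fetch always fails to simulate the daemon being down.

```python
def fetch_remote(path: str) -> bytes:
    """Stand-in for an SFTP download; simulates the daemon being down."""
    raise ConnectionError("SFTP daemon unreachable")

def fetch_local(path: str) -> bytes:
    """Stand-in for reading the copy an administrator staged manually."""
    return b"staged copy"

def read_data_file(remote_path: str, local_fallback: str) -> bytes:
    """Prefer the remote file; fall back to the local copy if remote access fails."""
    try:
        return fetch_remote(remote_path)
    except ConnectionError:
        return fetch_local(local_fallback)
```

Encoding this fallback in the job itself, rather than editing the job definition by hand, is precisely what avoids the stakeholder-approval cycle described above.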