Large collections of data may be used in complex ways. For example, collections of data, such as in files, databases, and other data storage means, may be opened, queried, or used as part of a long string of activities, with different transformative activities occurring to the data, and with resulting data then stored.
For example, FIG. 1 is a block diagram of a data flow with two data inputs and two data outputs. As shown in FIG. 1, a file A and a database table A are both opened. Data is read from the file A 1000 and data is read from the database table A 1010. A union 1020 is performed on these two sets of data. The result of the union is examined to determine (box 1030), for each record in the union of data, whether the value stored in an associated age field is less than fifty. Records for which the associated age field is less than fifty are aggregated by gender 1040 and stored in a database table B 1050. Records for which the associated age field is not less than fifty are stored in a file B 1060.
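The data flow of FIG. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of any particular ETL tool: the function name `run_flow`, the representation of records as dictionaries with `age` and `gender` keys, the treatment of the union as a simple concatenation, and the choice of a per-gender record count as the aggregation are all hypothetical.

```python
from collections import defaultdict


def run_flow(file_a_records, table_a_records):
    """Sketch of the FIG. 1 data flow (hypothetical names and record layout)."""
    # Union 1020: combine the records read from file A and database table A.
    # Assumption: the union keeps duplicates (a plain concatenation).
    union = list(file_a_records) + list(table_a_records)

    # Box 1030: split the records on whether the age field is less than fifty.
    under_fifty = [r for r in union if r["age"] < 50]
    fifty_or_over = [r for r in union if not r["age"] < 50]

    # Aggregation 1040: group by gender.
    # Assumption: the aggregate is a record count per gender.
    counts = defaultdict(int)
    for record in under_fifty:
        counts[record["gender"]] += 1

    table_b = dict(counts)   # stands in for database table B 1050
    file_b = fifty_or_over   # stands in for file B 1060
    return table_b, file_b
```

In this sketch the two return values stand in for the two data outputs; a real data flow would write them to the database table B and the file B rather than return them.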
In order to allow for the use of large complex collections of data, ETL (Extract Transform Load) tools have been developed. These tools provide an automated way to perform operations using collections of data. ETL tools automate three tasks: extracting data (taking data from a data source), transforming data (utilizing the extracted data), and loading data (storing the result of the transformation for later use). For example, the actions shown in FIG. 1 may be performed by an ETL tool.
In order to allow the easy use of such ETL functionality and expand the functionality available, design tools which allow the visual design of processes which use files or other data collections have been developed. One such design tool is known as Data Transformation Services (DTS), available from Microsoft Corporation. DTS allows a user to visually design processes by which data in files, databases, or other data collections can be used. The operations in the processes designed by DTS may include but are not limited to those available through standard ETL tools. For example, a DTS-designed data flow may allow a user to specify that certain files will be deleted, other files obtained (e.g. by file transfer protocol (FTP) from a designated source), and that a specific ETL process will then be performed on each file so obtained.
The data flow designed by an ETL tool or a design tool such as DTS is designed in advance of its use. This can lead to ambiguities when the data flow is used. For example, a data flow may be designed to open a data source and, for each record in the data source, read information in a specific column A and column C. However, at run-time, upon opening the data source, it may be that the data source contains, for each record, information in a column A, column B, column C and column D.
It may be that the designer of the data flow knew that column B would be included in the data source. If so, a design choice may have been made not to read information from column B, in order to minimize the time and other computational costs for doing so. Thus, asking the user at run-time whether column B should be included could cause unnecessary confusion and delay.
However, it may be that the designer of the data flow did not know that column D would be included in the data source, and that the user of the data flow would find column D useful to include in the data flow. Thus, asking the user at run-time whether column D should be included would be useful.
However, there is no way to distinguish between situations in which data was intentionally not included from a data source, and situations in which the data source has changed. Thus, either unnecessary questions are put to the user at run-time, or useful data may be lost.
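The ambiguity described above can be made concrete with a short Python sketch. The helper `unread_columns` and the list-of-column-names representation are hypothetical and are used only for illustration. At run-time, the only information available is which columns the data flow reads and which columns the data source contains.

```python
def unread_columns(designed_columns, source_columns):
    """Return the source columns that the data flow does not read (hypothetical helper)."""
    designed = set(designed_columns)
    return [c for c in source_columns if c not in designed]


designed = ["A", "C"]                 # columns the data flow was designed to read
at_run_time = ["A", "B", "C", "D"]    # columns found when the data source is opened

extra = unread_columns(designed, at_run_time)
# extra contains both "B" and "D". Nothing in the design records that "B" was
# skipped on purpose while "D" was unknown to the designer, so the two cases
# look identical at run-time.
```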
Additionally, other changes may be made to a data collection. For example, the type of column A may have been changed from what is expected. This may or may not be compatible with the operations designed for column A in the data flow. Some changes in data type may allow operations to proceed successfully but with unexpected results. However, there is no way to tell whether the change was anticipated, or whether it was not. Again, the user is either consulted on data type incompatibilities, even in cases in which the change was anticipated, or the user is not consulted, which may allow problems to develop.
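As a small illustration of an operation that proceeds successfully but with unexpected results, consider the following Python sketch. The function `double_column_a` and the dictionary record layout are hypothetical, and Python is used only because its `*` operator happens to accept both integers and strings.

```python
def double_column_a(records):
    """Double the value in column A of each record (hypothetical operation)."""
    return [record["A"] * 2 for record in records]


# Column A holds integers, as the data flow was designed to expect.
as_designed = double_column_a([{"A": 21}])     # [42]

# Column A has been changed to text. No error is raised, but the operation
# now repeats the string instead of doubling a number.
after_change = double_column_a([{"A": "21"}])  # ['2121']
```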
Thus, there is a need for a system and method to overcome these deficits in the prior art. The present invention addresses the aforementioned needs and solves them with additional advantages as expressed herein.