Large amounts of data are often managed through a data processing system. The data processing system may be implemented on one or more computers coupled to one or more data sources. The computers may be configured to perform specific data processing operations through use of software programs. Those software programs may be heterogeneous, written in multiple programming languages.
As an example, a university may process student registrations with a data processing system that receives as input data from students as they request classes. That data may include a student name or identification number, a course title or number, a professor and other information about the student or the course. The processing system may be programmed to process each request by accessing various data sources, such as a data store containing information about each enrolled student. The data processing system may access this data store to determine that the request to enroll in a class is made by a student in good standing. Another data store may contain information about the class, including prerequisite classes. The data processing system may access this data, in combination with data from another data store indicating classes already completed by the student requesting a class, to make a determination of whether the student is qualified to enroll in the requested class.
By processing the accessed data in accordance with a program, the data processing system may generate data indicating that the student is enrolled in the requested class and that the class roster includes that student. When the data processing system completes processing of the class registration request, the data processing system may have accessed many data sources and may have entered or changed data in multiple data stores.
A similar pattern occurs in many other important applications, with the data processing system accessing multiple data sources and generating or changing values stored in other data stores based on data accessed from the data sources or based on the values of other variables. Accordingly, many values of variables may depend on the values of other variables, whether those other values are accessed from a data source or exist within the data processing system.
It is often useful for a data processing system to provide dependency information about the data elements generated or modified by the data processing system. Ab Initio of Lexington, Mass., USA provides a “cooperating system” that provides dependency information based on programs created to execute on the cooperating system. The cooperating system executes programs expressed as data flow graphs, represented as operators and data flows between operators. A tool may analyze the graph and determine dependencies among variables by identifying an operator in which the value of a variable is set based on the values of one or more other variables. By tracing back the flows into the operator, the tool may identify other variables on which the variables input to the operator in turn depend. By tracing through the graph, the tool may identify all the dependencies, whether direct or indirect, for any variable.
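The trace-back described above can be illustrated with a minimal sketch. It is not the cooperating system's implementation. It assumes the graph has already been reduced to a hypothetical map from each variable to the variables it is directly set from within some operator. The variable names are borrowed from the registration example and are illustrative only.

```python
from collections import deque

# Hypothetical direct dependencies (an assumption for illustration):
# each variable maps to the variables its value is set from in an operator.
DIRECT_DEPS = {
    "enrolled": {"in_good_standing", "is_qualified"},
    "is_qualified": {"prerequisites", "completed_classes"},
    "in_good_standing": {"student_record"},
}


def all_dependencies(variable, direct_deps):
    """Return every variable that `variable` depends on, directly or indirectly."""
    seen = set()
    # Start from the direct inputs and walk the flows backward.
    queue = deque(direct_deps.get(variable, ()))
    while queue:
        upstream = queue.popleft()
        if upstream in seen:
            continue  # already traced; also guards against cycles
        seen.add(upstream)
        # Trace back into whatever this upstream variable depends on.
        queue.extend(direct_deps.get(upstream, ()))
    return seen


if __name__ == "__main__":
    print(sorted(all_dependencies("enrolled", DIRECT_DEPS)))
```

In this sketch, a variable with no entry in the map is treated as a source value that depends on nothing else.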
This dependency information may be entered into a metadata store from which it can be used for many functions involving the graph. For example, data quality rules may be applied to identify variables with unexpected patterns of values. Those variables, as well as any variable within the graph that has a dependency on those variables, may be flagged as suspicious. As another example, if a portion of the graph needs to be changed to provide new functionality, correct an error, or remove functionality that is no longer desired, other portions of the graph that depend on the changed portion, through the variables generated or modified in that portion, may be readily identified. Following the change, those portions of the graph may be scrutinized to ensure that their function is not compromised by changes made elsewhere in the graph.
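Both uses described above amount to following dependencies in the forward direction: from a suspicious or changed variable to everything that depends on it. The following is a minimal sketch under stated assumptions, not a description of any actual metadata store. The stored dependency information is modeled as a hypothetical dictionary from each variable to its direct inputs, and the function and variable names are illustrative.

```python
def dependents_of(changed, direct_deps):
    """Return every variable that depends, directly or indirectly,
    on any variable in `changed`."""
    # Invert the stored map: input variable -> variables computed from it.
    downstream = {}
    for variable, inputs in direct_deps.items():
        for source in inputs:
            downstream.setdefault(source, set()).add(variable)

    flagged = set()
    stack = list(changed)
    while stack:
        current = stack.pop()
        for dependent in downstream.get(current, ()):
            if dependent not in flagged:
                flagged.add(dependent)
                stack.append(dependent)
    return flagged


if __name__ == "__main__":
    # Hypothetical stored dependency information (assumed for illustration).
    stored_deps = {
        "enrolled": {"in_good_standing", "is_qualified"},
        "is_qualified": {"prerequisites", "completed_classes"},
        "in_good_standing": {"student_record"},
    }
    # Variables to flag if "prerequisites" fails a data quality rule
    # or the processing that produces it is changed.
    print(sorted(dependents_of({"prerequisites"}, stored_deps)))
```

The same function serves both cases: pass in the variables that failed a data quality rule to find what else to flag, or pass in the variables written by the portion being changed to find what to re-examine afterward.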
In some instances, a data processing system may use other programming languages instead of or in addition to a graph. For example, a data source may be a database that is programmed in SQL or in a programming language defined by the supplier of the database to enable a user to program queries to be run against that database. In a data processing system as might be implemented in a large company, there may be multiple programs written in multiple languages instead of or in addition to a graphical programming language. Dependency analysis tools that operate on programs expressed as graphs do not process programs written in these other languages. As a result, dependency information for the entire data processing system may be incomplete or may require significant effort to generate.