When designing integration flow plans such as extract-transform-load (“ETL”) processes, two objectives that are typically considered are correct functionality and adequate performance. Functional mappings from operational data sources to a data warehouse should be correct and an ETL process should complete within a certain time window. However, two other objectives that also may be considered by integration flow plan designers are fault tolerance (also referred to as “recoverability”) and freshness. Fault tolerance relates to the number of failures that an integration plan can tolerate and still complete within a performance time window. Freshness relates to the latency between the occurrence of a business event at a source system and the reflection of that event in the target system (e.g., a data warehouse).
An integration flow plan should be fault-tolerant and yet still satisfy a freshness requirement to finish within a specified time window. One strategy that may be employed to make an integration flow plan fault tolerant is to repeat an integration flow plan in the event of a failure. However, repeating the entire integration flow plan may not be feasible if the dataset is large or the time window is short. Another way to make integration flow plans fault tolerant is by adding recovery points. A recovery point is a checkpoint of the integration flow plan state and/or a dataset snapshot at a fixed point in the flow. If a recovery point is placed at an operator, as a dataset is output from the operator, the integration flow plan state and/or the dataset may be copied to disk. If a failure occurs, flow control may return to this recovery point, the state and/or dataset may be recovered, and the integration flow plan may resume normally from that point. This may be faster than restarting the entire integration flow plan since operators prior to the recovery point are not repeated.
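The recovery-point mechanism described above can be illustrated with a minimal sketch. It is not taken from any particular ETL engine: the names `run_with_recovery`, `make_flaky`, and `demo`, the use of `pickle` for the dataset snapshot, and the simulated one-time failure are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def run_with_recovery(operators, dataset, recovery_points, checkpoint_path,
                      max_retries=3):
    """Run operators in order; after a failure, resume from the last recovery point.

    recovery_points is a set of operator indices; the output of each listed
    operator is copied to disk. Returns (result, operator_executions).
    """
    start = 0          # index of the first operator to run on the next attempt
    executions = 0     # total operator invocations, including repeated ones
    for _attempt in range(max_retries + 1):
        if start == 0:
            data = dataset
        else:
            # Recover the dataset snapshot written at the last recovery point.
            with open(checkpoint_path, "rb") as f:
                data = pickle.load(f)
        try:
            for i in range(start, len(operators)):
                executions += 1
                data = operators[i](data)
                if i in recovery_points:
                    with open(checkpoint_path, "wb") as f:
                        pickle.dump(data, f)
                    start = i + 1
            return data, executions
        except RuntimeError:
            continue   # hypothetical failure signal: retry from `start`
    raise RuntimeError("flow did not complete within max_retries")


def make_flaky(fn):
    """Wrap fn so that its first call fails (a simulated transient fault)."""
    state = {"failed": False}

    def wrapper(rows):
        if not state["failed"]:
            state["failed"] = True
            raise RuntimeError("simulated transient failure")
        return fn(rows)

    return wrapper


def demo():
    """Run the same three-operator flow with and without a recovery point."""
    def build_flow():
        return [
            lambda rows: [r * 2 for r in rows],    # transform
            lambda rows: [r for r in rows if r > 2],  # filter
            make_flaky(sum),                       # aggregate, fails once
        ]

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "recovery_point.pkl")
        with_rp = run_with_recovery(build_flow(), [1, 2, 3], {1}, path)
        without_rp = run_with_recovery(build_flow(), [1, 2, 3], set(), path)
    return with_rp, without_rp


if __name__ == "__main__":
    print(demo())
```

In this sketch both runs produce the same result, but the run with a recovery point after the second operator repeats only the failed operator, whereas the run without one repeats the whole flow. The disk write at the recovery point is the overhead that the run without recovery points avoids.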
However, recovery points may carry a cost. Inserting a recovery point may itself have a cost, and maintaining it may include recording state data and a dataset to disk, which adds disk I/O overhead. Thus it may not be feasible to place recovery points after every operation in an integration flow plan. Accordingly, a designer may be required to decide where to insert recovery points in an integration flow plan.
Currently, this issue may be addressed largely based on the experience of the designer, e.g., one designer might place recovery points after every long-running operator. However, with complex flows and competing objectives there may be an enormous number of choices, and so a design produced by a designer may not be optimal.
An exemplary approach is to formulate the placement of recovery points as an optimization problem where the goal is to obtain the best performance when there is no failure and the fastest average recovery time in the event of a failure. Given an integration flow plan with n operators, there are n−1 possible recovery points. Any non-empty subset of these n−1 recovery points is a candidate solution. Therefore, the search space is given by the total number of combinations of these n−1 recovery points: $\text{totalRP} = 2^{n-1} - 1$
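As a quick check on the size of this search space, the following sketch enumerates the candidate placements for a small flow and compares the count with the closed-form expression. The function names `candidate_placements` and `total_rp`, and the convention that position i means "after operator i", are assumptions introduced for illustration only.

```python
from itertools import combinations


def candidate_placements(n_operators):
    """Yield every non-empty subset of the n-1 possible recovery points."""
    positions = range(1, n_operators)  # position i: after operator i
    for k in range(1, n_operators):    # subset sizes 1 .. n-1
        for subset in combinations(positions, k):
            yield subset


def total_rp(n_operators):
    """Closed-form size of the search space: 2^(n-1) - 1."""
    return 2 ** (n_operators - 1) - 1


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        enumerated = sum(1 for _ in candidate_placements(n))
        print(n, enumerated, total_rp(n))
```

Each additional operator roughly doubles the number of candidates, which is why exhaustive enumeration of this kind is only workable for small flows.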
The cost of searching this space may be exponential in the number of operators, i.e., $O(2^n)$. The search space may be even larger if other strategies for fault tolerance are to be considered. In addition, the integration flow plan design may have other objectives that must be considered, such as freshness, cost, and storage space. There also may be additional strategies to consider for improving performance, such as parallelism. These considerations may expand the search space to a size that is impracticable for a designer to search manually.