In a big data service processing process, extract, transform, and load (ETL) is responsible for obtaining data from various data source systems, for example, from a database or a file system. The ETL performs logical processing such as conversion, cleaning, association, and aggregation on the data, and then loads the data to a target system according to a service requirement.
A unified scheduling module of the ETL performs timed scheduling of workflows, and schedules and executes the tasks inside each workflow.
A dependency is implied between tasks in a same workflow, and different workflows and tasks of different workflows also have mutual dependencies. The dependence herein is mainly data dependence. For example, a next task is executed only after the task on which it depends has been processed completely.
A typical service scenario is described below. Procedure 1 is scheduled once each hour. Procedure 2 depends on procedure 1 and is executed only when data conversion and storage of procedure 1 have succeeded for all 24 hours of a day. Procedure 3 depends on procedure 2 and is executed only after procedure 2 is successfully executed. Execution of procedure 4 is directly triggered after procedure 3 is executed completely. The dependence and triggering among the foregoing tasks are essentially data dependence. When data that is depended on changes and needs to be reprocessed, subsequent tasks are affected accordingly, and their execution needs to be restarted in sequence from a specified node.
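The daily dependence of procedure 2 on the hourly procedure 1 can be illustrated with a minimal sketch. The status store, the `"SUCCESS"` value, and the function name `daily_dependency_met` are hypothetical and are introduced only for illustration; they do not describe any particular ETL scheduler.

```python
from datetime import date

# Hypothetical status store: (procedure name, day, hour) -> run status string.
HourlyStatus = dict[tuple[str, date, int], str]


def daily_dependency_met(status: HourlyStatus, upstream: str, day: date) -> bool:
    """Return True only if all 24 hourly runs of `upstream` on `day` succeeded."""
    return all(
        status.get((upstream, day, hour)) == "SUCCESS" for hour in range(24)
    )


day = date(2024, 1, 1)

# Hours 0..22 have succeeded; hour 23 has not been recorded yet.
status: HourlyStatus = {("procedure_1", day, h): "SUCCESS" for h in range(23)}
ready_before = daily_dependency_met(status, "procedure_1", day)

# Once the last hourly run succeeds, procedure 2 may be executed.
status[("procedure_1", day, 23)] = "SUCCESS"
ready_after = daily_dependency_met(status, "procedure_1", day)
```

In this sketch, `ready_before` is false because one hourly run is missing, and `ready_after` is true once all 24 hourly runs are marked successful.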
In an operation and maintenance process, if it is found that original data of two days ago has a problem, the data needs to be retransmitted to a database of procedure 1. In this case, the retransmitted data needs to be reprocessed, and the procedures that depend on the data directly (procedure 2) or indirectly (procedures 3 and 4) need to be rerun. The operations are as follows:
1. The status of the procedure corresponding to the retransmitted data and of each affected task is reset, and execution is then resumed from a specified task of the procedure.
2. A first layer of dependent procedures is found according to the dependence configuration of tasks in the procedure configuration. Multiple procedures may be found. For each procedure found, the corresponding period is determined, the status is reset starting from a specified task of the procedure, and execution is then resumed.
3. For each procedure found in step 2, the next layer of dependent procedures or tasks is searched for. For each procedure found, the corresponding period is determined, the status is reset starting from a specified task of the procedure, and execution is then resumed.
4. Step 3 is repeated until all affected procedures are executed completely.
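The layer-by-layer search in the steps above amounts to a breadth-first traversal of the dependence configuration. The following is a minimal sketch under stated assumptions: the `DEPENDENTS` mapping and the function `affected_procedures` are hypothetical, the graph mirrors procedures 1 to 4 of the scenario, and the determination of periods and the resetting of task statuses are omitted.

```python
from collections import deque

# Hypothetical dependence configuration:
# procedure -> procedures that directly depend on it.
DEPENDENTS = {
    "procedure_1": ["procedure_2"],
    "procedure_2": ["procedure_3"],
    "procedure_3": ["procedure_4"],
    "procedure_4": [],
}


def affected_procedures(start, dependents):
    """Return `start` and every procedure that depends on it, layer by layer."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        proc = queue.popleft()
        order.append(proc)
        # Visit the next layer of dependent procedures, each only once.
        for child in dependents.get(proc, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


rerun_order = affected_procedures("procedure_1", DEPENDENTS)
print(rerun_order)
# prints ['procedure_1', 'procedure_2', 'procedure_3', 'procedure_4']
```

The traversal only enumerates the affected procedures. A scheduler would still need to confirm, before executing each one, that every procedure it depends on has finished rerunning for the corresponding period.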
The foregoing data processing process has the following problem. Data collected from a data source is essentially basic data. Therefore, the basic data or external data representation (XDR) data affects other subsequent procedures to some extent. At a particular domestic site, more than 100 procedures are directly or indirectly affected by the most important procedure data A. In this case, when procedure data A has a problem and is retransmitted, redoing the corresponding tasks can take up an entire day of maintenance work. Consequently, the operation is excessively complex, and maintenance is very difficult. Omissions may also occur during maintenance, causing data results to be inconsistent. Because the dependencies are complex, some tasks may be left without being reset due to carelessness during operations, causing data to be incomplete.