Some existing systems provide monitoring during execution of workflows on distributed nodes. For example, some systems permit monitoring during disaster recovery of virtual machines (VMs) operating in cloud environments. In some systems, monitoring the implementation of workflows across multiple management nodes is performed through a special workflow monitoring user interface (UI).
However, existing methods do not scale out effectively. For example, a recovery workflow may contain 5 to 10 tasks for each VM. Consequently, a recovery workflow for a recovery plan with 1,000 VMs would contain 5,000 to 10,000 tasks, and each task produces periodic progress updates and a final succeeded/failed status update. In order to scale out, some cloud services use an eventually consistent database to persist data. But this kind of database lacks support for condition-based queries of the kind provided by structured query language (SQL) relational databases. For example, if a workflow has 5,000 tasks, the database table has 5,000 rows. Each row stores the name, progress, start time, etc., for one task. Since it is not possible to query the database for specific rows, each node in the cluster has to load all the rows into memory and perform in-memory filtering. This solution does not work well for workflow monitoring because the workflow monitoring data consumes a substantial amount of memory on each node.
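The load-everything-then-filter pattern described above can be illustrated with a minimal sketch. All names here (`TaskRow`, `scan_all_rows`, `running_tasks`, the `wf-1` identifier, and the status values) are hypothetical and chosen only for illustration; an in-memory list stands in for the eventually consistent table, and the sketch assumes 1,000 VMs with 5 tasks each.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    # Hypothetical row layout: one row per workflow task.
    workflow_id: str
    task_name: str
    progress: int
    status: str


def scan_all_rows(table):
    """Stand-in for the only read assumed to be available: a full table scan."""
    return list(table)


def running_tasks(table, workflow_id):
    """Return the running tasks of one workflow and the number of rows loaded."""
    rows = scan_all_rows(table)  # every row is copied into this node's memory
    matches = [
        row for row in rows
        if row.workflow_id == workflow_id and row.status == "running"
    ]
    return matches, len(rows)


# Assumed sizes: 1,000 VMs with 5 tasks each gives 5,000 rows.
VMS = 1000
TASKS_PER_VM = 5
table = [
    TaskRow("wf-1", f"vm{vm}-task{task}", 0, "running" if task == 0 else "queued")
    for vm in range(VMS)
    for task in range(TASKS_PER_VM)
]

matches, loaded = running_tasks(table, "wf-1")
print(f"loaded {loaded} rows to find {len(matches)} matching rows")
```

In this sketch the node materializes all 5,000 rows even though only 1,000 satisfy the filter, and every node that serves monitoring requests would repeat the same scan.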
Furthermore, due to the nature of eventual consistency, each management node could get a slightly different view of the database table depending on which node in the cluster it reads the data from. This could result in the monitoring UI receiving inconsistent and fluctuating responses to its monitoring requests to the server. As a result, the workflow monitoring UI may display inconsistent and fluctuating information.
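The fluctuation described above can be reproduced with a small, deterministic sketch. The `Replica` class, the task name `power-on-vm`, and the two-replica layout are assumptions made for illustration only; they model replication lag in general and do not describe any particular database.

```python
class Replica:
    """Hypothetical database replica that applies writes only when told to."""

    def __init__(self):
        self.rows = {}      # task name -> last applied progress value
        self.pending = []   # replicated writes not yet applied

    def enqueue(self, task, progress):
        self.pending.append((task, progress))

    def apply_pending(self):
        for task, progress in self.pending:
            self.rows[task] = progress
        self.pending.clear()

    def read(self, task):
        return self.rows.get(task, 0)


replicas = [Replica(), Replica()]

# A progress update is replicated to both replicas.
for replica in replicas:
    replica.enqueue("power-on-vm", 50)

# Replica 0 has caught up; replica 1 is still lagging.
replicas[0].apply_pending()

# Two management nodes each read from a different replica.
views = [replica.read("power-on-vm") for replica in replicas]

# A UI whose requests alternate between the two nodes sees progress jump around.
ui_samples = [views[i % 2] for i in range(4)]

# Once replica 1 applies its backlog, both views converge.
replicas[1].apply_pending()
converged = [replica.read("power-on-vm") for replica in replicas]

print(views, ui_samples, converged)
```

Until the lagging replica catches up, successive requests report progress of 50, then 0, then 50 again, which is the kind of inconsistent display the paragraph above describes.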