The complexity of distributed systems and of the mechanisms for testing them has been widely explored for many years. Distributed systems pose many inherent challenges, such as the latency of asynchronous communications, error recovery, clock drift, and service partitioning, which lead to numerous problems including deadlocks and race conditions. Testing such complex systems is correspondingly challenging. Over the years, many methods for automatic test generation, deployment, and execution have been investigated and implemented. However, considerable effort is still required in the area of automatic system validation and verification.
Because event sequence control is complex and the number of test scenarios explodes as the system scale increases, most testing methodologies apply random testing within a model-based approach. In such cases, the verification cannot be coupled with particular actions and faults. Typical current models construct the graph of the states a distributed system could reach and then conduct brute-force verification, which results in state space explosion and makes high-level system abstraction difficult. One example of a large distributed system is MICROSOFT™ SQL Azure, which provides a cloud-based storage service that can store large amounts of data on a variety of physical hardware in distributed data centers. In addition, the scale of the system changes dynamically in order to provide elastic storage.
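The state space explosion noted above can be illustrated with a rough count. The following is a minimal sketch, not a model of any particular system: it assumes each node independently occupies one of a fixed number of local states, so the count is only an upper bound on the global states a brute-force verifier might have to visit. The function name and the choice of three local states per node are hypothetical.

```python
def global_state_upper_bound(node_count: int, states_per_node: int) -> int:
    """Upper bound on global states, assuming nodes vary independently.

    Simplifying assumption: messages in flight and shared state are ignored,
    so a real system may have more or fewer reachable states than this.
    """
    return states_per_node ** node_count


if __name__ == "__main__":
    # Hypothetical example: each node is "up", "down", or "recovering".
    for nodes in (5, 10, 20):
        print(nodes, global_state_upper_bound(nodes, 3))
```

Even under these simplified assumptions, doubling the node count from 10 to 20 raises the bound from 59,049 to 3,486,784,401 states, which suggests why exhaustive exploration becomes impractical as the system scales.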
Both the development and the testing of such a system present great challenges. Traditional testing approaches might test functionality on one physical computer, rely on failure injection to test failover, and use long-haul tools to introduce load to the system. Tests can then check whether the distributed system is healthy: that there are no errors, no partitions in abnormal states, no nodes that are down, and so forth. Investigating issues on a distributed system is a non-trivial process. When issues occur on the distributed system, the investigation requires specific domain expertise and knowledge. A large number of traces may require correlation and lengthy investigation. Currently, event correlation is conducted manually, which involves correlating a set of history tables. Without knowing the details of the related system components, it is difficult to track a problem down to its root cause. Monitoring of distributed system health is also a largely manual process. Limited auto-monitoring can check some factors, such as availability, service switch, and watchdog errors, but a person checks the large volume of other metrics only if something goes wrong; otherwise, the information is simply logged and ignored. In addition, some abnormal behaviors may not manifest themselves as, or lead to, obvious application errors, or may not be persisted in the way that auto-monitoring needs. Typical examples are in-memory states, unnecessary state transitions due to stale triggering events, and transient healthy states that manual checks could overlook.
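To make the manual correlation of history tables described above more concrete, the following is a minimal sketch of what an automated version might look like. It is an illustration under stated assumptions rather than a description of any existing tool: the `Event` record, the table names, the `partition` correlation key, and the `correlate` function are all hypothetical, and timestamps are assumed to be comparable across tables, which ignores clock drift.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; assumed comparable across tables
    source: str       # hypothetical name of the history table
    partition: str    # hypothetical key used to correlate events
    message: str


def correlate(tables: Iterable[List[Event]], partition: str,
              center: float, window: float) -> List[Event]:
    """Merge several history tables into one timeline for a partition.

    Keeps only events for `partition` whose timestamp lies within
    `window` seconds of `center`, sorted by timestamp.
    """
    merged = sorted(chain.from_iterable(tables), key=lambda e: e.timestamp)
    return [e for e in merged
            if e.partition == partition and abs(e.timestamp - center) <= window]


# Hypothetical sample data standing in for two history tables.
failover_history = [
    Event(100.0, "failover_history", "p1", "primary lost"),
    Event(500.0, "failover_history", "p2", "primary lost"),
]
node_history = [
    Event(98.5, "node_history", "p1", "node down"),
    Event(300.0, "node_history", "p1", "node up"),
]

timeline = correlate([failover_history, node_history],
                     partition="p1", center=100.0, window=5.0)
for event in timeline:
    print(event.timestamp, event.source, event.message)
```

In this sample, the timeline for partition `p1` places the "node down" record just before the "primary lost" record, which is the kind of ordering an investigator would otherwise have to reconstruct by hand from separate tables.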