Large-scale computing systems, such as those found in network-based production services, have become widely available in recent years. Examples of these systems include on-line retail, on-line internet service providers, on-line businesses such as photo processing, corporate networks, cloud computing services and/or web-based hosting services. These businesses may have multiple computing devices (e.g., thousands of hosts) in geographically separate locations configured to process millions of client requests daily or even hourly, for example. Ensuring that these services can scale to handle abnormal loads (e.g., client requests) is a non-trivial problem. Instead of testing an actual production system, software testers usually create a scaled-down copy of a production system with a smaller number of hosts and test with a smaller, proportional load. Other approaches include component-level stress tests in which a single component of the architecture is targeted with each test. In some instances, software testers will test with engineered data that has no relationship to actual production data. Creating engineered data from scratch requires the tester to have some knowledge of the production data patterns in order to build a model. Furthermore, the model is not guaranteed to simulate production traffic in a realistic manner. For example, randomly generated data may not accurately reflect relationships between transactions (e.g., user X takes steps a, b, and c in the website with specific latency; these would be separate transactions, but there is a specific relationship between them that is not reflected in generated data).
Using real-world data in a large-scale stress test is also challenging. For example, using production data may prevent testing potential or expected situations. Suppose that 30% of the transactions are currently of type X, but this share is expected to rise to 80% in the future. Using existing production data would not test this scenario. Furthermore, existing test solutions are not scalable enough to handle storing, accessing, processing and/or applying a test load at the size of today's large production systems. As a further complication, it may be desirable to test for some time periods having loads that are many times the load of other time periods. For example, a business may want to test how a network site will handle increased traffic during a time period for which the business is advertising a special promotion, or test how a retail website will handle a volume of traffic expected on peak shopping days (e.g., Black Friday or Cyber Monday). Testing with the current level of production data would not test the increased traffic scenario.
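The transaction-mix and peak-load scenarios above can be made concrete with a small sketch. The Python below is an illustration only and is not a method described in this document: it assumes recorded transactions are available as dictionaries with a "type" field, and the function name rebalance_mix and its parameters are hypothetical. It resamples recorded transactions, with replacement, so that one transaction type makes up a chosen fraction of the output while the total volume is multiplied by a scale factor.

```python
import random
from collections import Counter


def rebalance_mix(transactions, target_type, target_fraction, scale=1.0, seed=0):
    """Resample recorded transactions (hypothetical helper, for illustration).

    Returns round(len(transactions) * scale) transactions, of which
    target_fraction have type == target_type. Sampling is with replacement.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be between 0 and 1")

    rng = random.Random(seed)  # seeded so a test run can be repeated

    # Split the recorded data into the type being boosted and everything else.
    of_type = [t for t in transactions if t["type"] == target_type]
    others = [t for t in transactions if t["type"] != target_type]

    total = round(len(transactions) * scale)
    n_target = round(total * target_fraction)
    n_other = total - n_target

    if (n_target and not of_type) or (n_other and not others):
        raise ValueError("no recorded transactions to sample from")

    sample = rng.choices(of_type, k=n_target) if n_target else []
    sample += rng.choices(others, k=n_other) if n_other else []
    rng.shuffle(sample)  # interleave the two groups
    return sample


if __name__ == "__main__":
    # Assumed recording: 30% of type "X", 70% of type "Y".
    recorded = [{"type": "X", "id": i} for i in range(30)]
    recorded += [{"type": "Y", "id": i} for i in range(30, 100)]

    # Shift the mix to 80% type "X" at three times the recorded volume.
    load = rebalance_mix(recorded, "X", 0.8, scale=3.0)
    print(len(load), Counter(t["type"] for t in load))
```

Because each transaction is drawn independently, a sketch like this does not preserve the relationships between transactions noted earlier (e.g., a user taking steps a, b, and c in sequence), so it illustrates the arithmetic of the scenario rather than a realistic load.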
Testing a large-scale network with the methods described above often misses problems that would surface only at a higher scale or that occur only in the production system. Even when production data is used, it is difficult to model potential situations, especially those that involve a change in the mixture, or ratio, of transactions relative to the current production data. Additionally, the methods described above for testing components individually, for example, may not encounter issues that are found only through the interaction between subcomponents in a system. This may lead to outages in the production system that affect business revenue and degrade the customer experience.
While the invention is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention. Headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description.