High-volume throughput-centric systems include a large class of applications where requests or processing tasks are generated automatically in high volume by software tools rather than by interactive users, e.g., data stream processing and search engine index update. These systems are becoming increasingly popular and their performance characteristics are radically different from those of typical online Web applications. Most notably, Web applications are response time sensitive, whereas these systems are throughput centric.
Performance control for online interactive Web applications has been a focus of research for years, and tremendous progress has been made in that area. By contrast, relatively little attention has been paid to performance control for a large class of increasingly popular applications, where requests or processing tasks are generated automatically in high volume by software tools rather than by interactive users. Many emerging stream processing systems fall into this category, e.g., continuous analysis and distribution of news articles, as in Google Reader™ and System S™.
Moreover, almost every high-volume interactive Web application is supported behind the scenes by a set of high-volume throughput-centric processes, e.g., Web crawling and index update in search engines, Web log mining for Web portal personalization, video preprocessing and format conversion in YouTube™, and batch conversion of rich-media Web sites for mobile phone users.
Beyond the Web domain, additional examples of high-volume throughput-centric systems include IT monitoring and management, overnight analysis of retail transaction logs, film animation rendering, robot trading in electronic financial markets, scientific applications, sensor networks for habitat monitoring, network traffic analysis, and video surveillance.
The workload and operating environment of these high-volume throughput-centric systems differ radically from those of session-based online Web applications. Most notably, Web applications usually use response time to guide performance control, whereas high-volume throughput-centric systems are less sensitive to the response times of individual requests, because there are no interactive users waiting for immediate responses to individual requests. Instead, these systems benefit more from high throughput, which also helps lower average response time and hardware requirements.
Computer systems for information technology (IT) monitoring and management belong to the category of high-volume throughput-centric systems. Today's enterprise IT environments are extremely complex. They often include resources from multiple vendors and platforms. Every hardware device, operating system, middleware product, and application usually comes with its own siloed monitoring and management tool. To provide a holistic view of the entire IT environment while taking into account the dependencies between IT components, a federated IT Service Management (ITSM) system may use a core event-processing engine to drive and integrate the various siloed software tools involved in IT management.
An IT event broadly represents a piece of information that needs to be processed by the ITSM system. For instance, under normal operations, transaction response times may be collected continuously to determine the service quality. Monitoring tools can also generate events to report problems, e.g., that a database is down. When processing an event, the event-processing engine may interact with various other components in the federated ITSM system, e.g., retrieving from a remote database the profile of the customer affected by the outage, invoking an instant messaging server to notify the system administrator if a VIP customer is affected, or generating in the service portal a trouble ticket to be handled by service personnel if automated remediation fails.
When a major IT component (e.g., core router) fails, the rate of IT events may surge by several orders of magnitude due to the domino effect of the failure. If the event-processing engine tries to process all events concurrently, either the engine itself or some external programs working with the engine may become severely overloaded and suffer from thrashing.
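As a purely illustrative sketch of the overload scenario above, the following Python fragment shows an event-processing engine that caps the number of events handled at the same time, so that a surge queues up instead of being processed all at once. The class name `BoundedEventEngine`, the handler `handle_event`, and the fixed limit are hypothetical and are not taken from any particular ITSM product; a fixed cap is only one simple option, and choosing a suitable value is itself nontrivial.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedEventEngine:
    """Hypothetical engine that processes at most `max_concurrent` events at once."""

    def __init__(self, max_concurrent, handler):
        # The worker pool size is the concurrency cap; surplus events wait
        # in the executor's internal queue instead of running immediately.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._handler = handler
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0  # highest concurrency observed so far

    def _run(self, event):
        with self._lock:
            self._in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
        try:
            return self._handler(event)
        finally:
            with self._lock:
                self._in_flight -= 1

    def process_all(self, events):
        # Submit every event, then collect results in submission order.
        futures = [self._pool.submit(self._run, e) for e in events]
        return [f.result() for f in futures]

    def close(self):
        self._pool.shutdown(wait=True)


def handle_event(event):
    # Stand-in for real work such as a database lookup or a notification.
    time.sleep(0.001)
    return f"handled:{event}"


if __name__ == "__main__":
    engine = BoundedEventEngine(max_concurrent=4, handler=handle_event)
    results = engine.process_all(range(200))  # simulated event surge
    print(len(results), "events processed; peak concurrency:", engine.peak_in_flight)
    engine.close()
```

In this sketch the cap protects both the engine and whatever external programs the handler calls, at the cost of queueing delay during a surge. A cap that is too low wastes capacity, while one that is too high reintroduces the overload.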
Controlling performance in such systems is difficult. For example, in a federated system having components from different vendors, performance control can only take a black-box approach that does not require intimate knowledge of the internal implementation details of every component. Furthermore, there are no simple performance indicators to guide tuning, such as packet loss in TCP or response time violation in interactive Web applications.
In light of today's complex and heterogeneous IT environments, the success of an ITSM product owes much to its ability to integrate various distributed data sources and siloed monitoring and management tools. Because of the diversity of the external programs working with the product, the assumptions on which existing performance control algorithms rely cannot be made. For instance, it cannot be assumed that an IT product can remotely track the resource consumption of every external program. It cannot be assumed that the source of the performance bottleneck is always the IT product instead of an external program. It cannot be assumed that CPU is always the bottleneck resource. It cannot be assumed that every external program has its own overload protection mechanism. It cannot be assumed that the IT solutions share a common static topology. Therefore, online performance controllers based on static queuing models are not always suitable.