Operating and maintaining complex world-wide web (WWW) software applications, hereafter termed web applications, has proven to be difficult. A web application is commonly described as any application that uses a web browser as a client. In some instances a web application may also refer to a computer software application that is coded to execute in the context of a browser. Typically there is a server component that provides information to the web-based client over network connections. A distributed web application refers to a system that utilizes information provided by servers in a distributed or multi-tiered configuration.
Experience from operations personnel, software developers and systems administrators indicates that the ability to determine the status of a complex web application, as well as to resolve problems with it, is enhanced by the use of relevant and timely data. This data comes from metrics gathered within application components and system infrastructure. Metrics are the measure of the efficiency and effectiveness of an operation or process.
The status of a web application represents the ability of the application to meet certain criteria, such as its service level agreements. A service level agreement may be a formal written agreement, or it may be informal expectations of an application to perform in a certain manner. If the application is expected, by way of example, to display a web page with specific information in less than four or five seconds, then the status of the web application is the measurement of that application's ability to display the web page in the expected time frame.
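By way of illustration, the status determination described above can be sketched as a simple timed check. This is a minimal sketch only; the five-second threshold comes from the example above, and the function names and structure are illustrative assumptions rather than part of any particular implementation.

```python
import time
import urllib.request

# Illustrative SLA threshold from the example above: the page is
# expected to display its information in under five seconds.
SLA_THRESHOLD_SECONDS = 5.0

def meets_sla(elapsed_seconds: float,
              threshold: float = SLA_THRESHOLD_SECONDS) -> bool:
    """Status for one request: did the page render within the
    expected time frame?"""
    return elapsed_seconds < threshold

def timed_fetch(url: str) -> float:
    """Fetch a page and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as response:
        response.read()  # retrieve the full page body
    return time.monotonic() - start
```

In use, `meets_sla(timed_fetch(url))` would report whether a single page load met the expected time frame; repeated measurements of this kind form the basis of the status of the application.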
Problem resolution of a web application refers to the process of identifying that the status of a web application indicates that the application is not meeting its service level agreements and taking action to correct any related issue.
The definitive set of metrics that is relevant to a distributed web application is not well understood. A fundamental issue stems from the fact that technologists rarely know what specific information will be required for problem resolution before it is required. To this end, it is preferable to gather and store a large set of information in case it is needed.
A system that provides the relevant set of metrics and information required for operation of a distributed web-based software application would have to collect data from the executing software application, aggregate the data, create summaries of the data, store the data and provide the data in a timely manner. In a design where data is collected from remote servers and stored in a central location, a constant stream of data would be created, originating from remote servers.
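The collect, aggregate, summarize and store steps described above can be sketched as follows. This is a minimal sketch, assuming each remote server reports raw samples as (metric name, value) pairs; the function names and summary fields are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def aggregate(samples):
    """Group raw samples streamed from remote servers by metric name."""
    grouped = defaultdict(list)
    for name, value in samples:
        grouped[name].append(value)
    return grouped

def summarize(grouped):
    """Create a per-metric summary (count, min, max, mean) suitable
    for storage in a central location."""
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for name, values in grouped.items()
    }
```

For example, `summarize(aggregate(samples))` reduces a constant stream of raw samples to compact per-metric summaries before they are stored centrally.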
A system designed to collect very detailed operational data from remote data sources requires a significant quantity of computer resources to accomplish this task; and the load on such a system is directly related to the number of remote data sources being processed. As the number of remote data sources increases, a system processing the data from these sources needs to scale to handle the increase in load.
Prior art approaches scale a system for processing operational data from remote data sources by expanding the first system itself. In this manner, the issue of scale is addressed by adding additional processing power, storage and network bandwidth to the first system as needed.
In the past, several techniques have been utilized to accomplish scaling-out of a first system. Clustering of commodity PC hardware is a technique used to scale-out a first system. A paper entitled “High Performance Cluster Computing: Architectures and Systems, Volume 1” (Rajkumar Buyya (editor), ISBN 0-13-013784-7, Prentice Hall PTR, NJ, USA, 1999) provides an overview of cluster technologies and approaches. Sharding is another approach to scaling out a first system. A sharding approach results in distributing remote data sources across different data handling processes, or shards, such that each shard can only handle a set number of remote data sources, and as remote data sources increase more shards are added. A paper entitled “Scalable Web Architecture and Distributed Systems” (Kate Matsudaira, “The Architecture of Open Source Applications”, http://www.aosabook.org/en/distsys.html) provides a detailed discussion of the use of sharding techniques.
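The sharding approach described above can be sketched as follows: each shard handles at most a set number of remote data sources, and a new shard is added when existing shards are full. This is a minimal sketch; the per-shard capacity and class names are illustrative assumptions, not part of any cited system.

```python
# Illustrative capacity: each shard can handle at most this many
# remote data sources before a new shard must be added.
SOURCES_PER_SHARD = 3

class ShardedCollector:
    def __init__(self):
        # Each shard is represented as a list of source identifiers.
        self.shards = []

    def add_source(self, source_id: str) -> int:
        """Assign a remote data source to a shard, adding a new shard
        when all existing shards are at capacity. Returns the index of
        the shard handling the source."""
        for index, shard in enumerate(self.shards):
            if len(shard) < SOURCES_PER_SHARD:
                shard.append(source_id)
                return index
        self.shards.append([source_id])
        return len(self.shards) - 1
```

As remote data sources increase, the number of shards grows proportionally, distributing the data handling load across shards.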
There are limits to scaling a system by simply adding additional resources to that system. The software that is processing the operational data must be designed in such a way that it can be scaled. Additional hardware resources can be added to the system that needs to be expanded, but if the software is not developed to support scaling, the additional hardware will not yield the required increase in capacity of that system.
Scaling-out of a system that processes operational data from remote data sources by adding additional compute resources results in the problem of designing for unlimited scale. Software developers are required to create systems that support changing load requirements. It becomes quite difficult to validate that a system functions properly if the load presented to such a system, in the form of remote data sources, is ever-increasing.
A method is required to support increased capacity of a system that processes operational data from remote data sources in such a way that the software supporting such a system does not have to be designed for infinite scale. This invention defines a method for replication of a first system, as opposed to a scale-out of the first system. This replication allows a system to support ever-increasing capacity without the need to design for infinite scale.
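The replication approach can be sketched in contrast to scale-out: rather than growing one system without bound, complete systems of a known, validated capacity are replicated as load grows, so the software never needs to handle more load than one replica was designed for. This is a minimal sketch under those assumptions; the capacity limit and class names are illustrative only.

```python
# Illustrative fixed capacity that one complete system instance was
# designed and validated for.
SYSTEM_CAPACITY = 100

class System:
    """A complete, independent first system with a known capacity."""
    def __init__(self):
        self.sources = 0

    def has_capacity(self) -> bool:
        return self.sources < SYSTEM_CAPACITY

class ReplicatedDeployment:
    """Replicates whole systems as load grows, instead of scaling
    out a single first system."""
    def __init__(self):
        self.systems = [System()]

    def register_source(self) -> int:
        """Route a new remote data source to a system with capacity,
        replicating the first system when all replicas are full.
        Returns the index of the system handling the source."""
        for index, system in enumerate(self.systems):
            if system.has_capacity():
                system.sources += 1
                return index
        replica = System()  # replicate a complete first system
        replica.sources += 1
        self.systems.append(replica)
        return len(self.systems) - 1
```

Each replica is a self-contained system handling at most its validated capacity, so no single instance of the software ever faces ever-increasing load.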