1. Field of the Invention
The present invention relates to an improved data processing system and, in particular, to a method and system for multiple computer or process coordinating. Still more particularly, the present invention provides a method and system for network management.
2. Description of Related Art
Technology expenditures have become a significant portion of operating costs for most enterprises, and businesses are constantly seeking ways to reduce information technology (IT) costs. This has given rise to an increasing number of outsourcing service providers, each promising, often contractually, to deliver reliable service while offloading the costly burdens of staffing, procuring, and maintaining an IT organization. While most service providers started as network pipe providers, they are moving into server outsourcing, application hosting, and desktop management. For those enterprises that do not outsource, they are demanding more accountability from their IT organizations as well as demanding that IT is integrated into their business goals. In both cases, “service level agreements” have been employed to contractually guarantee service delivery between an IT organization and its customers. As a result, IT teams now require management solutions that focus on and support “business processes” and “service delivery” rather than just disk space monitoring and network pings.
IT solutions now require end-to-end management that includes network connectivity, server maintenance, and application management in order to succeed. The focus of IT organizations has turned to ensuring overall service delivery and not just the “towers” of network, server, desktop, and application. Management systems must fulfill two broad goals: a flexible approach that allows rapid deployment and configuration of new services for the customer; and an ability to support rapid delivery of the management tools themselves. A successful management solution fits into a heterogeneous environment, provides openness with which it can knit together management tools and other types of applications, and a consistent approach to managing all of the IT assets.
With all of these requirements, a successful management approach will also require attention to the needs of the staff within the IT organization to accomplish these goals: the ability of an IT team to deploy an appropriate set of management tasks to match the delegated responsibilities of the IT staff; the ability of an IT team to navigate the relationships and effects of all of their technology assets, including networks, middleware, and applications; the ability of an IT team to define their roles and responsibilities consistently and securely across the various management tasks; the ability of an IT team to define groups of customers and their services consistently across the various management tasks; and the ability of an IT team to address, partition, and reach consistently the managed devices.
Many service providers have stated the need to be able to scale their capabilities to manage millions of devices. When one considers the number of customers in a home consumer network as well as pervasive devices, such as smart mobile phones, these numbers are quickly realized. Significant bottlenecks appear when typical IT solutions attempt to support more than several thousand devices.
Given such network spaces, a management system must be very resistant to failure so that service attributes, such as response time, uptime, and throughput, are delivered in accordance with guarantees in a service level agreement. In addition, a service provider may attempt to support as many customers as possible within a single network management system. The service provider's profit margins may materialize from the ability to bill the usage of a common network management system to multiple customers.
On the other hand, the service provider must be able to support contractual agreements on an individual basis. Service attributes, such as response time, uptime, and throughput, must be determinable for each customer. In order to do so, a network management system must provide a suite of network management tools that is able to perform device monitoring and discovery for each customer's network while integrating these abilities across a shared network backbone to gather the network management information into the service provider's distributed data processing system.
Hence, there is a direct relationship between the ability of a management system to provide network monitoring and discovery functionality and the ability of a service provider using the management system to serve multiple customers using a single management system. Preferably, the management system can replicate services, detect faults within a service, restart services, and reassign work to a replicated service. By implementing a common set of interfaces across all of their services, each service developer gains the benefits of system robustness. A well-designed, component-oriented, highly distributed system can easily accept a variety of services on a common infrastructure with built-in fault-tolerance and levels of service.
Distributed data processing systems with thousands of nodes are known in the prior art. The nodes can be geographically dispersed, and the overall computing environment can be managed in a distributed manner. The managed environment can be logically separated into a series of loosely connected managed regions, each with its management server for managing local resources. The management servers coordinate activities across the enterprise and permit remote site management and operation. Local resources within one region can be exported for the use of other regions in a variety of manners.
As with most organizations, an IT organization, such as a service provider, may classify its personnel into different managerial roles. Given a scenario in which a service provider is using an integrated network management system for multiple customers, it is most likely that many different individuals will be assigned to manage different customers, different regions, and different groups of devices. In addition, separate individuals may have different duties within similar portions of the network, such as deploying new devices versus monitoring the uptime of those devices, and the management system should be able to differentiate the roles of the individuals and to restrict the actions of any particular individual to those operations that are appropriate for the individual's role. In a highly distributed system comprising on the order of a million devices, the task of authenticating and authorizing the actions of many individuals per customer, per region, per device, etc., becomes quite complex.
Typically, the network management system configures a security context for the network administrator, for example, by assigning a specific class of privileges to the administrator based on the administrator's role or group. Within a runtime environment, the administrator can attempt to perform certain administrative functions that are not allowed to be performed by the average user. For example, a particular administrator may be responsible for maintaining a particular subnet of devices with a range of IP addresses between X.Y.Z.0 and X.Y.Z.255. At some point, the administrator may log into the management system and attempt to ping a device at address X.Y.Z.5 in order to determine whether the device is operational. The network management system must determine whether to allow the administrator to perform the ping function on that particular device.
In some systems, the authentication and authorization process occurs solely at the time of the login operation. If the administrator has been authenticated and is authorized to use a particular application, then the system allows the application to perform all of its actions without further authorization checks. With this type of system, however, it is difficult to achieve the granularity of control that is necessary for a service provider with multiple customers within a highly distributed system in which different sets of administrators have responsibility for different customers.
In other systems, various resources may be protected individually, regardless of any system-wide authentication or authorization processes. These resources may be protected by providing a set of security application programming interfaces (APIs) through which the resource is accessed. For each action requested to be performed on a resource, the system is required to perform an authentication process, an authorization process, or both. Hence, the system has a significant amount of overhead because, for each API invocation, some form of security process must be completed. However, within a system that performs network management tasks for a million devices or more, it would be quite difficult to have security enforced in this manner because a tremendous amount of computational resources throughout the system would be consumed for the managerial functions. For example, function calls would be constantly blocking to wait for a security function to complete, and significant network bandwidth would be consumed by status messages throughout the system.
Therefore, it would be particularly advantageous to provide a method and system that provides a flexible polling and monitoring scheme associated with network management tasks in a highly distributed system. It would be particularly advantageous for the network management system to provide network administrators with the ability to create adaptive polling and monitoring tasks that are based upon a network administrator's user security context.