Many enterprises run complex networks of servers to implement various automated communication functions for the enterprise. For example, as mobile wireless communications have become increasingly popular, carriers such as Verizon Wireless have deployed customer communication systems to provide notifications of account-related activities to their customers, for example as SMS messages to account holders' mobile stations, as emails, etc. Because of the large number of customers served by a major carrier, and the level of account activity, the volume of notification message traffic is quite large. To effectively provide such notifications, Verizon Wireless implemented its Customer Communication Enterprise Services (CCES) as an enterprise middleware web service.
At a high level, the CCES middleware comprises a web server layer and an application server layer. The architecture allows clients to send a request, for example for a notification, to a web server. The http web server then forwards the client request to one of a number of application servers. Each application server has multiple applications running on it. The application server determines the proper application to process the client request based on the context root of the client request. The application server processes the client request, in the CCES example, by sending one or more request messages to a back end system such as the Vision Billing System, MTAS, the SMS gateway and others, for example, to implement account activity and to initiate subsequent automatic notification thereof to the account holder. Once the application server has processed the request, a reply is then sent back to the web server which will then forward the reply back to the client.
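By way of illustration only, the selection of an application from the context root of a client request might be sketched as follows. The sketch assumes that the context root is the first path segment of the request URL; the application names, the mapping and the function names are hypothetical and are not taken from the CCES implementation.

```python
from urllib.parse import urlsplit

# Hypothetical mapping of context roots to applications (illustrative only).
APPLICATIONS = {
    "notify": "notification-app",
    "billing": "billing-app",
}


def context_root(url: str) -> str:
    """Return the first path segment of the request URL (assumed context root)."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    return segments[0] if segments else ""


def select_application(url: str) -> str:
    """Pick the application that should process the client request."""
    root = context_root(url)
    try:
        return APPLICATIONS[root]
    except KeyError:
        raise LookupError(f"no application for context root {root!r}") from None
```

For example, under these assumptions a request for `http://host.example/notify/sms?id=1` would be handed to the application registered under the context root `notify`.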
In such an arrangement, the web server will keep a process thread open for each client request until the reply is sent back to the client from which the web server received the request. A problem occurs whenever any application server or backend process is slow. When this happens, the http web server starts creating new threads faster than it can close older threads, which causes the total number of threads to climb. Since an http server has the capacity to keep open only a finite number of threads, eventually the web server reaches its limit and an outage can occur.
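The thread build-up described above can be illustrated with a simplified model. The sketch below assumes a constant request arrival rate and a fixed reply delay, with one thread held per request until its reply returns; the rates, delays and thread limit in the usage note are hypothetical values chosen only for illustration.

```python
def open_threads_over_time(arrivals_per_sec, reply_delay_sec, duration_sec):
    """Return the number of open threads at the end of each second.

    Each request opens one thread on arrival and holds it until the reply
    comes back reply_delay_sec seconds later.
    """
    counts = []
    for second in range(1, duration_sec + 1):
        # Only requests that arrived within the last reply_delay_sec seconds
        # are still waiting for a reply and therefore still hold a thread.
        seconds_still_open = min(second, reply_delay_sec)
        counts.append(arrivals_per_sec * seconds_still_open)
    return counts


def first_second_over_limit(counts, max_threads):
    """Return the first second at which the thread limit is exceeded, or None."""
    for second, count in enumerate(counts, start=1):
        if count > max_threads:
            return second
    return None
```

With 50 requests per second and a one second reply delay, the model holds steady at 50 open threads. If a slow backend stretches the reply delay to ten seconds, the count climbs by 50 each second toward 500, so a server assumed to be limited to 256 threads would exceed that limit in the sixth second.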
It is fairly easy to monitor the number of open threads on an http server, but it is a challenge to find out why the number of threads is high. For example, each http server is shared by multiple applications. Thus, even if it is known that a particular web server has a high number of threads open, it is still not known which application is causing the problem. Also, each http server sends the requests to two or more application servers in a round robin fashion. Thus, the support technician or system would not know the path of a request even if the http server that processed the request can be identified. Furthermore, the slowdown is often caused by problems in one of the downstream backend systems, such as the Vision billing system, to which the CCES support person does not have direct access. Also, often the best way to diagnose a problem with a backend system is to perform a thread dump of the application that is having a problem. The thread dump provides an image of the open threads, for analysis by the support person. However, the thread dump must be done during the slowdown with the backend system in order to be effective. In a typical CCES example, each application server may have over twenty applications running on it. Since problems with backend systems can sometimes last only a few minutes, it is very challenging for a CCES support person to determine the correct application and run the thread dump while the problem is occurring.
These conditions create several difficulties in pinpointing a problem: identifying the specific application on a server that is shared by multiple applications; identifying the specific server request path when the architecture is configured to distribute the load in a round robin fashion; the lack of access to downstream systems such as the Vision billing system; and the short time window in which to perform any thread dumps in real time during a system slowdown.
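The difficulty of tracing a request path under round robin distribution can be illustrated with a minimal sketch. The server names and request identifiers below are hypothetical, and the sketch assumes a strict rotation with no weighting or failover.

```python
from itertools import cycle

# Hypothetical application server names, for illustration only.
APP_SERVERS = ["app-server-1", "app-server-2", "app-server-3"]


def dispatch(request_ids, servers):
    """Assign each request to the next application server in strict rotation."""
    rotation = cycle(servers)
    return {request_id: next(rotation) for request_id in request_ids}
```

In this sketch the server chosen for a request depends only on the position of the request in the arrival sequence and not on anything in the request itself. Unless each assignment is recorded at dispatch time, knowing which http server handled a request does not reveal which application server processed it.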
Hence there is room for improvement in monitoring thread count in web server middleware systems, such as CCES, to address one or more of the above discussed challenges.