Remote Monitoring and Management (RMM) systems are becoming increasingly popular for managing remote devices from a central server. They can be used to reduce the cost of system management, improve the quality of the environment, and improve the satisfaction of end users in receiving support. However, because of security issues, RMM systems typically have a delay inherent in their operation. This delay can interfere with the utility of the system in common circumstances. Reducing or eliminating the delay therefore makes the system much more powerful and valuable. RMM systems have historically used a variety of techniques to mitigate the delay.
One way to reduce the delay is to increase the polling rate of the remote devices in contacting the server. However, this does not scale very well. With even a modest number of remote devices, the server must be fairly powerful to handle the communication load, and a large amount of bandwidth is used for these frequent but largely unproductive connections. Both of these can greatly increase the cost of operation.
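The scaling problem can be made concrete with some rough arithmetic. The sketch below is illustrative only: the function name `polling_load` and the figures used (10,000 devices, 2,000 bytes exchanged per poll) are assumptions chosen for the example and are not taken from any particular RMM system.

```python
def polling_load(num_devices, poll_interval_s, bytes_per_poll):
    """Return (requests per second, bits per second) arriving at the server.

    Assumes every device polls once per interval and that the polls are
    spread evenly over time, which is the best case for the server.
    """
    requests_per_s = num_devices / poll_interval_s
    bits_per_s = requests_per_s * bytes_per_poll * 8
    return requests_per_s, bits_per_s


# Hypothetical installation: 10,000 devices, 2,000 bytes per poll.
for interval_s in (300, 60, 5):
    rps, bps = polling_load(10_000, interval_s, 2_000)
    print(f"poll every {interval_s:>3} s: {rps:8.1f} req/s, {bps / 1e6:6.2f} Mbit/s")
```

Under these assumptions, the load grows in inverse proportion to the polling interval, so shortening the interval from five minutes to five seconds multiplies both the request rate and the bandwidth by sixty, even though most polls find no work waiting.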
Other RMM systems have improved interactive response by having the agent on each device maintain an open network connection to the central server. The agent then idles, waiting for a response from the server, and the server can initiate action immediately by providing that response. However, this approach has many significant drawbacks, some of which are listed below.
Keeping connections open requires the use of ephemeral ports on the server. The protocol most commonly used for these connections is Transmission Control Protocol (TCP), whose 16-bit port numbers provide about 64,000 such ports. In a large installation with an active server, this limit could easily be reached, making it impossible to keep further connections open.
The implementation of TCP on the server must keep track of the long list of open connections being maintained by the agents. The TCP implementation is likely to slow down when the list gets long, increasing the overhead of network operations on the server.
Certain security vulnerabilities involve “hijacking” a TCP connection, which starts by “guessing” (or otherwise deriving) the state of the connection on the agent and the server. The longer a connection is open, the easier this “guessing” operation becomes, so connections that are open for extremely long periods of time are increasingly vulnerable to hijacking.
Some implementations of TCP may close an “idle” connection silently on the server side with no notification to the agent. If the definition of “idle” includes a connection that has been open for some period of time with no traffic, then this will happen on a regular basis with agents that leave a connection open for a long time. Since the agent will be disconnected but will still be waiting for a server response, the net result will be that the agent is unreachable for management.
To get around the previous limitation, some agent implementations use “keepalives” to maintain the connection. These keepalives are periodic transmissions that use the network but do not carry any significant data. However, with many agents keeping connections open to the server, the keepalives can consume a significant fraction of the server's network bandwidth and add processing time in its network implementation, wasting valuable server resources.
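As an illustration of one common form of keepalive, the sketch below turns on TCP-level keepalive probes for a socket. Whether a given agent uses this mechanism or an application-level heartbeat is not stated here, so this is only one possibility. The helper name `enable_keepalive` and the timing values are assumptions, and the three tuning options are platform-specific, so the code skips any that the local `socket` module does not expose.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive probes for an already-created socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning constants are platform-specific (they exist on Linux);
    # any constant missing from this platform's socket module is skipped.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Each probe is small, but with these example settings every idle connection still generates traffic on a fixed schedule, and the server must process that traffic for every connected agent.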
Some RMM systems get around the previous limitation by using a proprietary protocol that has keepalive processing built in with reduced bandwidth requirements. These implementations often run afoul of the wide variety of firewall and network security tools in use, because those tools are not aware of the proprietary protocol and may block or disable it.
A minor interruption in network service on the server can close all the connections to agents. In a system that keeps these connections open for long periods of time, all of the agents will attempt to open new connections to the server at about the same time after such an interruption. This will have an immediate and large impact on the server as it tries to service all of these requests.
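The size of this reconnection surge can be estimated with a deliberately simple model. The sketch below assumes that every agent tries to reconnect at the same instant and that the server accepts a fixed number of new connections per second. The function name `reconnect_backlog` and the figures used (50,000 agents, 2,000 accepts per second) are hypothetical.

```python
def reconnect_backlog(num_agents, accepts_per_s, seconds):
    """Return the number of agents still waiting after each elapsed second.

    Toy model: all agents attempt to reconnect at t = 0 and the server
    accepts at most `accepts_per_s` new connections every second.
    """
    pending = num_agents
    backlog = []
    for _ in range(seconds):
        pending = max(0, pending - accepts_per_s)
        backlog.append(pending)
    return backlog


# Hypothetical installation: 50,000 agents, 2,000 accepts per second.
print(reconnect_backlog(50_000, 2_000, 5))
```

With these assumed figures, the server needs 25 seconds of uninterrupted work just to restore the connections, and during that time the agents still waiting remain unreachable for management.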