1. Field of the Invention
The present invention relates, in general, to communications and data transfer among computers and network nodes, and, more particularly, to software, hardware, and computer systems for analyzing communication performance between applications, such as telemetry data generation and reception/monitoring applications, running on networked computing devices and for managing data transfer between the applications to provide enhanced performance or maintain desired levels of performance.
2. Relevant Background
In today's society, a huge amount of digital data is transferred over communications networks that may be made up of local area networks (LANs), wide area networks (WANs), intranets, the Internet, other communication channels and networks, and any combination of such networks. For network designers and operators and for those using these networks for the transfer of their messages and data, an ongoing and difficult problem is how to control communications over these complicated networks to obtain not only predictable and secure communications but also the most prompt delivery of the message or data. In other words, it is often important that the information transmitted from an application or computing device be received in a timely manner by another application or computing device.
There are numerous sources of latency (i.e., time delay or the time it takes to get information through a network) and/or slow throughput in digital communications networks. Often, congestion may occur in the middle transport or public network portion of the network between two communicating applications. However, congestion also may occur on the portions of the network that are within the control of the entities operating the computer device or node running the source application and/or the target destination application. For example, congestion may occur on the LAN segment of the source device or occur on the WAN segment to which the source device connects (e.g., a pre-existing high utilization condition on a WAN link or the like). Additionally, the source network may use a WAN protocol that introduces latency, such as would be the case if a high-overhead or high-correction protocol (e.g., the X.25 protocol) is used. Further, general WAN latency may be experienced by a connection or channel selected by the source application due to multi-hop Internet pathways, committed information rates of variable network types such as Frame Relay Committed Information Rate (CIR), efficiency of connection based on values such as Maximum Transmission Unit (MTU), and fragmentation and re-transmits experienced across a WAN path. Similarly, congestion and latency may be introduced within networks and/or communication channels under the control of operators or entities maintaining a destination network. For example, congestion may occur on the WAN segment of the target destination network or on the LAN segment of the destination network and/or host/system. Other causes of latency may be directly related to the source or destination application, such as high operating system layer utilization bottlenecks on generating or processing a message or performance issues related to a particular source or target application.
Controlling or limiting latency and slow throughput on a network may be important in many cases where two applications need to communicate over a network. For example, a number of companies have developed systems in which they monitor operating computer systems by gathering telemetry data at a host or source system or network with a telemetry generation application, transferring this data over a communications network to another computer device linked to the network, and using a monitoring/analysis application to process the received telemetry data. These systems may be thought of as telemetry systems that collect and store telemetry data on behalf of their customers. During operation, the telemetry systems monitor generated or incoming data streams in real time for significant events and analyze the data using complex pattern recognition and statistical analysis formulas to predict possible faults. The nature and definition of telemetry data may vary but typically includes alarm messages or utilization statistics, for which message sizes are typically only a few hundred bytes, and may include large quantities of system configuration data that is transmitted in messages whose payloads can easily be several megabytes in size. A useful definition of telemetry data or messages may be any data or messages, regardless of size, that may be polled, received, or analyzed and that may provide benefit in terms of maintaining and/or increasing availability or performance of a particular computer device or system, e.g., any data collected from a source system by a source telemetry application for use in monitoring and/or analyzing performance of the source system.
In telemetry systems, telemetry data is considered time-sensitive data, and it is generally desirable to provide the fastest possible collection of the data at the source system and delivery of the data to a telemetry analysis system (e.g., an analysis application running on a node or device linked to a network). Ideally, a telemetry connection channel used to communicate the telemetry data between the source and analysis systems (e.g., telemetry source and destination) should have adequate bandwidth, low latency, and zero or very low downtime. The value of the telemetry data collection process diminishes rapidly as delivery time to the destination increases. For example, for an online retailer, discovering that a critical event occurred in their environment or computer network/system and is having a financial impact on their revenue stream (e.g., buyers cannot complete purchases and the like) is highly valuable data that needs to be put to immediate use to correct a problem. In this case, a delay of even a few minutes or seconds may mean many lost sales, irritated customers, or worse. In another example, complex predictive modeling algorithms that are used in telemetry analysis may produce differing results if one or more data points are lost or delayed. This may result in a significant failure or operating problem in a computer system not being predicted prior to its occurrence or prior to a time when it may be prevented. In these and other similar application-to-application communication environments, the fastest possible delivery of data is often a critical factor in being able to use the data in a meaningful way.
Existing communication techniques generally involve a source application generating a message or data payload, selecting a source for the message, and transmitting the message with its data payload over a communications network. The source application has no control over the latency, bandwidth, and availability of the communication channels used to transmit the generated message. Hardware solutions are sometimes implemented by building LANs, WANs, and connections that provide desired latencies and bandwidths. However, congestion may still occur in such networks, and differing communication paths in the LANs, WANs, and connections between the source application and the connection to the middle transport such as the Internet may have differing transmission characteristics such as differing bandwidth and latency that result in data transmitted on such communication paths reaching the destination or target at differing throughput rates. Similar differences in throughput rates may occur at the destination or target system such as between a connection to the middle transport and the destination (e.g., in the destination LAN, WAN, or the like). Efforts have been made to enhance data transfer within the middle transport such as between routers. Such efforts typically include complex algorithms and counters implemented at lower layers of the data transfer protocol stack (e.g., in the network layer of the TCP/IP protocol stack). While improving communication rates and reliability within the middle transport such as the Internet, these efforts may still result in application-to-application communications varying significantly and having undesirable delays in message receipt by a target or destination application such as when a communication channel is out of service or when there is a problem within a source or destination system rather than in the public network.
Hence, there remains a need for improved methods and systems for optimizing network communications between two applications. Preferably, such methods and systems would be particularly well suited for analyzing and optimizing telemetry streams or telemetry signals transmitted from a telemetry generation application to a destination analysis application to enhance real time analysis of telemetry data.