One of the primary responsibilities of service providers is to ensure that their services provide a level of performance and robustness that satisfies the commitments specified in their service level agreements (SLAs) with customers. A standard approach is to monitor the quality and behaviour of the services by measuring system-internal performance characteristics (such as round trip delay, available bandwidth and loss ratio) and to identify unusual or anomalous activity that either directly indicates or indirectly implies that the service is no longer behaving satisfactorily. These measurements allow for detection of quality degradation or functional loss. Additionally, a service root cause analysis (S-RCA) can be used to analyze the root causes of service performance degradations, in order to identify the fault that resulted in the quality degradation or functional loss.
To obtain this measurement information, a service assurance function must rely on detailed event reporting from network resources, particularly the reporting of measurement events.
Because network nodes generate a massive number of measurement events, it is impractical to collect all event data in a central database for future correlation and analysis. Consequently, intelligent filtering and aggregation must be applied to reduce the amount of data, while still allowing for drill-down.
In terms of measurements, numerous measurement systems have been proposed and implemented. One way to classify measurement methods is to distinguish between active and passive approaches.
Active measurements involve injection of traffic into the network in order to probe certain network devices (for example using PING) or to measure network properties such as round-trip time (RTT), one-way delay and available bandwidth. The results from active measurements are generally easy to interpret. However, the injected traffic may affect the network under test.
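As an illustration of an active measurement, the following minimal sketch estimates RTT by timing a TCP handshake. This is an assumption made for the example, not a method taken from the text: PING uses ICMP echo, which typically needs raw-socket privileges, so a TCP connect is used instead. The function names `tcp_connect_rtt` and `probe` are hypothetical.

```python
import socket
import time


def tcp_connect_rtt(host: str, port: int, timeout: float = 2.0) -> float:
    """Return the time in seconds needed to complete a TCP handshake.

    The handshake takes roughly one round trip, so the elapsed time is
    used here as an approximation of the RTT (an assumption of this sketch).
    """
    start = time.perf_counter()
    # create_connection blocks until the handshake completes or times out.
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start


def probe(host: str, port: int, count: int = 5) -> dict:
    """Inject `count` probe connections and summarise the measured delays."""
    samples = [tcp_connect_rtt(host, port, timeout=2.0) for _ in range(count)]
    return {
        "min": min(samples),
        "avg": sum(samples) / len(samples),
        "max": max(samples),
    }
```

Each call to `probe` opens real connections to the target, which shows the drawback noted above: the probe traffic itself is added to the load on the network under test.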
Passive measurements, either software-based or hardware-based, simply observe existing network traffic and are non-intrusive, or at least cause very little intrusion into the network under test. Network traffic may be tapped at a specific location and can be recorded and processed at different levels of granularity, from complete packet level traces to statistical figures. The results from passive measurements are often hard to interpret but have the benefit of not affecting the network under test.
Measurements can also be performed on different system/protocol layers, for example following the Open Systems Interconnection (OSI) model, including the link layer, network layer, transport layer and even the application layer. Existing measurement systems mainly consider the network and transport layers due to privacy and legal concerns.
Measurements collected on different layers may present varied levels of granularity, from complete packet level traces to statistical figures. Measurements with the coarsest granularity are traffic counters, i.e. cumulative traffic statistics, such as packet volume and counts. Another common practice is to use flow level statistics from NetFlow (Cisco) and sFlow, containing traffic volume information of a specific flow.
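The difference between these granularities can be sketched as follows. The example reduces a packet level trace first to per-flow statistics and then to plain traffic counters. The `Packet` record, the 5-tuple flow key and both function names are assumptions made for illustration; they do not reproduce the record format of NetFlow or sFlow.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """One entry of a packet level trace (hypothetical, simplified)."""
    src: str
    dst: str
    src_port: int
    dst_port: int
    proto: str
    size: int  # bytes


def flow_stats(packets):
    """Aggregate packets into per-flow packet and byte totals.

    A flow is identified here by its 5-tuple (an assumption of this sketch).
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src, p.dst, p.src_port, p.dst_port, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


def traffic_counters(packets):
    """Coarsest granularity: cumulative packet count and byte volume."""
    return {"packets": len(packets), "bytes": sum(p.size for p in packets)}


trace = [
    Packet("192.0.2.1", "192.0.2.9", 40000, 80, "tcp", 1500),
    Packet("192.0.2.1", "192.0.2.9", 40000, 80, "tcp", 500),
    Packet("192.0.2.2", "192.0.2.9", 40001, 53, "udp", 100),
]
per_flow = flow_stats(trace)       # two flows
totals = traffic_counters(trace)   # {"packets": 3, "bytes": 2100}
```

Each aggregation step discards detail: the counters no longer reveal which flow contributed the traffic, and the flow statistics no longer reveal individual packet sizes.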
Although they provide network-wide measurement and performance estimation, the measurement systems known in the art usually give little consideration to compatibility or interoperability. These systems are usually stand-alone, use different performance metrics, employ various underlying measurement mechanisms, and often operate off-line only. Though diverse in underlying mechanisms, these systems have the common goal of providing system-internal characteristics to applications, and their measurements overlap significantly.
There are various disadvantages with the existing solutions. For example, the existing solutions do not take into account that the network equipment has implicit knowledge about relations between measurements that are related through their Resource Service (ReSe) relation. Further, existing solutions continuously process all measurements to capture relations, and aggregate measurements, thereby losing valuable information that could be used for troubleshooting. Furthermore, existing solutions rely on unstructured network measurements and thus try to make the best of the situation. The lack of metadata in counters makes it very hard to correlate measurements from different resources, especially at session level.