VoIP—Voice Over Internet Protocol
Traditional phone calls are carried on fixed lines which guarantee a certain transmission quality. Alternatively, voice communications can be carried over packet-switched networks, such as the Internet, and the term “VoIP” refers to a set of protocols which are suited for delivery of such voice communications on the Internet. VoIP protocols use communication equipment (e.g. voice-mailboxes) to communicate over the Internet. Media in general, such as video, text messages or other data, may be transported as well using the same VoIP protocols. Other similar terms used in this respect are e.g. Internet telephony, IP telephony, broadband telephony or voice over broadband. The communication between two users or pieces of communication equipment is generally called a VoIP session.
VoIP has the advantage of reducing costs, since the communication is routed over existing data networks, and thus avoids the need to maintain separate networks for traditional voice applications and for data traffic such as that of the Internet. Furthermore, VoIP facilitates tasks and provides services that may be more difficult to implement using the public switched telephone network, i.e. fixed lines. It is for example possible to conduct several calls over a single broadband connection. As already hinted at above, other services may be integrated with the normal phone call, such as video conversation or the exchange of messages or data files during the conversation. A selling point for VoIP is that both data and phone calls can be carried on the same line, making the fixed line infrastructure redundant while also better utilizing the single packet-switched network.
On the other hand, VoIP has to deal with problems common to data transmissions via the Internet. In packet-switched environments like the Internet quality guarantees are harder to implement, and thus, quality of VoIP sessions may vary. Overloading the shared infrastructure with data may have a negative effect on the VoIP quality, e.g. increased latency, jitter or packet loss. As a result, monitoring of the VoIP quality is a crucial task for VoIP service providers.
Implementation of VoIP
VoIP employs session control protocols to control the set-up and the tear-down of calls as well as audio-codecs which encode/decode the speech signal, thereby allowing the transmission of the speech over an IP network as digital audio through a stream of media packets.
VoIP services may be considered to consist of a signaling plane and a media plane. On the signaling plane various protocols describe the session (call) flow in terms of involved parties, intermediary VoIP entities (e.g. VoIP proxies, routers) and the characteristics of the VoIP service (call). The media plane typically carries the media information (e.g. audio and/or video data) between the involved parties. Neither the media plane nor the signaling plane alone is sufficient to carry a VoIP service.
VoIP has been implemented in various ways using both proprietary and open protocols and standards, such as:
    H.323
    IP Multimedia Subsystem (IMS)
    Media Gateway Control Protocol (MGCP)
    Session Initiation Protocol (SIP)
    Real-time Transport Protocol (RTP)
    Session Description Protocol (SDP)
    Real Time Streaming Protocol (RTSP)
    Microsoft Media Services (MMS)
On the signaling plane, protocols like SIP (see IETF RFC 3261, “SIP: Session Initiation Protocol”, available at http://www.ietf.org) or ITU-T recommendation H.323 (see H.323, “Packet-based multimedia communications systems”, Edition 7, 2009, available at http://www.itu.int) are commonly used. With regard to the media plane, protocols like RTP (Real-time Transport Protocol, see IETF RFC 3550, “RTP: A Transport Protocol for Real-Time Applications”, available at http://www.ietf.org), MSRP (see IETF RFC 4975, “The Message Session Relay Protocol (MSRP)”, available at http://www.ietf.org) or ITU recommendation T.38 (see T.38, “Procedures for real-time Group 3 facsimile communication over IP networks”, Edition 5 (2007) or Edition 6 (2010), available at http://www.itu.int) are used. Other protocols for the media plane are RTSP (Real Time Streaming Protocol), MMS (Microsoft Media Services protocol) or Real Audio PNM/PNA.
In the following, most of the description will assume that the Real-Time Transport Protocol (RTP) is used in the media plane to carry the media information. However, this should not be understood to mean that the invention is limited to the use of RTP alone. In fact, any of the above-mentioned underlying protocols may be used to transport the VoIP media packets over the packet-switched network according to the invention. A skilled person is able to understand the differences between the various protocols and to adapt the embodiments of the invention according to the particulars of the protocols.
Underneath the signaling and media plane the standard protocols of the Internet Protocol Suite, such as IP, UDP (User Datagram Protocol) or ICMP (Internet Control Message Protocol) are used. FIG. 1 discloses an exemplary protocol stack for a VoIP communication, including for example RTP/RTCP in the session layer, UDP in the transport layer and IP in the network layer. Please note that optionally TCP could be used in the transport layer; however, for real-time traffic UDP is commonly used. On the session layer, it is assumed that SIP is used for session signaling. The protocol stack in FIG. 1 is shown only for exemplary purposes to better understand the following implementation examples.
In contrast to the traditional Public Switched Telephone Network (PSTN), the signaling and media planes may be on different infrastructures, may use different protocols and may even take different routes through a network. Where SIP is used for session setup, this is indeed the case, because usually signaling-only devices, such as SIP proxies and registrars, participate in the session setup. This is illustrated in more detail in FIG. 2, which shows three geographically distributed POPs (Points of Presence). The POPs are linked over the carrier's internal network. The signaling data flows on the signaling plane between POP A and C via the signaling-only POP B, which would typically host a centralized routing entity, thus reducing the complexity in the POPs A and C. To prevent sending media traffic from POP A to POP B before sending it to POP C, the VoIP system is configured in a way that allows the media to flow directly on the media plane between POPs A and C. Typical mid-point monitoring locations would be inside each POP. Due to the layout of the network, no media traffic would be visible in POP B.
VoIP Quality Monitoring
Measuring Voice Quality
Today two main categories of methods exist for measuring voice quality. The first method is called the subjective method, which involves real human test persons who express their opinion about their perceived voice quality. The average quality rating from all test persons is expressed as the Mean Opinion Score (MOS). The MOS is expressed as an Absolute Category Rating (ACR), which defines a 5-point scale from 5 (excellent), 4 (good), 3 (fair), 2 (poor) to 1 (bad). An attempt to obtain repeatable measurement results has been made by defining the ITU-T P.800 industry recommendation (see http://www.itu.int/rec/T-REC-P.800-199608-I), which provides normative speech samples to be used for the subjective test method. The results of the subjective test method are further separated into listening and conversational quality. This is expressed by further specifying the type of the MOS score:
    MOSLQS (Listening Quality, Subjective)
    MOSCQS (Conversational Quality, Subjective)
Since the subjective method does involve human beings, the method is not suited to be automated by test equipment.
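As a small illustration of the subjective method described above, the MOS can be computed as the arithmetic mean of the individual ACR ratings. The following is a minimal sketch only; the function name and the example ratings are hypothetical and do not come from ITU-T P.800.

```python
from statistics import mean

# ACR scale as given in the text: 5 (excellent) down to 1 (bad).
ACR_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}

def mean_opinion_score(ratings):
    """Return the MOS as the arithmetic mean of individual ACR ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in ACR_LABELS:
            raise ValueError(f"rating {rating!r} is not on the 5-point ACR scale")
    return mean(ratings)

# Hypothetical ratings from six test persons (illustrative data only).
example_ratings = [4, 5, 3, 4, 4, 2]
mos = mean_opinion_score(example_ratings)
```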
The second method for measuring voice quality is called the objective method. This method has been designed for automated voice measurement by test and monitoring equipment. The goal of this method is to provide reliable, objective and repeatable measurement results for a voice quality rating that is similar to the subjective method performed by real human beings. Similar to the subjective method, MOS scores for listening and conversational quality are defined:
    MOSLQO (Listening Quality, Objective)
    MOSCQO (Conversational Quality, Objective)
Intrusive and Non-Intrusive Monitoring of Voice Quality
The objective MOS scores can be measured following two very different approaches. The first approach is an intrusive or active method, where the normative speech samples defined in ITU-T P.800 are encoded by a VoIP sender, transferred over the packet-based IP network and then decoded by the VoIP receiver. The MOS score is then calculated by comparing the known speech input signal from the VoIP sender with the received speech signal from the receiver. The method is called intrusive or active because the test signal is transferred in addition to any other VoIP traffic that may be present on the network. Active VoIP monitoring can be used for VoIP readiness tests and prior to the deployment of a VoIP infrastructure because no other VoIP traffic is required, since the test equipment itself generates the test data used for measurement. Active testing has been defined by the industry recommendations ITU-T P.862 PESQ (see http://www.itu.int/rec/T-REC-P.862-200102-I) and ITU-T P.862.1 (see http://www.itu.int/rec/T-REC-P.862.1-200311-I). A benefit of this method is that all factors that can have an impact on VoIP quality are considered, like the VoIP endpoint, codec, noise, delay, echo and the effects of the IP network. The drawback of active testing is that real calls made by real users are not measured. Because of the transient nature of VoIP impairments in IP networks, it is quite possible that the results of active testing do not reflect the quality experienced by real users.
The second approach is the passive, non-intrusive measurement method. With passive monitoring, real VoIP calls are measured so that no artificial traffic needs to be generated. The industry standards ITU-T G.107 E-Model (see http://www.itu.int/rec/T-REC-G.107-200904-P) and ITU-T P.564 (see http://www.itu.int/rec/T-REC-P.564-200711-I) define recommendations for passive monitoring of VoIP traffic in IP networks.
FIG. 3 provides an overview of the different measurement concepts and where in the network they are applied. In general, monitoring may happen anywhere in the network, but the monitoring component can only report quality impacts based on the traffic that traverses its point in the network. Passive monitoring measures real VoIP calls without using a reference speech signal. This also means that deployment of passive monitoring solutions is often easier, because ideally only one location has to be visited. Since the speech payload of live calls is unknown, only those statistics/metrics can be considered that are independent of the speech payload. Mainly, these metrics are loss and jitter.
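To make the two payload-independent metrics concrete, the following minimal sketch estimates packet loss from RTP sequence numbers and interarrival jitter using the running estimator described in IETF RFC 3550. The function names, the 8000 Hz clock rate default and the input layout are assumptions made for illustration; the loss estimate additionally assumes that the 16-bit sequence number does not wrap around within the observed window.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Estimate interarrival jitter in seconds (RFC 3550 style estimator).

    packets: iterable of (arrival_time_seconds, rtp_timestamp) in arrival order.
    clock_rate: RTP timestamp clock in Hz (8000 is an assumed default).
    """
    jitter = 0.0           # running estimate, in RTP timestamp units
    prev_transit = None
    for arrival_s, rtp_ts in packets:
        # Relative transit time: arrival time minus send timestamp.
        transit = arrival_s * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter / clock_rate

def loss_ratio(seq_numbers):
    """Estimate the loss ratio from observed RTP sequence numbers.

    Assumes no sequence number wraparound within the observed window.
    """
    if not seq_numbers:
        return 0.0
    received = len(set(seq_numbers))                    # duplicates count once
    expected = max(seq_numbers) - min(seq_numbers) + 1
    return 1.0 - received / expected
```

Both functions only use header information (sequence number, timestamp) and the arrival time at the probe, so they can be applied without decoding the speech payload.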
A minimal non-intrusive, passive monitoring system consists of a monitoring probe and a test access port (TAP) connected to the network to be tested, and optionally a post-processing platform to visualize the measurement results of the monitoring probe. A TAP is a passive network device which can mirror network traffic without interfering with the original network traffic. It provides a copy of every packet sent or received on the network by separating the full-duplex network link into two half-duplex network links, which are then connected to a specialized packet capture card (network interface card, NIC) installed in the probe. These specialized packet capture cards are capable of receiving and processing packets on the physical interface and of providing them to the application layer, nearly without requiring CPU processing time or operating system functionality. Instead of using a TAP to obtain a copy of VoIP packets, the monitoring probe component may be co-located in a live network entity, e.g. a Session Border Controller or an endpoint. The live component may send a copy of the relevant packets to the monitoring component, eliminating the need for a passive TAP.
FIG. 4 shows an exemplary monitoring system of a passive, non-intrusive monitoring solution deployed in a VoIP network. FIG. 4 indicates possible mid-point monitoring locations (TAP positions) for the monitoring probe within a carrier network. Optionally, multiple monitoring probes can be deployed in the network so that RTP streams can be evaluated end-to-end. Furthermore, the impact of installed network hardware like an SBC or media-gateway on the (RTP) stream quality can be analyzed. Having multiple midpoint monitoring locations allows segmenting the data path between VoIP endpoints into smaller parts. Quality observations and possible degradations can thus be narrowed down to a specific network segment.
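The idea of narrowing a degradation down to a network segment can be sketched as follows. The example assumes that each probe along the media path reports a loss ratio for the same RTP stream; the probe names, values and function names are hypothetical and serve only as an illustration.

```python
def segment_loss_deltas(observations):
    """Return the loss increase per segment between consecutive probes.

    observations: list of (probe_name, loss_ratio), ordered along the
    media path from sender towards receiver.
    """
    deltas = []
    for (up_name, up_loss), (down_name, down_loss) in zip(observations, observations[1:]):
        deltas.append((f"{up_name}->{down_name}", down_loss - up_loss))
    return deltas

def worst_segment(observations):
    """Return the (segment, loss increase) pair with the largest increase."""
    deltas = segment_loss_deltas(observations)
    return max(deltas, key=lambda item: item[1]) if deltas else None

# Hypothetical measurements of one stream at three mid-point probes.
measurements = [("probe_1", 0.0), ("probe_2", 0.001), ("probe_3", 0.05)]
worst = worst_segment(measurements)
```

In this made-up example most of the loss is introduced between the second and third probe, so further analysis would focus on that segment.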
As mentioned above, passive non-intrusive monitoring solutions for VoIP traffic are based on packet flow analysis of (e.g. RTP) streams, which are used to transfer speech over IP networks. This analysis can be performed either as an integral part of a VoIP device, such as an IP-phone or a media-gateway, or at a mid-point somewhere in the network between the two parties. Both approaches have advantages and disadvantages.
If the analysis is integrated into a VoIP device, additional important information becomes available to the packet flow analyzer, like the size of the jitter buffer or information on whether received packets are considered for further processing or are discarded due to late arrival (large jitter). The availability of this information can be a major advantage in accurately estimating the VoIP quality for the end user of the device. A disadvantage is that a device (e.g. an IP-phone) may only have a limited view of the full VoIP service, because only its own incoming or outgoing calls will be subject to monitoring. Data gathered by a single endpoint is not available to a provider-wide VoIP monitoring entity. All other VoIP traffic directed to other endpoints would be unavailable, unless the flow analysis is integrated into every IP-phone, which is hard to achieve in practice. Another disadvantage is that VoIP devices are service-specific hardware with limited performance and resources available for additional data processing for which they have not been designed. Packet flow analysis can be a very CPU-intensive task, and the results have to be stored somewhere. CPU resources and disk space are not sufficiently available on IP-phones or media-gateways.
Because of these limitations, a monitoring solution based on passive mid-point monitoring as shown in FIG. 4 may be advantageous. In this case, monitoring is performed on copies of the network traffic, which are produced by a network TAP to which the monitoring probes are connected. Thus, the quality measurement does not have any impact on the real network traffic and is independent of hardware and manufacturer, while at the same time being able to produce a quality estimate for all live calls being transferred at the network location under test.
The structure of a single VoIP packet in this setup is exemplified in FIG. 5. It should be noted that the header and payload sizes are variable. The protocols employed for VoIP communication over packet-switched networks are based on the ISO/OSI layer model (see FIG. 1) and include different protocols on the various layers of the model. On the lowest relevant layer the Internet Protocol (IP) is used. On each layer the fixed or variable length header is followed by the payload, which contains the packet header of the next protocol and the protocol-specific payload.
The VoIP packet as received at the network interface by the monitoring probe may consist of the MAC header (Ethernet, ATM or other MAC protocol header, for example), the IP header, the UDP header and an RTP header followed by the actual payload, being the speech data. The corresponding IP, UDP and RTP headers are depicted in FIGS. 7, 8 and 9 respectively.
Internet Protocol (IP) packets (see FIG. 7) are the basis of Voice over IP communications. Independent of the underlying physical connection, all data is transported in IP packets. Each IP packet contains a source and destination address to identify the source and destination of the particular packet in the IP address space. Each VoIP endpoint has at least one IP address. Multiple VoIP sessions may share the same IP address, though. In that case higher-level protocol information is used to distinguish between the packets of the various sessions. The IP packet contains a “protocol” field which holds a numeric value representing the payload type transported in this IP packet. The values of this field are standardized. For instance, the numeric value 17 identifies UDP packets and 1 is assigned to ICMP.
An IP packet with the protocol type 17 contains a UDP packet including a UDP header and payload (see FIG. 8). The UDP packet header extends the source and destination address specified on the IP layer by adding a source and destination port for each packet. UDP has been designed to be lightweight and does not include features which could be found in other protocols such as the Transmission Control Protocol (TCP). As a result, it only contains an additional length and checksum field but no payload type.
Throughout most of this document, the RTP is used as an exemplary protocol for media transport over IP based networks, though other suitable protocols including RTSP and MMS may be used in the same way. Each RTP packet (see FIG. 9) contains a variable length header and a variable length payload. Most notably, the first 12 bytes of the RTP header are fixed and contain a synchronization source (SSRC) identifier. This SSRC value is a 32-bit numeric identifier which is contained in every single packet and deemed mandatory by the RTP protocol. It uniquely identifies an RTP media stream.
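The fixed 12-byte part of the RTP header, including the SSRC, can be read as in the following minimal sketch, which follows the field layout of IETF RFC 3550. The function name and dictionary keys are hypothetical, the example packet is built by hand, and padding is not handled.

```python
import struct

def parse_rtp_header(data: bytes):
    """Parse the RTP header and return selected fields plus the payload."""
    if len(data) < 12:
        raise ValueError("too short for the fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    # Variable part 1: CSRC list, 4 bytes per entry (count in low 4 bits).
    header_len = 12 + 4 * (b0 & 0x0F)
    # Variable part 2: optional header extension (X bit set).
    if b0 & 0x10:
        if len(data) < header_len + 4:
            raise ValueError("truncated RTP header extension")
        (ext_words,) = struct.unpack("!H", data[header_len + 2:header_len + 4])
        header_len += 4 + 4 * ext_words
    if len(data) < header_len:
        raise ValueError("truncated RTP header")
    return {
        "payload_type": b1 & 0x7F,
        "marker": bool(b1 & 0x80),
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": data[header_len:],
    }

# Hand-built example: version 2, payload type 0, 160 bytes of dummy payload.
_rtp = struct.pack("!BBHII", 0x80, 0, 1000, 160000, 0x12345678) + bytes(160)
rtp = parse_rtp_header(_rtp)
```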
Shortcomings of VoIP Midpoint Quality Monitoring
In the example of FIG. 6, the monitoring component observes the data exchanged between end customer A and end customer B. The monitoring takes place between the end customer A and the router C. A VoIP quality monitoring device placed on the network path between VoIP-enabled devices can observe the traffic it sees, collect quality indicators and report on the quality observed. Alternatively, the VoIP quality monitoring device need not be located on the network path between the two communicating devices, but in that case would be supplied with the data traffic by e.g. a TAP device located on the network path between the two communicating devices. In any case, the monitoring equipment receives the single data packets and must associate the various different data packets to streams of packets belonging to the same VoIP session.
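One simple way to associate single packets with streams is to group them by their addressing information and SSRC. The following is a minimal sketch under that assumption; the Packet record and the choice of key are hypothetical, and associating the resulting streams with a particular VoIP session would require further information that is not modelled here.

```python
from collections import defaultdict, namedtuple

# Hypothetical per-packet record, as a probe might extract it from the
# IP, UDP and RTP headers.
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port ssrc seq")

def group_into_streams(packets):
    """Group packets into streams keyed by addresses, ports and SSRC."""
    streams = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.ssrc)
        streams[key].append(pkt)
    return dict(streams)

# Illustrative packets of two interleaved streams (made-up values).
observed = [
    Packet("192.0.2.1", 5004, "192.0.2.2", 5006, 0x1111, 1),
    Packet("192.0.2.2", 5006, "192.0.2.1", 5004, 0x2222, 7),
    Packet("192.0.2.1", 5004, "192.0.2.2", 5006, 0x1111, 2),
]
streams = group_into_streams(observed)
```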
The monitoring component may see packets of a certain media stream (1) directed to end customer B. The stream originates from end customer A and passes through the monitoring point and the router C on its way to end customer B. The part of the network carrying the same stream between the router C and end customer B (depicted as stream (2)) cannot be observed by the monitoring point. The monitoring component can diagnose the VoIP session, but at this point it does not know whether the media stream reached the destination (i.e. end customer B) with the same quality, or whether it arrived at all, because it only observed packets from media stream (1). Furthermore, the VoIP quality monitoring system of FIG. 6 cannot make any assumptions about stream (2) between router C and end customer B based on the media stream it previously observed.
There can be various reasons why the quality of stream (2) degrades or even why it may not reach its destination (i.e. customer B). The VoIP quality might degrade due to a congested local network in which the customer B is located. Or, the customer B may be located behind a firewall which blocks stream (2) from reaching customer B. In all those cases, the monitoring probe would not be aware of these errors and would still measure a good VoIP quality based on the received media stream (1).