The Real-time Transport Protocol (RTP) is a protocol for delivering audio and video media data over a packet-switched network. RTP is used for transporting real-time and streaming media data, such as interactive audio and video. It is therefore used in applications such as IPTV, conferencing, and Voice over IP (VoIP).
The Secure Real-time Transport Protocol (SRTP), specified in IETF RFC 3711 from March 2004, is a transport security protocol defined as a profile of RTP, which provides a form of encrypted RTP. In addition to encryption, it can provide message integrity and replay protection in unicast, multicast and broadcast applications. SRTP is used to protect content delivered between peers in an RTP session, and is only intended to protect data during transport between two peers running SRTP. In particular, it does not protect data once it has been delivered to the endpoint of the SRTP session. In addition, the sending peer provides the protection by encrypting the media data. In other words, it is assumed that the sending peer has knowledge of all keying material and is the one applying the protection to the data.
RTP is closely related to RTCP (RTP Control Protocol), which can be used to control the RTP session. Similarly, SRTP has a sister protocol called Secure RTCP (or SRTCP), also specified in RFC 3711. SRTCP provides the same security-related features to RTCP as SRTP provides to RTP.
Use of SRTP or SRTCP with RTP or RTCP is optional, and even when SRTP/SRTCP is used, the individual features (such as encryption and authentication) are optional and can be enabled or disabled separately. The only exception is the message authentication/integrity feature, which is mandatory when using SRTCP. Confidentiality protection in SRTP and SRTCP covers the payload, while integrity protection covers the payload and the full packet header.
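The difference in coverage between confidentiality and integrity protection can be illustrated with a minimal sketch. It assumes an RTP packet without header extension, uses HMAC-SHA1 truncated to 80 bits as the authentication transform, and appends the rollover counter (ROC) to the authenticated data; the function names, the key and the sample packet are hypothetical, and the encryption step itself is omitted (the payload stands in for the already encrypted payload).

```python
import hashlib
import hmac
import struct

TAG_LEN = 10  # 80-bit authentication tag (assumed here)


def split_rtp(packet: bytes) -> tuple[bytes, bytes]:
    """Split an RTP packet into (header, payload).

    Simplification: header extensions (X bit) are not handled.
    Only the payload part would be encrypted.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    csrc_count = packet[0] & 0x0F          # low 4 bits of the first byte
    header_len = 12 + 4 * csrc_count       # fixed header plus CSRC list
    if len(packet) < header_len:
        raise ValueError("packet shorter than its declared header")
    return packet[:header_len], packet[header_len:]


def auth_tag(auth_key: bytes, header: bytes, payload: bytes, roc: int) -> bytes:
    """Tag over header || payload || ROC, so the header is covered too."""
    message = header + payload + struct.pack("!I", roc)
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:TAG_LEN]


# Hypothetical packet: version 2, payload type 0, sequence number 1,
# timestamp 160, SSRC 0x11223344.
header = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x11223344)
packet = header + b"voice-frame"

hdr, payload = split_rtp(packet)
key = b"k" * 20                            # placeholder authentication key
tag = auth_tag(key, hdr, payload, roc=0)

# Flipping a header bit changes the tag, although the header is never encrypted.
tampered_hdr = bytes([hdr[0], hdr[1] ^ 0x01]) + hdr[2:]
tampered_tag = auth_tag(key, tampered_hdr, payload, roc=0)
```

In this sketch a modification of either the header or the payload would be detected by the tag comparison, whereas only the payload would be hidden from an eavesdropper.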
Many content delivery systems and communication services are based on store and forward mechanisms and require end-to-end confidentiality and integrity protection of media. In this scenario, media traverses a first hop between a sender and an intermediate storage entity, and then (almost immediately or after some time) a second hop from the storage entity to a second entity. The second entity may be the intended receiver or yet another intermediate storage entity. Ultimately, the media is delivered to the intended receiver. Each hop to an intermediate node (such as a Store and Forward Server) should nevertheless be integrity protected. (The term “hop” is used herein to denote a logical link between two logically adjacent nodes in a network.) This allows an intermediate node to check the authenticity of arriving media data packets, for example where a mailbox or network answering machine stores media, and is necessary to protect against an attacker filling up the storage on the device with garbage. However, the keys necessary to decrypt the media or to calculate or modify the end-to-end (e2e) integrity protection should not be available to the intermediate node, so that the intermediate node can neither access nor manipulate the plaintext media data.
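The requirement described above, namely that an intermediate node can authenticate arriving packets without holding any e2e key, can be sketched as follows. This is only an illustration of the requirement under stated assumptions: the class and function names, the use of HMAC-SHA1 with an 80-bit tag, and the framing (tag appended to the data) are all hypothetical, and the e2e-protected media is represented by an opaque byte string.

```python
import hashlib
import hmac

TAG_LEN = 10  # assumed 80-bit hop-by-hop tag


def hop_tag(hop_key: bytes, data: bytes) -> bytes:
    """Integrity tag computed with the hop-by-hop key only."""
    return hmac.new(hop_key, data, hashlib.sha1).digest()[:TAG_LEN]


class StoreAndForwardNode:
    """Hypothetical intermediate node that holds a hop key but no e2e key."""

    def __init__(self, hop_key: bytes) -> None:
        self._hop_key = hop_key
        self.stored: list[bytes] = []

    def receive(self, protected: bytes) -> bool:
        """Store the data only if its hop-by-hop tag verifies."""
        if len(protected) <= TAG_LEN:
            return False
        data, tag = protected[:-TAG_LEN], protected[-TAG_LEN:]
        if not hmac.compare_digest(tag, hop_tag(self._hop_key, data)):
            return False                   # garbage is dropped, not stored
        self.stored.append(data)           # still e2e-protected, opaque here
        return True


hop_key = b"h" * 20                        # placeholder hop-by-hop key
node = StoreAndForwardNode(hop_key)

e2e_blob = b"opaque e2e-protected media"   # stand-in for encrypted media
accepted = node.receive(e2e_blob + hop_tag(hop_key, e2e_blob))
rejected = node.receive(b"attacker garbage" + b"\x00" * TAG_LEN)
```

The node never sees a decryption key in this sketch: it stores the e2e-protected data unchanged and discards anything that fails the hop-by-hop check.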
A further issue is that an intermediate node (e.g. a voice mailbox) may handle messages that are all directed to a specific recipient but originate from several senders, and may therefore need to resend several stored and independently e2e protected streams together with media that is hop-by-hop protected. Additional problems may arise if the intermediate node locally generates media to be interleaved with the stored and protected streams. For instance, a voice mailbox may add its own voice instructions to the end-user, e.g. “Press 4 to delete message”. This locally generated data should in general also be protected between the server and the end-user.
SRTP (IETF RFC 3711) protects RTP and RTCP using cryptographic parameters stored in so-called cryptographic contexts. SRTP specifies that the cryptographic context of a media stream must be uniquely identified by a triplet context identifier:
Context id=<SSRC, destination network address, destination transport port number>, where SSRC is the RTP Synchronization Source.
For a given packet, it must be possible for the receiver to identify the context with which the packet should be processed. For this reason, part of the context identifier, namely the SSRC, is carried in-band in the RTP packet header, whereas the other parts, IP address and port, are “implicit” and provided by lower layers. The following description omits the implicit parts from the discussion for clarity.
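As an illustration of the context identification described above, the following minimal sketch keys a table of contexts by the triplet and reads the SSRC from bytes 8 to 11 of the fixed RTP header. The class names, the example address and port, and the content of the context object are hypothetical; a real context would hold keys, counters and replay state.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# (SSRC, destination network address, destination transport port number)
ContextId = tuple[int, str, int]


@dataclass
class CryptoContext:
    label: str  # stand-in for keys and other security-related data


def ssrc_of(packet: bytes) -> int:
    """Read the in-band part of the context identifier from the RTP header."""
    return struct.unpack_from("!I", packet, 8)[0]


class ContextTable:
    def __init__(self) -> None:
        self._contexts: dict[ContextId, CryptoContext] = {}

    def add(self, ssrc: int, addr: str, port: int, ctx: CryptoContext) -> None:
        self._contexts[(ssrc, addr, port)] = ctx

    def lookup(self, packet: bytes, addr: str, port: int) -> Optional[CryptoContext]:
        """addr and port are the implicit parts supplied by lower layers."""
        return self._contexts.get((ssrc_of(packet), addr, port))


table = ContextTable()
ctx = CryptoContext(label="stream from sender A")
table.add(0x11223344, "192.0.2.10", 5004, ctx)

# Hypothetical packet carrying SSRC 0x11223344.
packet = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x11223344) + b"payload"
found = table.lookup(packet, "192.0.2.10", 5004)
missing = table.lookup(packet, "192.0.2.10", 5006)
```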
Media streams can be associated with such a context, which contains keys and other security-related data. The context may be determined using the synchronization source (SSRC) used by a media data source node in e2e encryption direct to the receiver node (termed SSRC_e2e). A problem arises when media streams are sent via an intermediate node. When sending data via an intermediate node that should not have access to the encrypted media, two types of keys are required: an e2e key and a hop-by-hop key. The hop-by-hop key is used by each intermediate node to verify the integrity of the media data coming from the previous-hop node, but it should not be usable to decrypt the media data. When the intermediate node resends media data to a receiver, it may choose a new random SSRC. The SSRC used between the intermediate node and the receiver is then very likely to differ from the SSRC used by the original sender. Since the SSRC is used to identify the cryptographic context at the receiver, it is unlikely that the receiver could retrieve the correct context, and context identification fails.
The above problems become even more apparent and complex when there are multiple media streams from one or more senders that, when forwarded to the destination, should be multiplexed into a single protected stream by an intermediate node. Even if the intermediate node is configured to use the original SSRC of each sender, choosing the context based on the SSRC_e2e may still lead to collisions, because the SSRCs are chosen independently and randomly by the original senders, who do not have to be synchronized in any way. Even with only a few SSRCs, the probability of collision is not negligible and reliable context identification will not be achieved.
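The two failure modes described above can be made concrete with a short sketch. It assumes, as in the preceding discussion, that the implicit address and port parts are omitted so that contexts are keyed by SSRC alone; the SSRC values, context labels and helper names are hypothetical and fixed rather than random so that the outcome is reproducible.

```python
from typing import Optional


def lookup(contexts: dict[int, str], ssrc: int) -> Optional[str]:
    """Receiver-side context identification keyed by SSRC only."""
    return contexts.get(ssrc)


def register(contexts: dict[int, str], ssrc: int, ctx: str) -> bool:
    """Return False if the SSRC already identifies a different context."""
    if ssrc in contexts and contexts[ssrc] != ctx:
        return False
    contexts[ssrc] = ctx
    return True


# Failure mode 1: the intermediate node resends with a fresh SSRC.
receiver_contexts = {0xAAAA0001: "e2e context of sender A"}
relay_ssrc = 0x5EED5EED                    # hypothetical SSRC chosen by the relay
after_relay = lookup(receiver_contexts, relay_ssrc)

# Failure mode 2: the original SSRCs are kept, but two senders happened
# to choose the same SSRC_e2e independently of each other.
multiplexed: dict[int, str] = {}
first_ok = register(multiplexed, 0xAAAA0001, "e2e context of sender A")
second_ok = register(multiplexed, 0xAAAA0001, "e2e context of sender B")
```

In the first case the lookup yields no context at all; in the second, one SSRC value would have to identify two different contexts, so the receiver cannot tell which keys apply to a given packet.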