SIP (session initiation protocol) is an internet protocol that supports creation, modification and termination of sessions with one or more participants. SIP is used for voice and video calls, in either point-to-point or multiparty sessions. It is independent of the media transport, which typically uses, for example, RTP (real-time transport protocol) over UDP (user datagram protocol). SIP is also used for instant messaging and presence detection. SIP allows multiple end-points to establish media sessions with each other. It supports locating the end-points, establishing the session and then, after the media session has been completed, terminating the session. SIP has gained widespread acceptance and deployment among wireline service providers introducing new services such as VoIP (voice over internet protocol), within the enterprise for use in instant messaging and collaboration applications, and among mobile carriers providing push-to-talk services. Industry acceptance of SIP as the protocol of choice for converged communications over IP networks is wide ranging.
As shown in FIG. 1, a SIP infrastructure consists of clients 10A and 10B, SIP proxies 12A-12C, and domain directory servers 14A and 14B deployed across domain networks 16A and 16B and network 18 (e.g. the internet). A client is a SIP endpoint that controls session setup and media transfer. A client is identified by a SIP URI (uniform resource identifier), which is a unique HTTP-like (hypertext transfer protocol) URI of the form sip:client@domain. All user agents can REGISTER their IP address with a SIP directory server (which can be co-located with one of the SIP proxies 12). The mapping of a URI to the IP address of a device registered by the user is done using intermediate SIP proxies and directory servers as part of the session setup process. Details of the SIP protocol can be found in J. Rosenberg et al., SIP: Session Initiation Protocol, RFC 3261, IETF, June 2002.
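By way of illustration only, the URI form and the registration binding described above can be sketched as follows. The names parse_sip_uri and DirectoryServer, the in-memory dictionary and the example addresses are assumptions made for this sketch and are not part of SIP or of any particular server implementation.

```python
from typing import Dict, Optional, Tuple


def parse_sip_uri(uri: str) -> Tuple[str, str]:
    """Split a URI of the form sip:client@domain into (client, domain)."""
    scheme, sep, rest = uri.partition(":")
    if sep != ":" or scheme.lower() not in ("sip", "sips"):
        raise ValueError(f"not a SIP URI: {uri!r}")
    client, at, domain = rest.partition("@")
    if at != "@" or not client or not domain:
        raise ValueError(f"expected sip:client@domain, got {uri!r}")
    return client, domain


class DirectoryServer:
    """Hypothetical in-memory stand-in for a SIP directory server."""

    def __init__(self) -> None:
        self._bindings: Dict[str, str] = {}

    def register(self, uri: str, ip_address: str) -> None:
        # Models the effect of a REGISTER: bind the URI to the device address.
        parse_sip_uri(uri)  # reject malformed URIs
        self._bindings[uri] = ip_address

    def lookup(self, uri: str) -> Optional[str]:
        # Models a proxy query for the current binding of a URI.
        return self._bindings.get(uri)


directory = DirectoryServer()
directory.register("sip:alice@example.com", "192.0.2.10")
print(parse_sip_uri("sip:alice@example.com"))
print(directory.lookup("sip:alice@example.com"))
```

A real directory server would also handle expiry of bindings and multiple contacts per URI, which this sketch omits.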
SIP defines a set of control signals, such as OPTIONS, OK, INVITE, RINGING, ACK, BYE, etc., to set up a data session between clients. These signals are routed through SIP proxies that are deployed in the network. DNS SRV (domain name system service) records in the domain directory servers are used to find the IP address associated with a name in a particular domain, but this process may involve, and often does involve, more than one SIP proxy.
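As a minimal sketch of how SRV records might be used to choose a proxy for a domain, the following builds the SRV query name for SIP and orders a set of already-retrieved records. No DNS query is performed; the SrvRecord type, the function names and the example records are assumptions for illustration. Ordering equal-priority records by descending weight is a deliberate simplification of the weighted random selection that DNS SRV actually specifies.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower values are tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str    # host name of the SIP proxy


def srv_query_name(domain: str, transport: str = "udp") -> str:
    """Build the SRV owner name for SIP, e.g. _sip._udp.example.com."""
    return f"_sip._{transport}.{domain}"


def order_candidates(records: List[SrvRecord]) -> List[SrvRecord]:
    # Simplification: deterministic order instead of weighted random choice.
    return sorted(records, key=lambda r: (r.priority, -r.weight))


records = [
    SrvRecord(20, 0, 5060, "backup.example.com"),
    SrvRecord(10, 40, 5060, "proxy2.example.com"),
    SrvRecord(10, 60, 5060, "proxy1.example.com"),
]
print(srv_query_name("example.com"))
for record in order_candidates(records):
    print(record.target, record.port)
```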
All requests from an originating client, such as an INVITE, are routed by the proxy to an appropriate destination client based on the destination SIP URI included in the INVITE signal. Proxies may query directory servers to determine the current bindings of the SIP URI. Signals are exchanged between clients, proxies and directory servers to locate the appropriate endpoints for media exchange. For reasons of scalability, multiple proxies are used to distribute the signalling load. A normal session is set up between two clients through SIP signalling comprising an INVITE, an OK response and an ACK to the response. The call setup is followed by media exchange using RTP. The session is torn down through an exchange of BYE and OK messages.
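The INVITE/OK/ACK setup and BYE/OK teardown described above can be pictured as a small state machine. The sketch below is illustrative only: the class name SipSession and the state names are assumptions, and retransmissions, error responses and timers are omitted.

```python
class SipSession:
    """Hypothetical tracker for the signalling phases of one session."""

    # (current state, received signal) -> next state
    _TRANSITIONS = {
        ("idle", "INVITE"): "inviting",
        ("inviting", "RINGING"): "inviting",   # provisional, no state change
        ("inviting", "OK"): "answered",
        ("answered", "ACK"): "established",    # media exchange may begin
        ("established", "BYE"): "terminating",
        ("terminating", "OK"): "terminated",
    }

    def __init__(self) -> None:
        self.state = "idle"

    def on_signal(self, signal: str) -> str:
        key = (self.state, signal)
        if key not in self._TRANSITIONS:
            raise ValueError(f"unexpected {signal} in state {self.state}")
        self.state = self._TRANSITIONS[key]
        return self.state


session = SipSession()
for signal in ["INVITE", "RINGING", "OK", "ACK", "BYE", "OK"]:
    print(signal, "->", session.on_signal(signal))
```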
SIP distinguishes between the process of session establishment and the actual session. A basic tenet of SIP is the separation of signalling (control) from media (RTP stream) messages. Control signals are usually routed through the proxies while the media path is end-to-end. Signals such as INVITE carry user parameters in the message body using the Session Description Protocol (SDP) (Handley, M. and V. Jacobson, SDP: Session Description Protocol, RFC 2327, IETF, April 1998). SDP provides information about the session such as parameters for media type, transport protocol, IP addresses and port numbers of endpoints. The IP addresses and port numbers exchanged through SDP are used for the actual data transmission (media path) for the session. Any of these parameters can be changed during an ongoing session through a re-INVITE message, which is identical to the INVITE signal except that it occurs within an existing session. In addition, a client can transfer an existing session by using a REFER signal. This signal instructs the other endpoint of an existing session to initiate an INVITE/OK/ACK exchange with a third client and terminate the existing session (with the sender of the REFER signal).
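To illustrate how the media path parameters are carried in an SDP body, the following sketch extracts the connection address and the media type, port and transport protocol from the c= and m= lines. The function name, the returned dictionary layout and the sample body are assumptions; only a session-level c= line is handled, and all other SDP fields are ignored.

```python
from typing import Any, Dict, List, Optional


def parse_sdp_media(sdp: str) -> Dict[str, Any]:
    """Pull the connection address and media descriptions out of an SDP body."""
    address: Optional[str] = None
    media: List[Dict[str, Any]] = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c=") and address is None:
            # c=<network type> <address type> <connection address>
            parts = line[2:].split()
            if len(parts) >= 3:
                address = parts[2]
        elif line.startswith("m="):
            # m=<media> <port> <transport> <format list>
            fields = line[2:].split()
            if len(fields) >= 3:
                media.append({
                    "type": fields[0],
                    "port": int(fields[1].split("/")[0]),
                    "transport": fields[2],
                })
    return {"address": address, "media": media}


SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
])
print(parse_sdp_media(SAMPLE_SDP))
```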
By default, SIP signals are transmitted with UTF-8 plain text encoding even though they may contain confidential information. However, to maintain privacy, the two IP components of a SIP call, the signals and the data stream, can be encrypted. The calling client may request encryption of the signalling with the first proxy, but there is no mechanism for ensuring that subsequent SIP servers encrypt the signal. When the signalling is unencrypted, an IP router that intercepts the signalling between proxies could identify call information such as the identities and internet protocol addresses of both parties. The calling client would be unaware that the signals were transmitted in plain text on the network. The data stream needs only to be encrypted and decrypted at the end points of the call.
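The exposure described above can be made concrete with a short sketch of what an intermediary could read from an unencrypted INVITE. The function name observe_invite and the sample message (including its addresses and identifiers) are assumptions for illustration; repeated headers and header folding are not handled.

```python
import re
from typing import Dict, Optional


def observe_invite(message: str) -> Dict[str, Optional[str]]:
    """Return the identities visible in the headers of a plain-text INVITE."""
    head = message.split("\r\n\r\n", 1)[0]
    headers: Dict[str, str] = {}
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    def uri_of(value: str) -> Optional[str]:
        match = re.search(r"sips?:[^>;\s]+", value)
        return match.group(0) if match else None

    return {
        "caller": uri_of(headers.get("from", "")),
        "callee": uri_of(headers.get("to", "")),
        "contact": uri_of(headers.get("contact", "")),
    }


SAMPLE_INVITE = "\r\n".join([
    "INVITE sip:bob@example.org SIP/2.0",
    "Via: SIP/2.0/UDP pc33.example.com;branch=z9hG4bK776asdhds",
    "From: Alice <sip:alice@example.com>;tag=1928301774",
    "To: Bob <sip:bob@example.org>",
    "Call-ID: a84b4c76e66710@pc33.example.com",
    "CSeq: 314159 INVITE",
    "Contact: <sip:alice@192.0.2.10>",
    "Content-Length: 0",
]) + "\r\n\r\n"
print(observe_invite(SAMPLE_INVITE))
```

In this sample the caller, the callee and the caller's IP address are all recoverable without any decryption step.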
An alternative solution is to have partial encryption of the signalling, where only SIP headers essential to intermediate proxies are transmitted in plain text. This is typically implemented using S/MIME (Secure/Multipurpose Internet Mail Extensions, a format and protocol for adding signature and/or encryption services to internet messages). This alternative method has two drawbacks. First, since only partial encryption occurs, the level of confidentiality is lower than when using full encryption. Second, as has been noted in RFC 3261, there may be rare network intermediaries (not typical proxy servers) that rely on viewing or modifying the bodies of SIP messages (especially SDP). Use of S/MIME may prevent these sorts of intermediaries from functioning.
Lastly, it should be noted that by using a SIPS URI the user is not guaranteed end-to-end encrypted transport. The user is only guaranteed encrypted transport “from the caller to the domain of the callee” (RFC 3261 Section 4.2).
It is known for a first party to send an invitation to a second party to open a communication channel in the network. The communication channel may be secure once the protocol has been agreed, but the initial invitation, which contains sensitive information such as the identities of the first and second parties, is not. “Security mechanism agreement for SIP” is described in RFC 3329. The purpose of RFC 3329 is to define what encryption to use between two SIP network components, i.e. a low, medium or high encrypted link between the two points. The RFC uses the word “token” to describe the syntax of SIP header fields, but does not describe creating a secure path through one or more proxies.