For many years, voice telephone service was implemented over a circuit-switched network commonly known as the public switched telephone network (PSTN) and controlled by a local telephone service provider. In such systems, the analog electrical signals representing the conversation are transmitted between the two telephone handsets on a dedicated twisted-pair copper-wire circuit. More specifically, each telephone handset is coupled to a local switching station by a dedicated pair of copper wires known as a subscriber loop. When a telephone call is placed, the circuit is completed by dynamically coupling each subscriber loop to a dedicated pair of copper wires between the two switching stations.
A circuit-switched system inherently has a level of security adequate for the day-to-day telephone communication needs of the average person, even when DTMF-driven menus are used for entering account numbers and passwords to access and/or perform financial transactions.
First, circuit-switched systems are relatively secure and reliably route a telephone call to the destination bound to the telephone number dialed. While it is possible to route a call (or many calls) to an “imposter” destination for purposes of using call content (such as DTMF tones representing account numbers and passwords) for criminal activity, the expense and complexity required to do so make it an impractical means for average criminals.
Second, eavesdropping or wiretapping requires coupling a listening device directly to the circuit, which is cumbersome. Wiretapping multiple lines anywhere but at a switching station requires coupling to each circuit individually. While it is theoretically possible for one with criminal intent to wiretap many lines, again, the expense and complexity required to do so make it an impractical means for average criminals.
More recently, however, telephone service has been implemented over the Internet. Advances in the speed of Internet data transmissions and in Internet bandwidth have made it possible for telephone conversations to be communicated using the Internet's packet-switched architecture and the TCP/IP and UDP/IP protocols.
To promote the widespread use of Internet telephony, the International Telecommunication Union (ITU) has developed the H.323 set of standards, and the Internet Engineering Task Force (IETF) has developed the Session Initiation Protocol (SIP) and the Media Gateway Control Protocol (MGCP) for signaling and establishing peer-to-peer Voice-over-Internet Protocol (VoIP) media sessions.
In an example of using an MGCP system, an MGCP gateway, commonly called a multi-media terminal adapter (MTA), emulates a PSTN central office switch for supporting operation of one or more PSTN telephony devices. The MTA detects such events as on hook, off hook, and DTMF signaling and generates applicable notify (NTFY) messages to inform a remote MGCP call agent of each event. The MTA also receives various messages from the MGCP call agent and, in response, generates applicable in-band signals (such as ring, caller ID, and call waiting) on the PSTN link to the PSTN telephony device.
To establish a peer-to-peer media session between two MTAs, the calling MTA initiates the session by sending applicable notify (NTFY) messages to an MGCP call agent. The MGCP call agent sends a sequence of create connection (CRCX) messages and modify connection (MDCX) messages to each of the calling MTA and the callee MTA such that the two can establish a Real-time Transport Protocol (RTP) media session between them using UDP/IP channels.
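The session setup described above can be sketched as an ordered list of messages. This is only an illustrative model: the names `Message`, `setup_sequence`, `MTA-A`, `MTA-B`, and `CallAgent` are hypothetical, and the particular ordering of CRCX and MDCX messages is one plausible arrangement assumed for the example rather than an ordering stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    verb: str   # MGCP verb: NTFY, CRCX, or MDCX
    note: str   # informal description; not MGCP syntax


def setup_sequence(caller="MTA-A", callee="MTA-B", agent="CallAgent"):
    """Return one assumed ordering of messages for establishing a session."""
    return [
        # The calling MTA reports line events to the call agent.
        Message(caller, agent, "NTFY", "off hook and dialed digits"),
        # The call agent creates a connection on each MTA.
        Message(agent, caller, "CRCX", "caller reports its RTP address and port"),
        Message(agent, callee, "CRCX", "carries the caller's RTP address and port"),
        # The call agent tells each MTA about its peer so RTP can flow over UDP/IP.
        Message(agent, caller, "MDCX", "carries the callee's RTP address and port"),
        Message(agent, callee, "MDCX", "enables two-way media once the call is answered"),
    ]


if __name__ == "__main__":
    for m in setup_sequence():
        print(f"{m.sender:>9} -> {m.receiver:<9} {m.verb}  ({m.note})")
```

Note that in this model all signaling passes through the call agent, while the resulting RTP media session runs directly between the two MTAs.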
A problem associated with such Internet telephony systems is that the network architecture typically includes “multi-drop” subnets wherein the frames representing an RTP media session are available to any other device coupled to the subnet. This architecture enables an individual to easily and inexpensively eavesdrop on all of the RTP media sessions transmitted on the subnet. More specifically, applicable network systems and software running on a personal computer (PC) coupled to the subnet could simultaneously detect, sequence, and record all RTP media sessions transmitted on the subnet. Further, if there is a desire to perpetrate financial fraud, the same PC would be capable of running software to detect DTMF tones representing account numbers and passwords within the various RTP media sessions.
It is certainly possible to encrypt the RTP media session to avoid eavesdropping. However, known encryption systems and key management systems are ineffective, cumbersome, and/or expensive when applied to a system that could include thousands of RTP endpoints establishing peer-to-peer media sessions for the exchange of real-time media.
For example, an asymmetric encryption algorithm and digital certificates could be used for mutual authentication of the two RTP endpoints and to secure the RTP media session between them. However, digital certificate distribution is cumbersome and costly. Further, asymmetric encryption systems require significant processing power. In an environment wherein the RTP media stream must be encrypted and deciphered within a limited period of time to avoid noticeable communication delays, the circuits required for implementing an asymmetric encryption algorithm would be extremely costly.
As another example, an asymmetric encryption algorithm and digital certificates could be used for mutual authentication of the two RTP media session endpoints, but a symmetric encryption algorithm and an agreed key could be used for securing the RTP media session. Such a system would have the benefit that the circuitry required for performing symmetric encryption and deciphering within the time frames required to avoid noticeable delay in an RTP media session is inexpensive and readily available. However, each RTP media session endpoint would still be required to perform asymmetric encryption algorithms and to have expensive digital certificate technology for mutual authentication and for the exchange of messages needed for mutual assent to the symmetric encryption key.
As yet another example, a symmetric encryption algorithm could be used for securing the media session, with Diffie-Hellman key agreement used for mutual assent to the symmetric encryption key. Because the symmetric key calculated by each MTA using Diffie-Hellman cannot feasibly be derived from the Diffie-Hellman public values exchanged over the network, eavesdropping on the media session by a passive third party is computationally infeasible. However, if the exchange of Diffie-Hellman public values occurs in plain text, there is no mutual authentication. An imposter on the subnet could place itself between the two legitimate endpoints and substitute its own Diffie-Hellman public values in the key agreement messages exchanged with each endpoint, thereby becoming a “middle-man” through which the RTP media session is relayed. The middle-man would then have access to the unencrypted RTP media session.
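The key agreement and the middle-man substitution described above can be illustrated with a minimal sketch. The modulus `P` and base `G` below are toy values assumed purely for illustration (a practical system would use a much larger, standardized group), and the function names `keypair` and `shared_secret` are hypothetical.

```python
import secrets

# Toy parameters, assumed for illustration only; far too small for real use.
P = 2**61 - 1   # a Mersenne prime used as the modulus
G = 3           # base


def keypair():
    """Return (private, public) where public = G**private mod P."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    return private, pow(G, private, P)


def shared_secret(own_private, peer_public):
    """Compute the shared value peer_public**own_private mod P."""
    return pow(peer_public, own_private, P)


# Honest exchange: each endpoint sends its public value in plain text.
a_priv, a_pub = keypair()
b_priv, b_pub = keypair()
honest_a = shared_secret(a_priv, b_pub)
honest_b = shared_secret(b_priv, a_pub)   # equals honest_a

# Middle-man: an imposter intercepts both public values and substitutes its own.
m_priv, m_pub = keypair()
a_key = shared_secret(a_priv, m_pub)      # endpoint A believes this is shared with B
b_key = shared_secret(b_priv, m_pub)      # endpoint B believes this is shared with A
m_key_with_a = shared_secret(m_priv, a_pub)
m_key_with_b = shared_secret(m_priv, b_pub)
```

In the honest exchange both endpoints arrive at the same value without ever transmitting it. In the substituted exchange, each endpoint instead shares a key with the imposter, which can decipher traffic from one side and re-encrypt it for the other, and neither endpoint can detect the substitution without some form of authentication.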
Of course, an asymmetric encryption algorithm could be used for mutual authentication of the two RTP media session endpoints and to secure the exchange of Diffie-Hellman key agreement messages. In that case, however, Diffie-Hellman adds no value: because the key exchange channel is already secured using the asymmetric encryption algorithm, a less complex key agreement scheme could be used. Further, each RTP media session endpoint would still be required to perform asymmetric encryption algorithms and to have expensive digital certificate technology for mutual authentication and for the exchange of messages needed for mutual assent to the symmetric encryption key.
What is needed is a system and method for securing an RTP media session that does not suffer the disadvantages of known systems. More specifically, what is needed is a system and method for securing an RTP media session that does not require digital certificate distribution (or distribution of other mutual authentication systems) to each of multiple RTP media session endpoints and/or the performance of asymmetric encryption algorithms by each of multiple RTP media session endpoints.