The present invention relates generally to the field of internet protocol (IP) communications, and specifically to a system for securely delivering voice packets over an IP network.
Conventional stream ciphers for securely encrypting information in communication networks are well known. Stream ciphers, a class of encryption algorithms, may be employed to encrypt data. Encryption converts data into an unintelligible form, e.g., ciphertext, that cannot be easily understood by unauthorized users. The reverse process, known as decryption, converts encrypted content to its original form such that it becomes intelligible. Simple ciphers include a rotational shift of letters in the alphabet, the substitution of letters for numbers, and the “scrambling” of voice signals by inverting the sideband frequencies.
More complex ciphers work according to sophisticated computer algorithms that rearrange the data bits in digital information. In order to easily recover the encrypted information, the correct decryption key is required. The key is an algorithm that decodes the work of the encryption algorithm. The more complex the encryption algorithm, the more difficult it becomes to decode the communications without access to the key. Encryption algorithms are well known to those of ordinary skill in the art and need not be discussed in detail.
In internet protocol (IP) networks, there are various instances in which encryption may be employed. A user may wish to communicate voice packets over the Internet (voice over IP, or VoIP) via a personal computer to a remote end user's personal computer, for example. Similarly, a head end (cable central office) may wish to transmit multimedia information to its consumers using RTP (Real-time Transport Protocol). Advances in compression algorithms and computer processing power make it possible to support real time communication over packet networks. Protocols like RTP now provide end-to-end transport functions for multimedia transmissions. Typically, a user is coupled to an IP network via a telephony adapter (TA). In packet cable networks, a cable telephony adapter (CTA) or multimedia terminal adapter (MTA) may be employed. The MTA converts content such as voice or data into packets for transmission on the network, and converts received packets into digital or analog signals for use by the user. To implement a secure channel between two users in the IP network, the associated MTAs use the same keys and encryption ciphers.
One such stream cipher is RC4, which involves continuously generating a pseudo-random key stream (of bytes) which is combined with the original clear text data using an exclusive-OR (XOR) operation. Like other stream ciphers, however, RC4 requires that the same portion of a key stream must not be reused to encrypt multiple messages. Failure to meet this constraint will result in the encryption being more susceptible to unauthorized decryption. Furthermore, many stream ciphers require an external synchronization source which enables the sender and receiver key streams to be synchronized. In this manner, the cipher text can be decrypted at the remote location.
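The key stream reuse constraint described above can be illustrated with a minimal sketch. The function names (`rc4_keystream`, `rc4_crypt`) and the `skip` and `offset` parameters are hypothetical and chosen for this illustration only. The sketch implements textbook RC4 and is not taken from any particular MTA implementation.

```python
def rc4_keystream(key: bytes, n: int, skip: int = 0) -> bytes:
    """Return n RC4 key stream bytes, starting `skip` bytes into the stream."""
    # Key-scheduling algorithm: permute the state array using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm: emit one key stream byte per step.
    out = bytearray()
    i = j = 0
    for count in range(skip + n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        if count >= skip:  # discard bytes before the requested offset
            out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))


def rc4_crypt(key: bytes, data: bytes, offset: int = 0) -> bytes:
    """Encrypt or decrypt `data` with the key stream starting at `offset`."""
    return xor_bytes(data, rc4_keystream(key, len(data), offset))


# Two messages encrypted with the SAME portion of the key stream.
key = b"illustrative-key"   # assumption: any shared secret
p1 = b"first voice frame"
p2 = b"other voice frame"
c1 = rc4_crypt(key, p1, offset=0)
c2 = rc4_crypt(key, p2, offset=0)

# The key stream cancels out: an eavesdropper learns p1 XOR p2 without the key.
leaked = xor_bytes(c1, c2)
```

Because XOR with the same key stream bytes cancels, `leaked` equals the XOR of the two clear texts, which is why each packet must use a distinct key stream offset.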
Within PacketCable, for example, time stamps (RTP) are used as a pointer (synchronization source) to the RC4 random stream of bytes. The RTP time stamp is a number (32 bit) contained within an RTP packet header which specifies the sampling instant of the first byte in the RTP packet. The sampling instant is derived from a clock which increases linearly in time, so the time stamp can be used for synchronization. Specifically for audio streams,
RC4 Key Stream Offset = Frame Number * Frame size
The frame number is the number of audio frames generated since the start of the stream and can be derived directly from the RTP time stamp. The Frame size is given in bytes.
Frame Number = (RTP Time stamp − RTP Initial Time stamp) / Nu
where Nu is the number of audio samples in an uncompressed frame of audio.
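The frame-based offset calculation can be sketched as follows. The function names are hypothetical, the modulo-2^32 handling of time stamp wrap-around is an assumption based on the 32-bit time stamp width, and the numeric values (80 samples and 10 bytes per frame) are illustrative only.

```python
RTP_TS_MODULUS = 2 ** 32  # the RTP time stamp is a 32-bit number


def frame_number(rtp_ts: int, rtp_initial_ts: int, nu: int) -> int:
    """Frame Number = (RTP Time stamp - RTP Initial Time stamp) / Nu."""
    # Assumption: the difference is taken modulo 2**32 so that a single
    # wrap-around of the 32-bit time stamp is tolerated.
    elapsed = (rtp_ts - rtp_initial_ts) % RTP_TS_MODULUS
    if elapsed % nu != 0:
        raise ValueError("time stamp is not a multiple of the frame size")
    return elapsed // nu


def rc4_key_stream_offset(rtp_ts: int, rtp_initial_ts: int,
                          nu: int, frame_size: int) -> int:
    """RC4 Key Stream Offset = Frame Number * Frame size (in bytes)."""
    return frame_number(rtp_ts, rtp_initial_ts, nu) * frame_size


# Illustrative values: 80 samples per uncompressed frame, 10 bytes per frame.
initial_ts = 1000
offset = rc4_key_stream_offset(initial_ts + 5 * 80, initial_ts,
                               nu=80, frame_size=10)
```

With these values, five frames have elapsed and the packet is encrypted starting 50 bytes into the key stream.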
However, only some of the audio CODECs are frame-based; for example, the G.711 CODEC is sample-based, where an RTP packet can contain any number of samples. In the case of a sample-based CODEC, a virtual frame size can be assumed, where all RTP packets would contain a multiple of that frame size (even though the CODEC itself is not frame-based). For example, if RTP packets with the G.711 CODEC always contain 3, 6 or 9 samples, the virtual frame size could be assumed to be 3 samples (corresponding to 1-frame, 2-frame and 3-frame packets).
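One way to pick a virtual frame size for a sample-based CODEC is sketched below. Using the greatest common divisor of the packet sizes is an assumption made for this illustration. The text above only requires that every packet contain a multiple of the chosen size. The function name `virtual_frame_size` is hypothetical.

```python
from functools import reduce
from math import gcd


def virtual_frame_size(packet_sample_counts):
    """Largest frame size (in samples) that divides every packet size."""
    counts = list(packet_sample_counts)
    if not counts or any(c <= 0 for c in counts):
        raise ValueError("need at least one positive packet size")
    return reduce(gcd, counts)


# The example from the text: packets of 3, 6 or 9 samples.
size = virtual_frame_size([3, 6, 9])
frames_per_packet = [n // size for n in (3, 6, 9)]
```

For the packet sizes in the text, this yields a virtual frame of 3 samples, so the packets carry 1, 2 and 3 frames respectively.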
Equivalently, for both frame-based and sample-based audio CODECs the RC4 Key Stream Offset calculation can be based directly on the number of samples (instead of frames). In the formulas below, sample size is specified in bytes:
RC4 Key Stream Offset = Sample Number * Sample size
Sample Number = (RTP Time stamp − RTP Initial Time stamp)
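The equivalence of the sample-based and frame-based calculations can be checked with a short sketch. The function names are hypothetical, and the values (1 byte per sample, a virtual frame of 3 samples) are illustrative assumptions in the spirit of the G.711 example above.

```python
RTP_TS_MODULUS = 2 ** 32  # the RTP time stamp is a 32-bit number


def offset_from_samples(rtp_ts: int, rtp_initial_ts: int,
                        sample_size: int) -> int:
    """RC4 Key Stream Offset = Sample Number * Sample size (in bytes)."""
    # Assumption: modulo arithmetic tolerates one time stamp wrap-around.
    sample_number = (rtp_ts - rtp_initial_ts) % RTP_TS_MODULUS
    return sample_number * sample_size


def offset_from_frames(rtp_ts: int, rtp_initial_ts: int,
                       nu: int, frame_size: int) -> int:
    """RC4 Key Stream Offset = Frame Number * Frame size (in bytes)."""
    elapsed = (rtp_ts - rtp_initial_ts) % RTP_TS_MODULUS
    return (elapsed // nu) * frame_size


# Illustrative values: 1-byte samples, virtual frame of 3 samples (3 bytes).
initial_ts = 7000
ts = initial_ts + 9  # nine samples after the start of the stream
by_samples = offset_from_samples(ts, initial_ts, sample_size=1)
by_frames = offset_from_frames(ts, initial_ts, nu=3, frame_size=3)
```

Both calculations give the same offset whenever the frame size in bytes equals Nu times the sample size, which holds for a fixed-rate CODEC.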
One limitation of this invention is that it applies to fixed-rate audio CODECs only—if it is a variable-rate codec with a variable sample size, this encryption method does not apply.
Typically, CODECs (COder/DECoders) are employed for coding and decoding information into and from frames having information samples. Due to the variety of CODECs available in the industry, CODECs may implement different frame sizes. As noted, RTP time stamps are used as a synchronization source for the RC4 random stream of bytes. The time stamp provides an indication of the number of audio frames processed and is typically a multiple of the frame size (plus a random initial value). However, during a communication session, if a CODEC change occurs, the frame size (as well as the sample size) will also change, so that the above formula can no longer be used to determine the RC4 key stream offset. Furthermore, the RTP time stamp is no longer a multiple of the new frame size. The net result of a CODEC change is that information cannot be decrypted at the receiving end.
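The effect of a mid-session CODEC change on the offset formula can be illustrated with a small sketch. The function name `offset_or_none` is hypothetical, and the frame parameters (80 samples and 10 bytes per frame before the change, 240 samples and 24 bytes per frame after it) are illustrative assumptions.

```python
def offset_or_none(rtp_ts, rtp_initial_ts, nu, frame_size):
    """Frame-based RC4 key stream offset, or None if the time stamp is
    not a whole number of frames past the initial time stamp."""
    elapsed = (rtp_ts - rtp_initial_ts) % 2 ** 32
    if elapsed % nu != 0:
        return None
    return (elapsed // nu) * frame_size


initial_ts = 12345

# CODEC change after 5 frames of the old CODEC (5 * 80 = 400 samples).
ts_a = initial_ts + 5 * 80
old_a = offset_or_none(ts_a, initial_ts, nu=80, frame_size=10)
new_a = offset_or_none(ts_a, initial_ts, nu=240, frame_size=24)
# new_a is None: 400 samples is not a multiple of the new 240-sample frame.

# CODEC change after 6 frames (480 samples): the time stamp happens to align,
# but the new parameters still give a different offset.
ts_b = initial_ts + 6 * 80
old_b = offset_or_none(ts_b, initial_ts, nu=80, frame_size=10)
new_b = offset_or_none(ts_b, initial_ts, nu=240, frame_size=24)
```

In the first case the formula has no valid result at all. In the second case the new parameters give an offset of 48 bytes although 60 bytes of key stream have already been consumed, so applying the formula unchanged would re-use part of the key stream.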
Conventional techniques have been specified so that the time stamp continues to be a multiple of the new frame size after a CODEC change. One such technique provides a formula for adjusting the time stamp, wherein an adjustment value is added to the time stamp in order to adjust the RC4 key stream. However, the adjustment value added to the time stamp depends on exactly which audio frame is being processed when the CODEC change is discovered. With MGCP-based call signaling, each endpoint is controlled by a Call Agent (also referred to as a Gateway Controller), and there is no guarantee that the two communicating endpoints will be notified by their Call Agents of the CODEC change at exactly the same time. Thus, a high probability exists that after the CODEC change the two MTAs would lose synchronization on their RC4 key streams and subsequent RTP packets could not be decrypted.
A further problem relates to the receipt of identical RTP synchronization source (SSRC) identifiers by a gateway terminating several voice connections, that is, the event that two different sessions are assigned identical session identifiers. The RTP standard requires that each endpoint generating RTP session identifiers (SSRCs) allow for the contingency that two identical SSRCs collide at a mixer or a bridge. If such a collision occurs, an RTP BYE message is employed to hang up one of the RTP sessions, and a new session is started with a new SSRC value. Herein lies a problem similar to the above CODEC change problem. The sequence numbers and the time stamp sequence are both re-initialized, which causes a re-start with the same initial time stamp value and thus the re-use of portions of the previously used key stream.
Therefore, there is a need to resolve the aforementioned problems relating to the conventional approach for securely delivering voice packets over an IP network.