Voice over Internet Protocol (VOIP) refers to a communication service, such as that used for voice, fax, SMS, and/or voice-messaging applications, that is transported via the internet rather than the public switched telephone network. The typical steps involved in originating a VOIP communication session are signaling and media channel setup, digitization of the analog voice signal, encoding, and packetization and transmission of the voice signal as Internet Protocol (IP) packets over a packet-switched network. On the receiving side, similar steps typically occur in the reverse order: reception of the IP packets, decoding of the packets, digital-to-analog conversion, and reproduction of the original information, such as a voice stream of a user.
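The originating-side steps above (digitization, encoding, packetization) can be sketched as follows. This is a minimal illustration: the `packetize` helper and its header layout are hypothetical, not a wire-accurate RTP implementation.

```python
import struct

def packetize(samples, frame_size=160, seq_start=0):
    """Split a digitized voice stream into fixed-size frames and wrap
    each in a minimal header (sequence number + sample offset).
    The header layout is illustrative only."""
    packets = []
    for i, off in enumerate(range(0, len(samples), frame_size)):
        frame = samples[off:off + frame_size]
        header = struct.pack("!HI", (seq_start + i) & 0xFFFF, off)
        packets.append(header + bytes(frame))
    return packets

# 8 kHz sampling, 20 ms frames -> 160 samples per packet
stream = [0] * 800            # 100 ms of (silent) digitized audio
pkts = packetize(stream)
print(len(pkts))              # 5 packets for 100 ms of speech
```

The sequence numbers allow the receiver to detect loss and reordering, which the best-effort IP network discussed below does not prevent.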
VOIP systems also usually employ session control protocols to control the setup and tear-down of calls, as well as audio codecs that encode speech for transmission over an IP network as digital audio via an encoded audio stream. The codec used varies between VOIP implementations: some rely on narrowband, compressed speech, while others support high-fidelity stereo codecs.
VOIP has been implemented in numerous ways using both proprietary and open protocols and standards, such as H.323, the IP Multimedia Subsystem (IMS), the Media Gateway Control Protocol (MGCP), the Session Initiation Protocol (SIP), the Real-time Transport Protocol (RTP), the Session Description Protocol (SDP), the Skype protocol, and the like.
Communication over an IP network can be less reliable than over the traditional circuit-switched public telephone network, as an IP network does not typically provide a network-based mechanism to ensure that data packets are not lost or that they are delivered in sequential order. IP networks are typically best-effort networks without fundamental quality of service (QoS) guarantees. Therefore, VOIP implementations may face problems mitigating latency, jitter, packet loss, and packet reception order. By default, IP routers handle traffic on a first-come, first-served basis, with routers on high-volume traffic links introducing latency that exceeds permissible thresholds for VOIP. Fixed delays typically cannot be controlled, as they are caused by the physical distance the packets travel; latency can, however, be minimized by marking voice packets as delay-sensitive with known techniques. A VOIP packet usually has to wait for the current packet to finish transmission. While it is possible to preempt a less important packet in mid-transmission, this is not commonly done, especially on high-speed links where transmission times are short even for maximum-sized packets.
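The jitter and reordering problems above are commonly mitigated at the receiver with a jitter (playout) buffer. The following is a simplified sketch, assuming integer sequence numbers; a real implementation would also handle timestamp wrap-around, adaptive buffer depth, and packet-loss concealment.

```python
import heapq

class JitterBuffer:
    """Reorders arriving packets by sequence number and absorbs jitter
    by holding a few packets before playout begins."""
    def __init__(self, depth=3):
        self.depth = depth        # packets held before playout starts
        self.heap = []
        self.next_seq = None

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next in-order payload, or None while buffering,
        on underrun, or when a packet is treated as lost."""
        if self.next_seq is None and len(self.heap) < self.depth:
            return None           # still buffering
        if not self.heap:
            return None           # underrun
        seq, payload = self.heap[0]
        if self.next_seq is None or seq == self.next_seq:
            heapq.heappop(self.heap)
            self.next_seq = seq + 1
            return payload
        if seq < self.next_seq:   # duplicate or very late packet
            heapq.heappop(self.heap)
            return self.pop()
        self.next_seq += 1        # gap: treat as lost, let concealment fill it
        return None

buf = JitterBuffer(depth=3)
for seq in (2, 1, 3, 5, 4):      # packets arriving out of order
    buf.push(seq, f"frame-{seq}")
print([buf.pop() for _ in range(5)])   # frames emerge in sequence order
```

The buffer depth trades added latency against tolerance for jitter, which is why VOIP deployments tune it rather than fix it globally.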
A number of protocols have been defined to support the reporting of QoS/QoE (Quality of Experience) for VOIP calls. There are also layer-two quality of service metrics that focus on quality of service issues at the data link and physical layers and that can be used to ensure that applications such as VOIP work well even in congested network environments.
As discussed, VOIP is the descriptor for the technology used to carry digitized voice over an IP data network. VOIP typically requires two classes of protocols: a signaling protocol, such as SIP, H.323, or MGCP, that is used to set up, disconnect, and control calls and telephony features; and a protocol to carry the speech packets themselves. For example, the Real-time Transport Protocol (RTP) carries the speech transmission. RTP is an IETF standard introduced in 1996, around the time H.323 was standardized. RTP works with any signaling protocol and is the protocol commonly used among IP PBX vendors. Most IP phones or softphones generate a voice packet every 10, 20, 30, or 40 ms, depending on the vendor's implementation. The 10 to 40 ms of digitized speech can be uncompressed, compressed, and/or optionally encrypted, with many packets utilized to carry one word.
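The relationship between codec bitrate, packetization interval, and per-packet payload size can be checked with a short calculation. The `payload_bytes` helper is hypothetical; the 64 kbps and 8 kbps rates correspond to G.711 and G.729, discussed below.

```python
def payload_bytes(bitrate_bps, interval_ms):
    """Payload size of one voice packet for a given codec bitrate
    and packetization interval."""
    return bitrate_bps * interval_ms // 1000 // 8

# Common intervals for G.711 (64 kbps) vs. G.729 (8 kbps)
for interval in (10, 20, 30, 40):
    print(f"{interval} ms: G.711={payload_bytes(64000, interval)} B, "
          f"G.729={payload_bytes(8000, interval)} B")
```

At a 20 ms interval, for example, G.711 yields 160-byte payloads while G.729 yields 20-byte payloads, which is why header overhead dominates bandwidth for heavily compressed codecs.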
The voice codecs encode the voice data in the packet structures for transmission over the data network and can compare the acoustic information (each frame of which includes spectral information, such as sound or audio amplitude as a function of frequency) in temporally adjacent packet structures and assign to each packet an indicator of the difference between the acoustic information in adjacent packet structures. A VOIP endpoint typically includes, in memory, numerous voice codecs capable of different compression ratios. Some typical codecs include G.711, G.723.1, G.726, G.728, and G.729; however, it is to be understood that any voice codec, whether known currently or developed in the future, could be stored in memory. Voice codecs encode and/or compress the voice data in the packet structures. For example, a compression of 8:1 is achievable with the G.729 voice codec (thus the normal 64 kbps PCM signal is transmitted in only 8 kbps). The encoding functions of codecs are further described in Michaelis, P. R., "Speech Digitization and Compression," in W. Karwowski (Ed.), International Encyclopedia of Ergonomics and Human Factors, pp. 683-685, London: Taylor and Francis, 2001; ITU-T Recommendation G.729, General Aspects of Digital Transmission Systems, Coding of Speech at 8 kbit/s using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction, March 1996; and Mahfuz, "Packet Loss Concealment for Voice Transmission Over IP Networks," September 2001, each of which is incorporated herein by reference.
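The adjacent-frame comparison described above can be sketched as follows. This is a toy illustration: the mean-absolute-amplitude feature stands in for the full spectral information a real codec would compare, and the `difference_indicators` helper is a hypothetical name.

```python
def difference_indicators(frames):
    """For each frame of samples, compute an indicator of how much its
    acoustic content differs from the previous frame. A real codec
    would compare spectral envelopes, not just average amplitude."""
    feats = [sum(abs(s) for s in f) / len(f) for f in frames]
    # First frame has no predecessor, so its indicator is zero.
    return [0.0] + [abs(b - a) for a, b in zip(feats, feats[1:])]

frames = [
    [0, 0, 0, 0],              # silence
    [100, -100, 100, -100],    # abrupt onset -> large difference
    [110, -90, 100, -100],     # similar content -> small difference
]
print(difference_indicators(frames))   # [0.0, 100.0, 0.0]
```

Such per-packet difference indicators let the receiver judge how safely a lost packet can be concealed by repeating or interpolating its neighbors.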
There are also many techniques available for the digitization and creation of the packets. For a general discussion of the operation of vocal tract models, see Speech Digitization and Compression. In general, these techniques use mathematical models of the human speech production mechanism. Accordingly, many of the variables in the models actually correspond to the different physical structures within the human vocal tract that vary while a person is speaking. In a typical implementation, the encoding mechanism breaks voice streams into individual short duration frames. The audio content of these frames is analyzed to extract parameters that “control” components of the vocal tract model. The individual variables that are determined by this process include the overall amplitude of the frame and its fundamental pitch. The overall amplitude and fundamental pitch are the components of the model that have the greatest influence on the tonal contours of speech, and are extracted separately from the parameters that govern the spectral filtering, which is what makes the speech understandable and the speaker identifiable. Tone contour transformation may therefore be performed by applying the appropriate delta to the original amplitude and pitch parameters detected in the speech. Because changes are made to the amplitude and pitch parameters, but not to the spectral filtering parameters, the transformed voice stream will still generally be recognizable as being the original speaker's voice. The transformed speech may then be sent to the recipient address, stored, broadcast or otherwise released to the listener. For example, where the speech is received in connection with leaving a voice mail message for the recipient, sending the transformed speech may comprise releasing the transformed speech to the recipient address.
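The tone contour transformation described above can be sketched on a per-frame parameter set. The frame layout and the `transform_tone_contour` helper are hypothetical; the key point, per the description above, is that the deltas touch only amplitude and fundamental pitch while the spectral-filtering parameters pass through unchanged.

```python
def transform_tone_contour(frames, amp_delta=1.2, pitch_delta=1.1):
    """Apply multiplicative deltas to the amplitude and fundamental-pitch
    parameters of each analyzed frame, leaving the spectral-filter
    parameters untouched so the transformed stream remains recognizable
    as the original speaker."""
    out = []
    for f in frames:
        out.append({
            "amplitude": f["amplitude"] * amp_delta,
            "pitch_hz": f["pitch_hz"] * pitch_delta,
            "spectral": f["spectral"],   # unchanged: preserves identity
        })
    return out

# One hypothetical analyzed frame: amplitude, pitch, spectral coefficients
frames = [{"amplitude": 0.5, "pitch_hz": 120.0, "spectral": [0.1, 0.3]}]
print(transform_tone_contour(frames))
```

Because the spectral coefficients are copied through verbatim, resynthesizing from the transformed parameters alters the tonal contour of the speech without altering what makes it intelligible or attributable to the speaker.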