1. Field of the Invention
The present invention generally relates to VoIP (Voice over Internet Protocol), and more particularly, to a call routing method based on a MOS (Mean Opinion Score) prediction value. The method provides the highest voice quality available in a restricted environment by predicting voice quality over IP (Internet Protocol) from the network parameters that affect voice quality, and by routing a call to the gateway in a gateway group that has the maximum predicted MOS value.
2. Description of the Related Art
For many years, engineers have put a great deal of effort into operating voice calls over the public telephone network more efficiently and into improving voice quality. As a result, the public telephone network is now capable of supporting real-time voice services with the required qualities, such as low propagation delay and low jitter. Users have become accustomed to the voice quality of the public telephone network and now regard it as the standard.
The IP network, on the other hand, has been built around non-real-time applications, primarily file transfer and e-mail. Since such applications demand high bandwidth and generate bursty traffic, occasional delay and jitter were not considered serious problems.
If the public telephone network and the IP network are to be integrated, the IP network has to be converted to a network architecture that assures QoS (Quality of Service) for voice services. Because the Internet is based on packet switching, various parameters affect speech quality. Speech quality is also easily degraded by the many factors involved in interworking between the public telephone network and the Internet.
To use Internet bandwidth more efficiently, voice data is often compressed for transmission. However, compression degrades voice quality regardless of the transmission system used. In particular, when voice data is compressed more than twice in a row, voice quality degrades considerably.
Compressor/decompressor (codec) systems and digital signal processing (DSP) are commonly used in voice communications because they conserve bandwidth. But they also degrade voice fidelity. The best codecs provide the most bandwidth conservation while producing the least degradation of the signal. Bandwidth can be measured using laboratory instruments, but voice quality requires human interpretation.
Also, voice data is very sensitive to delay. In general, most users start complaining about voice quality when network (transfer) delay totals 150-200 ms.
Network delay consists of propagation delay and handling delay. Propagation delay arises in the transmission medium of the network, such as optical fiber or copper. Handling delay arises primarily because the communication equipment that processes the information needs time to take it in and send it out; it includes codec delay and queuing delay.
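The delay components described above add up to a one-way delay budget. The following is a minimal Python sketch of that sum. The component values and the function name `one_way_delay_ms` are illustrative assumptions, not measurements from this document; only the 150 ms lower bound comes from the text.

```python
# Illustrative delay components in milliseconds (assumed values).
PROPAGATION_MS = 30.0   # travel time through fiber or copper
CODEC_MS = 25.0         # handling delay: encoding and decoding
QUEUING_MS = 40.0       # handling delay: waiting in equipment queues

# Lower bound of the 150-200 ms range at which users start complaining.
COMPLAINT_THRESHOLD_MS = 150.0


def one_way_delay_ms(propagation, codec, queuing):
    """Total network delay = propagation delay + handling delay."""
    handling = codec + queuing
    return propagation + handling


total = one_way_delay_ms(PROPAGATION_MS, CODEC_MS, QUEUING_MS)
within_budget = total < COMPLAINT_THRESHOLD_MS
print(f"total delay: {total:.0f} ms, within budget: {within_budget}")
```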
Besides propagation delay and handling delay, jitter is another factor that has a great impact on voice quality. Jitter is the deviation between the time at which a voice data packet is expected to arrive and the time at which it actually arrives. For instance, suppose that a transmitter sends packets A, B, and C at a regular time interval (i.e. D1). The receiver may not receive packets A, B, and C at that regular interval, because the propagation delay and the handling delay change with the traffic state of the network.
Jitter substantially degrades the quality of voice data because the interval between packets is no longer uniform. To counter jitter, communication equipment uses a jitter buffer that restores a uniform packet interval. Jitter occurs more frequently in data communication equipment, such as a router or a frame relay switch, that disregards jitter when managing traffic and handles voice data in the same way as general data.
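The jitter and jitter-buffer behavior described above can be sketched in Python as follows. This is a simplified illustration with a fixed buffer depth. The function names, the 20 ms value chosen for D1, and the arrival times are assumptions made for the example, not details taken from this document.

```python
def interarrival_jitter(send_times, recv_times):
    """Deviation of each arrival gap from the corresponding send gap (ms)."""
    deviations = []
    for i in range(1, len(send_times)):
        send_gap = send_times[i] - send_times[i - 1]
        recv_gap = recv_times[i] - recv_times[i - 1]
        deviations.append(recv_gap - send_gap)
    return deviations


def playout_times(send_times, recv_times, buffer_ms):
    """Fixed jitter buffer: each packet is played at its send time plus the
    transit time of the first packet plus buffer_ms. A packet that arrives
    after its playout time is marked None (too late to play)."""
    base = recv_times[0] - send_times[0] + buffer_ms
    schedule = []
    for sent, received in zip(send_times, recv_times):
        playout = sent + base
        schedule.append(playout if received <= playout else None)
    return schedule


send = [0, 20, 40]    # packets A, B, C sent every D1 = 20 ms (assumed)
recv = [50, 85, 95]   # irregular arrival times (assumed)

jitter = interarrival_jitter(send, recv)
schedule = playout_times(send, recv, buffer_ms=20)
print(jitter)    # [15, -10]
print(schedule)  # [70, 90, 110]
```

With a 20 ms buffer the packets are played out at the original 20 ms spacing. A shallower buffer lowers delay but causes late packets to be discarded, which is the trade-off a jitter buffer has to balance.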
Above all, echo is probably one of the factors that have the greatest influence on voice quality. Echoes occur because of the impedance mismatch at the transition from a 4-wire switch to a 2-wire switch in the conventional long-distance telephone network.
Most people feel reassured when they hear their own voices through the handset while talking on the phone. However, if they hear their voices echoed back after more than 25 ms, they no longer feel that way; the echoes interfere with their conversation. The echo canceller was introduced to remove this echo. It stores an inverse image of the transmitted speech for a certain period of time and removes the echoed component from the received data.
An echo canceller is limited by the total time it waits for an echoed voice. This time is called the echo trail. In general, a waiting time of about 32 ms is long enough.
Meanwhile, the most important consideration in designing a VoIP network is limiting the bandwidth it uses. Bandwidth varies with the codec and the number of frames per packet. For example, suppose that two 10 ms G.729 voice frames are loaded into a single packet using the 8 kbps G.729 codec. In this case, 24 kbps of bandwidth is actually needed. Although the voice payload is only 20 bytes per packet, because each G.729 voice frame is 10 bytes, 40 bytes of IP (Internet Protocol), UDP (User Datagram Protocol), and RTP (Real-Time Transport Protocol) headers are required for each packet, so 60 bytes are sent every 20 ms. Even without counting the header overhead of the data link layer (e.g. PPP (Point-to-Point Protocol), Frame Relay, Ethernet, etc.), the header overhead is twice the voice payload. Therefore, 24 kbps of bandwidth is acceptable only on a high-speed transmission circuit such as T1 (1.544 Mbps) or E1 (2.048 Mbps); it is a considerable burden on a low-speed transmission circuit such as 56 kbps.
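The 24 kbps figure above follows directly from the packet size and the packet interval. A minimal Python sketch of the arithmetic is given below; the function name `voip_bandwidth_kbps` and its parameters are hypothetical names chosen for this illustration.

```python
def voip_bandwidth_kbps(frame_bytes, frame_ms, frames_per_packet, header_bytes):
    """Bandwidth of one voice stream, excluding data link layer overhead."""
    packet_bytes = frame_bytes * frames_per_packet + header_bytes
    packet_interval_ms = frame_ms * frames_per_packet
    # Bits per millisecond is numerically equal to kilobits per second.
    return packet_bytes * 8 / packet_interval_ms


IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # 40 bytes in total

# Two 10-byte, 10 ms G.729 frames per packet, as in the example above.
with_headers = voip_bandwidth_kbps(10, 10, 2, IP_UDP_RTP_HEADER_BYTES)
payload_only = voip_bandwidth_kbps(10, 10, 2, 0)
print(with_headers, payload_only)  # 24.0 8.0
```

The difference between the two results (16 kbps) is the header overhead, which is twice the 8 kbps codec rate.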
RTP is an Internet standard protocol for transmitting real-time data, including voice and video, over an IP network. RTP consists of a data part and a control part called RTCP (RTP Control Protocol). RTP supports real-time applications such as audio and video, and provides several functions, e.g. timing reconstruction, loss detection, and content identification. Through RTCP, it also supports QoS monitoring and the synchronization of different media streams at the destination.
In addition, in a packet voice environment (G.729) in which voice is packetized every 20 ms, RTP carries a 20-byte payload. The voice packet then consists of an IP header (20 bytes), a UDP header (8 bytes), an RTP header (12 bytes), and the payload (20 bytes). Because the 40-byte header is twice the size of the payload, the header consumes the majority of the bandwidth when a packet is generated every 20 ms. A method called CRTP (Compressed RTP) reduces this header overhead. As the name implies, CRTP compresses the header.
In particular, the CRTP method suits a low-speed transmission circuit because it reduces the bandwidth required per call, for example from 24 kbps to 11.2 kbps. For instance, if CRTP is used on a 56 kbps transmission circuit, four G.729 VoIP calls can be carried. According to the CRTP method, if UDP checksums are not sent, the IP/UDP/RTP header is reduced to 2 bytes, and if UDP checksums are used, the header is reduced to 4 bytes.
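The effect of header compression on per-call bandwidth can be checked with the same kind of arithmetic. The sketch below is illustrative only. Note that a 2-byte header on a 20-byte payload gives 8.8 kbps by itself; the 11.2 kbps figure quoted above corresponds to 28 bytes per 20 ms packet, which is reproduced here by assuming about 6 bytes of data link layer overhead per packet. That 6-byte value is an assumption of this example and is not stated in the text.

```python
PAYLOAD_BYTES = 20       # two 10-byte G.729 frames
PACKET_INTERVAL_MS = 20  # one packet every 20 ms


def bandwidth_kbps(header_bytes, link_overhead_bytes=0):
    """Per-call bandwidth for a given header size (bits per ms == kbps)."""
    packet_bytes = PAYLOAD_BYTES + header_bytes + link_overhead_bytes
    return packet_bytes * 8 / PACKET_INTERVAL_MS


uncompressed = bandwidth_kbps(40)        # full IP/UDP/RTP header
crtp_no_checksum = bandwidth_kbps(2)     # compressed, no UDP checksum
crtp_with_checksum = bandwidth_kbps(4)   # compressed, UDP checksum kept
# Assumed: 2-byte compressed header plus 6 bytes of link layer overhead.
crtp_on_link = bandwidth_kbps(2, link_overhead_bytes=6)

four_calls = 4 * crtp_on_link            # load of four calls on a 56 kbps link
print(uncompressed, crtp_no_checksum, crtp_with_checksum, crtp_on_link)
```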
The CRTP method is very similar to the TCP (Transmission Control Protocol) header compression method. Both are based on the observation that, although several header fields change in every packet, the difference from one packet to the next is often constant. The uncompressed header and the primary (first-order) differences are kept in session state that is shared between a compressor and a decompressor. When the secondary (second-order) difference of every transmitted field is 0 (zero), the decompressor, upon receiving each compressed packet, simply adds the primary difference to the stored uncompressed header. As a result, the original header can be reconstructed without losing any information.
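The reconstruction step described above can be illustrated with a short Python sketch. It is a simplified model that tracks only the RTP sequence number and timestamp; the class name `HeaderState`, the function name `decompress_next`, and the sample values are assumptions for illustration and do not represent a complete CRTP implementation.

```python
from dataclasses import dataclass


@dataclass
class HeaderState:
    """Per-session state shared by compressor and decompressor (simplified)."""
    seq: int         # last RTP sequence number (16-bit field)
    timestamp: int   # last RTP timestamp (32-bit field)
    delta_seq: int   # first-order difference of the sequence number
    delta_ts: int    # first-order difference of the timestamp


def decompress_next(state):
    """Rebuild the next header fields when the second-order difference is
    zero: add the stored first-order differences to the stored header."""
    state.seq = (state.seq + state.delta_seq) % 2**16
    state.timestamp = (state.timestamp + state.delta_ts) % 2**32
    return state.seq, state.timestamp


# Assumed example: 20 ms packets on an 8 kHz RTP clock, so the timestamp
# advances by 160 per packet and the sequence number by 1.
state = HeaderState(seq=100, timestamp=16000, delta_seq=1, delta_ts=160)
first = decompress_next(state)
second = decompress_next(state)
print(first, second)  # (101, 16160) (102, 16320)
```

Because the differences are constant, the compressed packet does not need to carry these fields at all; it only needs to signal that nothing unexpected changed.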
Just as TCP/IP header compression maintains shared state for a plurality of simultaneous TCP connections, IP/UDP/RTP compression needs to maintain a plurality of session environments. Generally, a session environment is defined by the combination of the IP source and destination addresses, the UDP source and destination ports, and the RTP SSRC (Synchronization Source) field. A compressor applies a hashing function to these fields to index a table of stored session environments.
A compressed packet carries a small integer called the SCID (Session Context Identifier), which indicates the session environment in which the packet is to be interpreted. The decompressor uses this SCID to index its stored session environment table directly.
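The two lookups described above, hashing on the compressor side and direct SCID indexing on the decompressor side, can be sketched as follows. This is a minimal illustration that uses a Python dictionary as the hash table; the class and method names are hypothetical, and SCID allocation is simplified to a running counter.

```python
class Compressor:
    """Maps the five identifying fields of a session to a small SCID."""

    def __init__(self):
        self._scid_by_key = {}  # dict lookup hashes the 5-field key

    def scid_for(self, src_ip, dst_ip, src_port, dst_port, ssrc):
        key = (src_ip, dst_ip, src_port, dst_port, ssrc)
        if key not in self._scid_by_key:
            # New session: allocate the next unused identifier.
            self._scid_by_key[key] = len(self._scid_by_key)
        return self._scid_by_key[key]


class Decompressor:
    """Indexes stored session environments directly by SCID."""

    def __init__(self):
        self._contexts = {}

    def store(self, scid, full_header):
        # Called when an uncompressed header establishes or refreshes state.
        self._contexts[scid] = full_header

    def lookup(self, scid):
        return self._contexts[scid]


comp = Compressor()
a = comp.scid_for("10.0.0.1", "10.0.0.2", 16384, 16386, 0x1234)
b = comp.scid_for("10.0.0.1", "10.0.0.3", 16384, 16388, 0x5678)
a_again = comp.scid_for("10.0.0.1", "10.0.0.2", 16384, 16386, 0x1234)

decomp = Decompressor()
decomp.store(a, {"seq": 100, "timestamp": 16000})
print(a, b, a_again, decomp.lookup(a))
```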
In most cases, CRTP can compress a 40-byte header to a 2- to 4-byte header. However, if a field of the IP/UDP/RTP header that is expected to stay constant changes, the header cannot be compressed, because its content no longer matches the stored header. In other words, if a change is made in a field such as the payload type, the original uncompressed header must be transmitted. CRTP is aimed at conserving bandwidth, so it is highly recommended for a WAN interface that carries a large amount of RTP traffic.
In a high-speed backbone network, however, traffic volume and transmission speed are very high. CRTP is therefore not appropriate there, because of the processing overhead of compression and decompression.
Meanwhile, multiframe transmission is another example of a bandwidth management method. As the name implies, plural frames are transmitted together in one packet in order to reduce the per-packet header overhead. If the RTP payload is transmitted in this way, the bandwidth actually consumed for a given amount of RTP payload can be reduced.
When constructing a frame, one cannot simply ignore the bits occupied by each header. In terms of transmission efficiency, it is better to transmit a plurality of data cells under one header than to transmit each data cell under its own header.
In the case of G.723.1, up to three frames (90 ms) can be transmitted as a multiframe. In the case of G.729A, up to nine frames (90 ms) can be transmitted as a multiframe. For jitter buffering at an H.323 endpoint terminal receiving voice packets, at least two multiframes (180 ms) need to be buffered. This is because the H.323 specification limits the terminal delay to no more than 180 ms.
As such, up to 3 or 9 frames are packed together for transmission. Although more frames could be packed as needed, doing so increases the transmission delay, as mentioned above. Hence, it is recommended to pack an appropriate number of frames. For instance, in the case of G.723.1, 2 or 3 frames are packed together and transmitted in a multiframe structure; in the case of G.729A, a maximum of 9 frames can be packed and transmitted in a multiframe structure.
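The trade-off between header overhead and packetization delay in multiframe transmission can be made concrete with a small Python sketch. The function name `multiframe_cost` is hypothetical. The 10-byte, 10 ms frame for G.729A follows the figures given earlier; the 24-byte, 30 ms frame used for G.723.1 assumes its 6.3 kbps mode, which this document does not specify.

```python
HEADER_BYTES = 40  # IP (20) + UDP (8) + RTP (12)


def multiframe_cost(frame_bytes, frame_ms, frames_per_packet):
    """Return (packetization delay in ms, bandwidth in kbps) for one stream."""
    delay_ms = frame_ms * frames_per_packet
    packet_bytes = frame_bytes * frames_per_packet + HEADER_BYTES
    kbps = packet_bytes * 8 / delay_ms  # bits per ms == kbps
    return delay_ms, kbps


# G.729A: 10-byte frames every 10 ms.
for n in (1, 2, 9):
    delay, kbps = multiframe_cost(10, 10, n)
    print(f"G.729A  {n} frame(s): {delay} ms, {kbps:.1f} kbps")

# G.723.1: 24-byte frames every 30 ms (6.3 kbps mode assumed).
for n in (1, 3):
    delay, kbps = multiframe_cost(24, 30, n)
    print(f"G.723.1 {n} frame(s): {delay} ms, {kbps:.1f} kbps")
```

Packing more frames lowers the bandwidth per call, but the packetization delay grows in proportion, reaching 90 ms at the maximum multiframe size for both codecs.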
According to the dynamic jitter buffering, CRTP, and multiframe methods of the related art, however, each process operates independently according to a predetermined method once a call route has been established. Therefore, the methods of the related art cannot respond actively to a continually changing network environment, and as a result the voice quality they achieve is poor.
Moreover, these methods of the related art only treat the voice quality of RTP packets, as affected by the network, after call setup has been completed along a predetermined call routing path. Hence, the voice quality they can achieve is very limited.
In recent voice communications, particularly Internet telephony, a mean opinion score (MOS) provides a numerical measure of the quality of human speech at the destination end of the circuit. The scheme uses subjective tests (opinion scores) that are mathematically averaged to obtain a quantitative indicator of system performance. The following U.S. patents, both of which are incorporated by reference herein, utilize the mean opinion score (MOS): U.S. Pat. No. 6,490,552 to K. Y. Martin Lee et al. entitled “Methods and Apparatus for Silence Quality Measurement” and U.S. Pat. No. 6,609,092 to Oded Ghitza et al. entitled “Method and Apparatus for Estimating Subjective Audio Signal Quality from Objective Distortion Measures.”
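The averaging step behind a MOS can be shown in a few lines of Python. This sketch assumes the commonly used five-point listening scale (1 = bad, 5 = excellent); the scale, the function name `mean_opinion_score`, and the sample ratings are assumptions for illustration and are not taken from this document or the cited patents.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of subjective listener ratings on a 1-5 scale."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)


# Hypothetical ratings from five listeners for one call.
listener_ratings = [4, 4, 3, 5, 4]
mos = mean_opinion_score(listener_ratings)
print(mos)  # 4
```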