1. Field of the Invention
The present invention generally relates to a voice encoding method for voice transmission through an IP (Internet protocol) network, and particularly relates to the voice encoding method that alleviates deterioration in voice quality at a receiving end when a packet is lost in the transmission.
2. Description of the Related Art
VOIP (Voice Over IP) is a known technology for transmitting voice over an IP network. FIG. 1 shows a basic structure of a VOIP transmission system. The VOIP transmission system principally comprises user terminals such as telephone sets 101 and 107, access/conventional networks 102 and 106, VOIP gateways (VOIPGW) 103 and 105, and the Internet 104. The VOIPGWs 103 and 105 are located between the access/conventional networks 102 and 106, respectively, and the Internet 104. FIG. 2 shows a basic structure of a voice processing unit of the VOIPGW. The VOIPGW voice processing unit principally comprises an access/conventional network interface 201, a voice encoding unit 202, a packet assembling unit 203, a voice decoding unit 204 and a packet disassembling unit 205. In VOIP, a voice signal that is input to the VOIPGW 103 or 105 from the access/conventional network 102 or 106, respectively, is encoded at a low bit rate by the voice encoding unit 202 and then transmitted. The encoded voice signal is multiplexed with data packets, thereby reducing the cost of voice communication.
However, the basic structure shown in FIG. 1 suffers from the following problems. The first problem is that the delay time becomes long as packets are transmitted via a plurality of routers in the IP network. The second problem is that the packet arrival times fluctuate (i.e., jitter occurs) as the packets pass through various buffers. The third problem is that a packet may be lost due to data overflow at these buffers or due to errors occurring during data transmission, which deteriorates the quality of the voice reproduced at a receiving end.
Conventional techniques for compensating for lost packets on the transmitting side include the following, for example. The first technique is to return information about the packet loss from the receiving end to the transmitting side so that a frame corresponding to the lost packet is retransmitted. The second technique employs an interleave process, which alleviates the effect of packet loss by randomizing errors. The third technique employs FEC (Forward Error Correction) encoding.
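The interleave process mentioned above can be illustrated with a minimal sketch of a block interleaver. The function names, the unit granularity and the depth value are assumptions made for illustration only and are not taken from the cited references. Voice units are written row by row into a table of `depth` columns and read out column by column, so that units lost together in one packet end up separated after de-interleaving.

```python
def interleave(units, depth):
    """Reorder units by reading a row-wise table column by column.

    Assumes len(units) is a multiple of depth (illustrative restriction).
    """
    rows = len(units) // depth
    return [units[r * depth + c] for c in range(depth) for r in range(rows)]


def deinterleave(units, depth):
    """Inverse of interleave(): restore the original order."""
    rows = len(units) // depth
    out = [None] * len(units)
    idx = 0
    for c in range(depth):
        for r in range(rows):
            out[r * depth + c] = units[idx]
            idx += 1
    return out


# Twelve voice units, three units per packet (hypothetical sizes).
sent = interleave(list(range(12)), depth=4)
# Suppose the second packet (positions 3 to 5 of the sent stream) is lost.
damaged = [u if not 3 <= i <= 5 else None for i, u in enumerate(sent)]
restored = deinterleave(damaged, depth=4)
lost_positions = [i for i, u in enumerate(restored) if u is None]
```

In this sketch the loss of one packet produces three isolated one-unit gaps rather than one gap of three consecutive units, at the cost of the buffering delay needed to fill the table.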
Examples of conventional techniques that can be employed on the receiving side are as follows. The first is a method of inserting a substitute waveform in place of a lost frame. The second method interpolates a waveform from the waveforms of the frames preceding and following the lost frame, or from the waveform of the preceding frame alone. The third method interpolates voice codec parameters from those of the preceding and following frames so as to reproduce voice from the interpolated parameters. These techniques are described in “A Survey of Packet Loss Recovery Techniques for Streaming Audio,” IEEE Network Magazine, the September/October issue, pp. 40-48, 1998, and “Internet Telephony: Services, Technical Challenges, and Products,” IEEE Communications Magazine, the April issue, pp. 96-103, 2000.
The first and the second techniques employed on the transmitting side are principally used in delivery services where time delays are permissible. FIG. 3 shows an example of a media-specific interpolation process that corresponds to the third technique employed on the transmitting side described above.
In FIG. 3, frames of an original voice stream are denoted by reference numerals 301 through 304; four frames are shown in this example. Here, the frame 303 is encoded into a coded parameter 313-3 at the ordinarily used bit rate, and is also encoded into another coded parameter 314-3 by a voice encoder having a bit rate lower than the ordinarily used bit rate. The coded parameter 313-3 that is ordinarily used and the coded parameter 314-3 corresponding to the lower bit rate voice encoder are inserted into a frame 313 and a frame 314, respectively, which have respective FEC codes added thereto, and are then transmitted as packets. If the packet 313 is lost during the transmission, the coded parameter 314-3 of the lower bit rate voice encoder is used in place of the ordinarily used coded parameter 313-3, thereby reproducing a waveform corresponding to the voice frame 303 that should have been transmitted by the packet 313. The processing delay in this method is one frame interval. In order to obtain voice quality of a desired level, the lower bit rate encoder is required to be capable of encoding at about 2 to 4 kbps. Accordingly, in the case of a frame length of 20 msec, adding the coded parameter 314-3 of the lower bit rate voice encoder requires redundant data (i.e., overhead) of about 40 to 80 bits per frame.
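The overhead figure and the fallback behavior described above can be checked with a short sketch. The function names and the packet layout (a packet n carrying the ordinary parameters of frame n together with the lower bit rate copy of frame n-1) are assumptions for illustration, chosen to be consistent with the one-frame delay stated above; they are not a description of any particular implementation.

```python
def redundancy_overhead_bits(bit_rate_bps, frame_ms):
    """Bits per frame needed to carry the lower bit rate copy."""
    return bit_rate_bps * frame_ms // 1000


def select_parameters(received, n):
    """Pick the coded parameters used to decode frame n.

    `received` maps a packet sequence number to a pair
    (ordinary parameters of that frame,
     lower bit rate parameters of the previous frame).
    This layout is a hypothetical one used only for this sketch.
    """
    if n in received:
        return received[n][0]        # packet arrived: use ordinary parameters
    if n + 1 in received:
        return received[n + 1][1]    # packet lost: use copy in next packet
    return None                      # both lost: nothing to decode from


# 2 to 4 kbps at a frame length of 20 msec
low = redundancy_overhead_bits(2000, 20)
high = redundancy_overhead_bits(4000, 20)

# Packet 3 is lost; packets 2 and 4 arrive.
received = {2: ("P2", "R1"), 4: ("P4", "R3")}
choice = select_parameters(received, 3)
```

Because the fallback copy arrives in the following packet, the receiver must wait one frame interval before it can decide how to decode a frame, which corresponds to the processing delay noted above.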
By contrast, in the conventional techniques where the lost packet is interpolated on the receiving end, the interpolation process can be performed without the overhead. FIG. 4 shows a basic structure for performing a conventional interpolation method on the receiving end. FIG. 4 shows the voice decoding unit 204, which principally includes a packet disassembling unit 401, a voice decoding unit 402, and an interpolation process unit 403. A coded parameter output from the packet disassembling unit 401 is provided to the voice decoding unit 402, which reproduces and outputs a voice waveform. If there is a packet loss in the received packets, a packet loss index indicative of the lost packet is supplied to the interpolation process unit 403. The interpolation process unit 403 performs an interpolation process, examples of which are described in the following.
A first example is to multiply the reproduced waveform of the frame preceding the lost packet by a window function, and to use the obtained waveform as the waveform of the frame that has suffered the packet loss. A second example is to interpolate coded parameters from the frames preceding and following the frame that has suffered the packet loss, thereby reproducing the voice of the lost frame based on the interpolated parameters. In this case, LPC (Linear Prediction Coding) parameters, for example, are obtained by linear interpolation from the parameters of the frames preceding and following the lost frame. As for other parameters, the same parameter values as those of the preceding frame are used.
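The two interpolation examples above can be sketched as follows. The function names, the linear fade-out window, the midpoint weighting and the dictionary layout of the coded parameters are all assumptions made for illustration; the text above does not specify the window function or the representation in which the LPC parameters are interpolated.

```python
def conceal_by_windowed_repeat(prev_frame, window=None):
    """First example: previous frame's waveform multiplied by a window.

    If no window is given, a linear fade-out from 1.0 to 0.0 is assumed.
    """
    n = len(prev_frame)
    if window is None:
        window = [1.0 - i / (n - 1) for i in range(n)] if n > 1 else [1.0]
    return [s * w for s, w in zip(prev_frame, window)]


def interpolate_parameters(prev_params, next_params):
    """Second example: parameters for the lost frame.

    LPC parameters are the midpoint of the preceding and following frames;
    every other parameter is copied from the preceding frame.
    """
    out = dict(prev_params)
    out["lpc"] = [(a + b) / 2.0
                  for a, b in zip(prev_params["lpc"], next_params["lpc"])]
    return out


# Hypothetical values chosen only to make the result easy to read.
substitute = conceal_by_windowed_repeat([1.0, 1.0, 1.0])
prev_p = {"lpc": [0.25, 0.5], "gain": 1.5, "pitch": 40}
next_p = {"lpc": [0.75, 1.0], "gain": 0.5, "pitch": 50}
lost_p = interpolate_parameters(prev_p, next_p)
```

Note that the second function needs the frame following the lost frame, so it can only run after that frame has been received, whereas the first needs only the preceding frame.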
It is known that the method based on parameter interpolation offers better reproduction quality than the other techniques employed on the receiving end for interpolating and recovering the lost packet. However, this method has the following problems.
A first problem is that, despite the presence of a plurality of available interpolation and recovery processes, the conventional method is configured to use only one of such processes. Accordingly, the process employed for interpolation and recovery of a lost packet may not be the best one from the viewpoint of an S/N (signal to noise) ratio or the viewpoint of subjective quality.
A second problem is that if the lost packet contains a consonant section, the voice may still lose clarity even after the interpolation and recovery process.