Services that use Voice over IP (Internet Protocol) technology to transmit voice signals are becoming widespread. As shown in FIG. 1, a voice signal from an input terminal 11 is converted into voice packets in a voice signal transmitting unit 12 and transmitted over a packet communication network 13, such as an IP network, to a voice signal receiving unit 14, where the voice signal is reproduced and output to an output terminal 15. When packets are communicated in real time, packet losses can occur on the packet communication network 13 depending on the conditions of the network, causing quality degradation such as audible discontinuities in the reproduced speech. This problem is especially pronounced during network congestion in so-called best-effort communication services such as the Internet, which tolerate packet losses.
Therefore, a technique called packet-loss concealment is used when voice signals are transmitted over a packet communication network. In this technique, if a packet is lost somewhere on the communication channel, or does not arrive at the receiving end within a time limit because of a delay on the communication channel, the voice signal in the segment corresponding to the packet that has vanished or has not arrived (hereinafter referred to as a “loss packet” or “lost packet”) is estimated and compensated for at the receiving end. FIG. 2 shows an example of a typical configuration of the voice signal transmitting unit 12 shown in FIG. 1. An input voice signal is stored in an input buffer 21, the voice signal is then split into time segments of a predetermined length, called frames, from which voice packets are generated in a voice packet generating unit 22, and the voice packets are sent out from a packet transmitting unit 23 to the packet communication network. One frame is typically about 10 to 20 milliseconds long.
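The framing step performed on the transmitting side can be illustrated by the following minimal sketch. It is not taken from any particular transmitting unit: the function names, the 8 kHz sampling rate, the zero-padding of a short final frame, and the sequence number attached to each packet are all assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a list of samples into fixed-length frames.

    The sampling rate and frame length are assumed values; a short final
    frame is zero-padded (also an assumption).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0] * (frame_len - len(frame))  # pad a short final frame
        frames.append(frame)
    return frames


def make_packets(frames):
    """Wrap each frame in a packet with a hypothetical sequence number,
    so that a receiver could tell which frame is missing."""
    return [{"seq": n, "payload": frame} for n, frame in enumerate(frames)]


signal = list(range(400))  # 50 ms of dummy samples at 8 kHz
packets = make_packets(split_into_frames(signal))
```

With a 20 millisecond frame at 8 kHz, each packet in this sketch carries 160 samples.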
FIG. 3 shows an example of a typical configuration of the voice signal receiving unit 14 shown in FIG. 1. Voice packets received at a packet receiving unit 31 through a packet communication network are stored in a receiving buffer 32, also called a jitter absorbing buffer. For a frame whose packet has been successfully received, the voice packet is extracted from the receiving buffer and decoded into a voice signal in a voice packet decoding unit 33. For a frame whose packet is lost, packet-loss concealment processing is performed to generate a voice signal in a lost signal generating unit 34, and the generated voice signal is output. If pitch period information (the pitch period being the period on the time axis that corresponds to the fundamental frequency of the sound, that is, its reciprocal) is used for packet-loss concealment processing, the output voice signal is stored in an output voice buffer 35, pitch analysis is then performed on the signal in a pitch extracting unit 36, and the obtained value of the pitch period is provided to the lost signal generating unit 34. The signal generated in the lost signal generating unit 34 is output to the output terminal 15 through a selector switch 37. If there is no packet loss, the decoded signal from the voice packet decoding unit 33 is output to the output terminal 15 through the selector switch 37. It should be noted that communication terminals that perform bidirectional voice communication have both a transmitter and a receiver. A well-known typical packet-loss concealment technique is the one described in Non-patent literature 1. The technique in Non-patent literature 1 uses the pitch period of sound for packet-loss concealment.
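The pitch analysis performed on the output voice buffer can be illustrated by the following minimal sketch. It assumes a plain normalized cross-correlation search over a range of candidate lags, which is one common way to estimate a pitch period and is not necessarily the procedure of Non-patent literature 1. The function name, the lag range of 40 to 120 samples, and the 160-sample analysis window are assumed values.

```python
import math


def estimate_pitch_period(history, min_lag=40, max_lag=120, window=160):
    """Return the lag (in samples) at which the most recent `window` samples
    best match the samples one lag earlier (normalized cross-correlation).

    All parameter defaults are assumed values for illustration.
    """
    if len(history) < window + max_lag:
        raise ValueError("not enough history for the requested lag range")
    recent = history[-window:]
    energy_recent = sum(x * x for x in recent)
    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        delayed = history[-window - lag:-lag]  # same window, `lag` samples earlier
        energy_delayed = sum(x * x for x in delayed)
        denom = math.sqrt(energy_recent * energy_delayed)
        if denom == 0.0:
            continue  # silent segment: correlation is undefined
        score = sum(a * b for a, b in zip(recent, delayed)) / denom
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Usage: a pure tone with a period of 50 samples stands in for voiced speech.
tone = [math.sin(2 * math.pi * i / 50) for i in range(400)]
period = estimate_pitch_period(tone, max_lag=90)
```

For a strictly periodic input, every multiple of the true period correlates equally well, so the upper lag limit in this sketch is kept below twice the expected period.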
FIG. 4 shows a typical packet-loss concealment technique which is also used in Non-patent literature 1. FIG. 4 shows processing performed when a packet corresponding to frame n, the current frame at the receiving end, has been lost. It is assumed here that the previous frames (up to frame n−1) of the voice signal have been properly decoded or that the voice signal for a lost packet has been generated through the use of packet-loss concealment. A voice signal waveform of segment 3A, which is one pitch period long and ends at the last sample point of the previous frame n−1, is cut out, and the 1-pitch waveform thus cut out is repeated to fill the segments of frame n (segments 3B-3D).
By repeating the previous 1-pitch waveform to generate the waveform of a packet-loss frame in this way, speech can be reproduced with more natural quality than when all sample points in frame n are simply padded with zero values without any processing.
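The repetition of the last 1-pitch waveform can be illustrated by the following minimal sketch. The function name and argument names are hypothetical, and the pitch period is assumed to have been obtained beforehand by pitch analysis of the preceding output.

```python
def conceal_by_repetition(history, pitch_period, frame_len):
    """Fill a lost frame by repeating the last pitch period of prior output.

    history      -- samples output so far (up to the end of frame n-1)
    pitch_period -- pitch period in samples, assumed already estimated
    frame_len    -- number of samples in the lost frame n
    """
    if not 0 < pitch_period <= len(history):
        raise ValueError("pitch period must be positive and fit in history")
    last_period = history[-pitch_period:]  # corresponds to segment 3A
    # Repeat the cut-out waveform until the frame is full (segments 3B-3D).
    return [last_period[i % pitch_period] for i in range(frame_len)]


# Usage with small numbers so the repetition is easy to see.
frame = conceal_by_repetition(list(range(10)), pitch_period=4, frame_len=10)
```

In this sketch the last repetition is simply truncated when the frame length is not a multiple of the pitch period.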
When a 1-pitch waveform is simply repeated, scratchy noise can be generated at connection points because of discontinuities of the waveforms at those points. Such discontinuities at connection points can be prevented by using the technique shown in FIG. 5. For clarity, the segments of frame n are shown as staggered tiers of cut-out waveforms in FIG. 5. First, a waveform of segment 4A, which is slightly longer than one pitch period, for example 5L/4 (5×L/4), and ends at the last sample point of frame n−1, is cut out, where L is the pitch length. The cut-out waveform is placed at positions 4B, 4C, and 4D, each shifted by one pitch length from the previous one. Because the cut-out waveform is longer than one pitch length, overlapping segments 4AB, 4BC, and 4CD result. These overlapping segments are superposed by applying, for example, the triangle window function shown in FIG. 6. In FIG. 6, the horizontal axis represents time, the vertical axis represents weight, t1 indicates the starting point of an overlapping segment, and t2 indicates the end point of the overlapping segment. For example, in the case of overlapping segment 4BC in FIG. 5, the cut-out waveforms in segments 4B and 4C can be smoothly interconnected by multiplying the waveform of the portion of segment 4B in overlapping segment 4BC by a weighting function W1, multiplying the waveform of the portion of segment 4C in overlapping segment 4BC by a weighting function W2, and then adding the products together. The details of such superposition are also described in Non-patent literature 1.
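The superposition of the overlapping segments can be illustrated by the following minimal sketch. It assumes that W1 is a linearly falling ramp applied to the tail of the earlier copy and W2 is a linearly rising ramp applied to the head of the later copy, with W1 + W2 = 1 at every sample; the exact shapes in FIG. 6 and the procedure of Non-patent literature 1 may differ. The function name is hypothetical, and the sketch also assumes that the last L/4 samples of frame n−1 have not yet been played out, so that overlapping segment 4AB can still be smoothed.

```python
def conceal_with_overlap_add(history, pitch_period, frame_len):
    """Synthesize a lost frame from overlapping copies of the last 5L/4 samples.

    Returns (smoothed_tail, frame): smoothed_tail replaces the last L/4
    samples of frame n-1 (overlap 4AB); frame holds the samples of frame n.
    """
    L = pitch_period
    ov = L // 4  # overlap length, L/4
    if L <= 0 or len(history) < L + ov:
        raise ValueError("need a positive pitch period and 5L/4 of history")
    seg = history[-(L + ov):]  # segment 4A, length 5L/4

    def weight(j):
        if j < ov:  # head of a copy: rising ramp (assumed W2)
            return (j + 1) / (ov + 1)
        if j >= L:  # tail of a copy: falling ramp (assumed W1)
            return 1.0 - (j - L + 1) / (ov + 1)
        return 1.0  # non-overlapping middle part

    total = ov + frame_len  # buffer begins at the start of overlap 4AB
    out = [0.0] * total
    pos = -L  # the original segment 4A; copies 4B, 4C, ... follow every L
    while pos < total:
        for j, s in enumerate(seg):
            k = pos + j
            if 0 <= k < total:
                out[k] += weight(j) * s
        pos += L
    return out[:ov], out[ov:]


# Usage: a sawtooth with a period of 40 samples stands in for voiced speech.
history = [float(i % 40) for i in range(200)]
tail, frame = conceal_with_overlap_add(history, pitch_period=40, frame_len=160)
```

Because the two weights sum to one at every sample of an overlap, a perfectly periodic input is continued unchanged, and only deviations from periodicity are cross-faded.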
It is said that the quality of sound generated by using the technique described in Non-patent literature 1 in a communication environment in which packet losses occur is generally good. However, if a packet loss occurs near the boundary between a consonant and a vowel in speech, uncomfortable noise can be generated (a first issue). Furthermore, if the packets of multiple consecutive frames are lost (referred to as a burst loss), for example if the packets of two or more consecutive frames, each 20 milliseconds long, are lost, or if the packet of one frame in a voice encoding format with a long frame length, for example a frame length of 40 or 60 milliseconds, is lost, a noisy buzzing sound or unnatural sound is generated (a second issue).
The first issue arises because the method described in Non-patent literature 1 generates the sound of a lost frame by creating a waveform having the same characteristics as the voice waveform of the immediately preceding frame. That is, if the frame nearer to the vowel at a boundary between a consonant and a following vowel is lost, a sound waveform having the same characteristics as the consonant is generated even though the lost frame actually belongs to the vowel. Similar noise can be generated when the sound changes from a vowel to silence or to a consonant.
The second issue can arise even when a packet loss occurs in a segment that is not near the boundary between a consonant and a vowel. It is caused by the fact that the sound generated for a packet-loss frame is reused (self-recursively) to generate a sound waveform having the same characteristics in the adjacent, subsequent lost frame, so that a sound waveform with the same characteristics is reproduced consecutively over a period as long as 40 to 60 milliseconds or more. The pitch period and power of actual voice change slightly over time, and when sound with the same characteristics is reproduced consecutively, it is perceived as something different from voice.
To solve the first and second issues, the technique described in Non-patent literature 2 has been proposed. In that technique, side information for the k-th frame is embedded in the (k+1)-th frame beforehand. If the k-th frame has not arrived because of a packet loss, the side information embedded in the (k+1)-th frame is used to conceal the error in the k-th frame.
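The relationship between the k-th frame and the side information carried with the (k+1)-th frame can be illustrated by the following minimal sketch. It is not the embedding method of Non-patent literature 2: for simplicity the side information is carried in a separate, hypothetical packet field rather than being embedded in the frame itself, and the content of the side information (here, only the mean power of the frame) is an assumption.

```python
def attach_side_info(frames, describe):
    """Build packets in which packet k+1 carries side information for frame k.

    `describe` is a caller-supplied function that turns a frame into its
    side information; the packet field names are hypothetical.
    """
    packets = []
    for k, frame in enumerate(frames):
        prev_info = describe(frames[k - 1]) if k > 0 else None
        packets.append({"seq": k, "payload": frame,
                        "side_info_for_prev": prev_info})
    return packets


def side_info_for_lost_frame(received, k):
    """Return the side information for lost frame k if packet k+1 has
    arrived, otherwise None."""
    nxt = received.get(k + 1)
    return None if nxt is None else nxt["side_info_for_prev"]


def mean_power(frame):
    """Assumed side information: the mean power of the frame."""
    return {"power": sum(x * x for x in frame) / len(frame)}


frames = [[1, 1], [2, 2], [3, 3]]
packets = attach_side_info(frames, mean_power)
# Simulate the loss of packet 1; packets 0 and 2 arrive.
received = {p["seq"]: p for p in packets if p["seq"] != 1}
info = side_info_for_lost_frame(received, 1)
```

Because the side information for the k-th frame travels with the (k+1)-th frame, the receiving end has to wait for the (k+1)-th packet before it can conceal the k-th frame in this way.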
Non-patent literature 1: ITU-T Recommendation G.711 Appendix I, “A high quality low-complexity algorithm for packet loss concealment with G.711”, pp. 1-18, 1999.
Non-patent literature 2: Naofumi Aoki, “A packet loss concealment technique for VoIP using steganography based on pitch waveform replication”, IEICE Vol. J86-B, No. 12, pp. 2551-2560, 2003.