Services that use Voice over IP technology (hereinafter simply referred to as audio packet communication) to transmit and receive audio signals are becoming widespread. FIG. 1 shows an outline of such a service. An input audio signal is converted into audio packets in an audio signal transmitting device 5 and sent to a packet communication network 6. An audio signal receiving device 7 identifies and receives the audio packets addressed to it, and decodes them to output speech.
FIG. 2 shows the relationship between an audio data stream to be sent and audio packets. FIG. 2A shows the audio data stream to be transmitted. The audio data stream to be sent typically consists of a PCM digital sample string. The digital audio data stream is divided into equal time units (typically 10 to 20 milliseconds or so) called frames, which are then encoded into audio codes. Information such as a timestamp indicating the time when the audio code is sent out is added to the audio code, which is then sent as an audio packet. Audio packets are an intermittent signal compressed along the time axis as shown in FIG. 2B and the gaps in the intermittent signal are used for other packet transmissions. The interval between transmission timings for sending out packets from the audio signal transmitting device 5 is equivalent to the frame length of the audio data stream. The audio packets are sent to the packet communication network 6 at time intervals of one frame length.
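The framing described above can be illustrated with a minimal Python sketch that splits a PCM byte string into fixed-length frames and attaches a timestamp to each one. The names AudioPacket and packetize, the sequence field, the 8 kHz 16-bit sample format and the pass-through "encoding" are assumptions made for this example only; an actual transmitting device would apply a speech codec to each frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioPacket:
    sequence: int       # hypothetical field: position of the frame in the stream
    timestamp_ms: int   # nominal send-out time of the frame
    payload: bytes      # the "audio code"; here the raw frame, no real codec


def packetize(pcm: bytes, sample_rate_hz: int = 8000,
              bytes_per_sample: int = 2, frame_ms: int = 20) -> List[AudioPacket]:
    """Divide a PCM stream into equal-length frames, one packet per frame."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * bytes_per_sample
    packets = []
    # A trailing partial frame is dropped in this sketch.
    for seq, start in enumerate(range(0, len(pcm) - frame_bytes + 1, frame_bytes)):
        frame = pcm[start:start + frame_bytes]
        # Packets are sent one frame length apart, so the timestamp
        # advances by frame_ms per packet.
        packets.append(AudioPacket(seq, seq * frame_ms, frame))
    return packets


if __name__ == "__main__":
    one_second = bytes(8000 * 2)  # 1 s of silence, 16-bit samples at 8 kHz
    packets = packetize(one_second)
    print(len(packets), packets[1].timestamp_ms, len(packets[0].payload))
```

With a 20-millisecond frame, one second of audio yields 50 packets whose timestamps are spaced one frame length apart.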
The audio signal receiving device 7 receives audio packets which arrive at time intervals of one frame length and decodes the audio packets. One audio packet is decoded into one frame length of audio data stream as shown in FIG. 2C. Thus, the audio signal receiving device 7 can reproduce continuous sound by receiving audio packets at time intervals of one frame length.
A problem is that packet arrival times can vary substantially depending on the conditions of the communication network. As a result, packets may not arrive within the time limit (a time equivalent to one frame length), and discontinuities may occur in the reproduced sound. One known method for solving this problem is to provide a receiving buffer, also known as a jitter absorption buffer, that constantly stores a predetermined number of packets. However, if the number of packets to be stored in the receiving buffer is set to a large value, large packet arrival jitter can be absorbed, but a long delay between reception of a packet and reproduction of the sound, namely communication delay, occurs, which can make two-way voice communication awkward. On the other hand, if the number of packets to be stored in the receiving buffer is set to a small value, the communication delay will be small, but audible discontinuities will be more likely to occur when packet arrival jitter occurs. That is, there is a trade-off between communication delay and the likelihood of audible discontinuities.
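The trade-off can be made concrete with a small Python sketch. It assumes a simplified model that is not taken from any particular device: playback starts once a given number of packets (prefill) has arrived, each subsequent packet is due one frame length after the previous one, and a packet that arrives after its due time counts as one audible discontinuity. The function name simulate_playout and the arrival times are hypothetical.

```python
from typing import List, Tuple


def simulate_playout(arrival_ms: List[int], frame_ms: int,
                     prefill: int) -> Tuple[int, int]:
    """Return (delay_ms, discontinuities) for a fixed receiving-buffer depth.

    arrival_ms[i] is the arrival time of packet i, listed in sequence order.
    """
    if not 1 <= prefill <= len(arrival_ms):
        raise ValueError("prefill must be between 1 and the number of packets")
    # Playback starts when `prefill` packets are in the buffer.
    start = sorted(arrival_ms)[prefill - 1]
    discontinuities = 0
    for i, arrival in enumerate(arrival_ms):
        due = start + i * frame_ms  # time at which packet i must be decoded
        if arrival > due:
            discontinuities += 1
    # Delay between reception of the first packet and its reproduction.
    delay = start - arrival_ms[0]
    return delay, discontinuities


if __name__ == "__main__":
    arrivals = [0, 20, 75, 80, 85, 100]  # a burst of jitter after packet 1
    for depth in (1, 3):
        print(depth, simulate_playout(arrivals, frame_ms=20, prefill=depth))
```

In this example a depth of one packet gives no added delay but three late packets, while a depth of three packets removes the discontinuities at the cost of a 75-millisecond delay.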
One known method for solving this problem is to dynamically control the number of packets to be stored in the receiving buffer. In this method, the number of packets to be stored in the receiving buffer is set to a small value at the beginning of communication in order to reduce communication delay. When the packets stored in the buffer run out during the communication, the reproduction of sound is temporarily stopped and the number of packets to be stored in the receiving buffer is increased by a given number, which reduces the likelihood of audible discontinuities in the subsequent voice communication.
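A rough Python sketch of this dynamic control follows. It is an illustration under stated assumptions rather than a description of any specific product: packets are assumed to arrive in sequence order, and the function name adaptive_playout and the parameters initial_depth and step are hypothetical.

```python
from typing import List, Tuple


def adaptive_playout(arrival_ms: List[int], frame_ms: int,
                     initial_depth: int = 1, step: int = 1) -> Tuple[int, int]:
    """Return (final_depth, stalls) for a buffer that grows after each underrun.

    arrival_ms[i] is the arrival time of packet i; arrivals are assumed
    to be in sequence order (non-decreasing).
    """
    n = len(arrival_ms)
    if n == 0 or initial_depth < 1 or step < 1:
        raise ValueError("need at least one packet and positive depth/step")
    depth = initial_depth
    stalls = 0
    i = 0
    # Start playback once `depth` packets have been received.
    clock = arrival_ms[min(depth, n) - 1]
    while i < n:
        if arrival_ms[i] <= clock:
            # Packet i is available: reproduce one frame of sound.
            i += 1
            clock += frame_ms
        else:
            # Buffer ran out: stop reproduction, raise the target depth,
            # and wait until that many packets are stored again.
            stalls += 1
            depth += step
            last_needed = min(i + depth, n) - 1
            clock = max(clock, arrival_ms[last_needed])
    return depth, stalls


if __name__ == "__main__":
    print(adaptive_playout([0, 20, 75, 80, 85, 100], frame_ms=20))
```

For the arrival pattern shown, reproduction stalls once, the target depth grows from one packet to two, and the remaining packets are then played without further interruption.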
It is said that, when human utterance is divided into time units of 10 to 20 milliseconds, several tens of percent of the time of normal utterance consists of non-voice segments (background noise and silence segments). Jitter can therefore be addressed as follows. When the number of packets in the receiving buffer exceeds a first threshold, a non-voice segment in the decoded audio signal is removed to shorten the frame length, thereby advancing access to the next packet in the receiving buffer to be used for sound reproduction. When the number of packets in the receiving buffer falls below a second threshold, which is smaller than the first threshold, a non-voice segment in the decoded audio signal is expanded to delay access to the next packet in the receiving buffer to be used for sound reproduction. However, this method cannot control the receiving buffer if non-voice segments occur very infrequently or if no non-voice segment occurs over a long period of time.
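The two-threshold rule described above can be sketched in Python as a small decision function. The function name playout_action, its parameter names and the returned labels are hypothetical and chosen for this example; how a frame is classified as a non-voice segment is outside the scope of the sketch.

```python
def playout_action(buffered: int, first_threshold: int,
                   second_threshold: int, is_non_voice: bool) -> str:
    """Decide how to treat the current decoded frame.

    Returns "shorten" (remove a non-voice segment), "expand" (lengthen a
    non-voice segment) or "play" (reproduce the frame unchanged).
    """
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    if not is_non_voice:
        # Voice segments are never modified by this method, which is why
        # it loses control when non-voice segments are rare.
        return "play"
    if buffered > first_threshold:
        return "shorten"   # consume the buffered packets faster
    if buffered < second_threshold:
        return "expand"    # consume the buffered packets more slowly
    return "play"


if __name__ == "__main__":
    print(playout_action(8, 6, 2, is_non_voice=True))
    print(playout_action(1, 6, 2, is_non_voice=True))
    print(playout_action(8, 6, 2, is_non_voice=False))
```

The last call shows the limitation noted above: even with the buffer above the first threshold, nothing can be done while the decoded frames are voice segments.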
Non-patent literature 1 describes that the time length of speech can be increased or decreased without significant degradation of perceived audio quality by inserting or removing waveforms, in units of pitch waveforms, in voice segments (voiced sound segments and unvoiced sound segments). Patent literature 1 addresses the problem that the receiving buffer cannot be adequately controlled by using non-voice segments alone: interpolated pitch-period audio waveforms are added in a voice segment when the number of packets stored in the receiving buffer falls below a lower limit, and some of the pitch-period audio waveforms in a voice segment are removed when the number of packets exceeds an upper limit. Although inserting or removing pitch waveforms reduces the degradation of audio quality, the quality of the reproduced sound can still be degraded to an undesirable extent, because the insertion or removal of pitch-period waveforms is performed on a series of consecutive frames until the number of packets stored in the buffer returns to a value between the lower and upper thresholds. Moreover, because the upper and lower thresholds are fixed, sudden changes in jitter cannot be managed and consequently packet loss may occur.
Patent literature 1: Japanese Patent Application Laid-Open No. 2003-050598
Non-patent literature 1: Morita and Itakura, “Time-Scale Modification Algorithm for Speech by Use of Pointer Interval Control OverLap and Add (PICOLA) and Its Evaluation”, Discourse Collected Papers of Acoustical Society of Japan, 1-4-14, Oct., 1986