In recent years, products using VoIP (Voice over IP) technology, which transfers speech data over an IP (Internet Protocol) network to enable communication, have been put into practical use. In VoIP, a transmission side communication terminal apparatus A/D (analog/digital) converts input speech to generate digital data, packetizes the generated digital data in units of a predetermined amount of data, and transmits the resulting packets to a network such as the Internet. The transmission side communication terminal apparatus may also compress and encode the digital data before packetizing it. Each piece of digital data divided for packetizing is referred to as a frame, and a frame to which header information indicating the type, destination, and the like of the data is attached is referred to as a packet. The receiving side communication terminal apparatus, on the other hand, cannot predict the arrival timing or order of packets. To absorb fluctuations in timing and changes in order, it therefore temporarily stores received packets in a buffer, extracts packets from this buffer in a predetermined cycle, and plays back speech by carrying out D/A (digital/analog) conversion or the like.
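The relationship between frames, packets, and the receiving side buffer described above can be sketched as follows. This is only a minimal illustration: the `Packet` class, its `seq` header field, and the `ReceiveBuffer` class are hypothetical names chosen for the example, and a real terminal would extract packets on a timer rather than on demand.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Packet:
    seq: int                                   # hypothetical header field: frame position in the stream
    payload: bytes = field(compare=False, default=b"")  # the frame itself


def packetize(data: bytes, frame_size: int) -> list:
    """Divide digital data into frames of frame_size bytes and attach a header to each."""
    return [
        Packet(seq=i, payload=data[off:off + frame_size])
        for i, off in enumerate(range(0, len(data), frame_size))
    ]


class ReceiveBuffer:
    """Temporarily stores packets so that changes in arrival order are absorbed."""

    def __init__(self):
        self._heap = []

    def store(self, packet: Packet) -> None:
        heapq.heappush(self._heap, packet)

    def extract(self):
        """Return the stored packet with the smallest seq, or None if the buffer is empty."""
        return heapq.heappop(self._heap) if self._heap else None


if __name__ == "__main__":
    packets = packetize(b"abcdefgh", 3)
    buf = ReceiveBuffer()
    for p in (packets[2], packets[0], packets[1]):   # simulate out-of-order arrival
        buf.store(p)
    print([buf.extract().payload for _ in range(3)])
```

An empty result from `extract()` corresponds to the case where the buffer has run dry and playback would be interrupted.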
Moreover, RTP (Real-time Transport Protocol) may be adopted in VoIP. A packet based on RTP has a time stamp field in its header. The receiving side communication terminal apparatus determines the order and timing for playing back frames from the time stamps in the received packets.
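As an illustration of how a receiver might read the time stamp field, the following sketch parses the 12-byte fixed RTP header (version, payload type, sequence number, time stamp, SSRC) and orders packets by time stamp. The function names are hypothetical, and the sketch assumes no CSRC list or header extension and ignores time stamp wraparound.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed part of an RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "marker": bool(b1 >> 7),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def playback_order(packets: list) -> list:
    """Sort received packets by time stamp (assumption: no 32-bit wraparound)."""
    return sorted(packets, key=lambda p: parse_rtp_header(p)["timestamp"])


if __name__ == "__main__":
    late = struct.pack("!BBHII", 0x80, 0, 2, 320, 1) + b"B"
    early = struct.pack("!BBHII", 0x80, 0, 1, 160, 1) + b"A"
    for p in playback_order([late, early]):
        print(parse_rtp_header(p)["timestamp"], p[12:])
```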
Here, the IP network has no single clock that all nodes use as a reference, so each node operates using its own internal clock as a reference. Internal clock generating apparatuses differ from one another individually, and their operation is influenced by changes in the operating environment such as temperature, so the internal clocks of different nodes are rarely in complete synchronization. Therefore, unless the internal clock of the transmission side communication terminal apparatus is synchronized with the internal clock of the receiving side communication terminal apparatus, instantaneous deterioration of speech quality, such as a break or a sound skip, becomes more likely to occur at the receiving side communication terminal apparatus as the duration of a call increases. For example, assume that the receiving side communication terminal apparatus periodically extracts a fixed amount of received packets from a buffer. When the frequency of the internal clock of the transmission side communication terminal apparatus is lower than the frequency of the internal clock of the receiving side communication terminal apparatus, the amount of packets stored in the buffer decreases gradually, so the buffer eventually becomes empty and speech playback is interrupted. Conversely, when the frequency of the internal clock of the transmission side communication terminal apparatus is higher than the frequency of the internal clock of the receiving side communication terminal apparatus, the amount of packets stored in the buffer increases gradually.
Accordingly, packets will eventually overflow from the buffer and be discarded, so instantaneous sound skips will occur sporadically. If the buffer capacity were unlimited, sound skips would not occur, but the delay in the played back speech would increase gradually as the call continues, and the call would lose its real-time characteristics.
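The effect of a clock frequency mismatch on the buffer can be illustrated with a small simulation. This is a simplified sketch, not a model of any particular apparatus: the function name, the one-second step, and the sample rates in the usage example (a nominal 8000 Hz with an offset of 1 Hz) are all assumptions made for illustration.

```python
def simulate_buffer(sender_hz: int, receiver_hz: int, seconds: int,
                    start_level: int, capacity: int):
    """Track buffer occupancy (in samples) when sender and receiver clocks differ.

    Each simulated second, sender_hz samples arrive and receiver_hz samples are
    extracted. Returns (final_level, underrun_samples, discarded_samples).
    """
    level = start_level
    underrun = discarded = 0
    for _ in range(seconds):
        level += sender_hz - receiver_hz
        if level < 0:                 # buffer ran empty: playback is interrupted
            underrun += -level
            level = 0
        elif level > capacity:        # buffer overflowed: excess data is discarded
            discarded += level - capacity
            level = capacity
    return level, underrun, discarded


if __name__ == "__main__":
    # Sender clock slower than receiver clock: the buffer drains.
    print(simulate_buffer(7999, 8000, seconds=100, start_level=50, capacity=1000))
    # Sender clock faster than receiver clock: the buffer fills and overflows.
    print(simulate_buffer(8001, 8000, seconds=100, start_level=50, capacity=100))
```

Even with an offset this small, the buffer in the sketch empties or overflows within the first minute, after which every further second of the call loses data.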
For this reason, techniques have conventionally been devised to prevent breaks or sound skips in speech playback at a receiving side communication terminal apparatus that receives stream data via a network. Examples include a technique that dynamically adjusts the frequency of the internal clock of the receiving side communication terminal apparatus using the internal clock of the transmission side communication terminal apparatus as a reference, and a technique in which the receiving side communication terminal apparatus detects the difference from the internal clock of the transmission side communication terminal apparatus and either interpolates data corresponding to the detected difference into the received packets or decimates data corresponding to the detected difference from the received packets (for example, see Patent Document 1). When these two types of techniques are compared, the technique of interpolating or decimating data according to Patent Document 1 requires a smaller circuit scale and is therefore advantageous in terms of the portability, manufacturing cost, and the like of the receiving side communication terminal apparatus.
FIG. 1 is a block diagram showing a configuration of a communication system described in Patent Document 1. This communication system has transmission apparatus 10, input section 11, transmission side amplifier 12, receiving apparatus 20, receiving side amplifier 31, output section 32 and network 50. Transmission apparatus 10 has A/D converter 13, input buffer 14, encoding section 15, transmission buffer 16 and transmission section 17. Input section 11 converts inputted speech into an analog speech signal, and inputs the converted analog speech signal to amplifier 12. Amplifier 12 amplifies the analog speech signal inputted from input section 11, and inputs the amplified analog speech signal to A/D converter 13 in transmission apparatus 10. A/D converter 13 converts the analog speech signal inputted from amplifier 12 into digital speech data, and inputs the converted digital speech data to input buffer 14. Input buffer 14 stores the digital speech data inputted from A/D converter 13, and periodically inputs the stored digital speech data to encoding section 15. Encoding section 15 converts the digital speech data inputted from input buffer 14 into compressed speech encoding information, and inputs the resulting compressed speech encoding information to transmission buffer 16. Transmission buffer 16 stores the compressed speech encoding information inputted from encoding section 15, and periodically inputs the stored compressed speech encoding information to transmission section 17. Transmission section 17 packetizes the compressed speech encoding information inputted from transmission buffer 16, and sequentially sends out the resulting packets onto network 50.
Receiving apparatus 20 has receiving section 21, receiving buffer 22, decoding section 23, playback speed judgment section 24, speed buffer 25, playback speed control section 26, output buffer 27 and D/A converter 28. Receiving section 21 receives the compressed speech encoding information sent out from transmission apparatus 10 via network 50, and sequentially inputs the received compressed speech encoding information to receiving buffer 22. Receiving buffer 22 stores the compressed speech encoding information inputted from receiving section 21, and periodically inputs the stored compressed speech encoding information to decoding section 23 using the internal clock in receiving apparatus 20 as a reference. Decoding section 23 decompresses the compressed speech encoding information, which is periodically inputted from receiving buffer 22 using the internal clock as a reference, into digital speech data, and inputs this digital speech data to output buffer 27. Playback speed judgment section 24 monitors the stored amount of the compressed speech encoding information in receiving buffer 22, determines the playback speed of speech according to changes in the stored amount, and reports the determined playback speed to speed buffer 25. Speed buffer 25 stores the playback speeds reported from playback speed judgment section 24 in time series, and sequentially reports the stored playback speeds to playback speed control section 26. Playback speed control section 26 controls output buffer 27 so that the amount of digital speech data inputted to D/A converter 28 per unit of time, that is, the playback speed of the speech in output section 32, is equal to the playback speed reported from speed buffer 25.
Output buffer 27 stores the digital speech data inputted from decoding section 23, interpolates or decimates speech sample data with respect to the stored digital speech data under the control of playback speed control section 26, and then inputs this digital speech data to D/A converter 28. This interpolation or decimation of speech sample data adjusts the playback speed of speech in output section 32. D/A converter 28 converts the digital speech data inputted from output buffer 27 into an analog speech signal, and inputs the converted analog speech signal to amplifier 31. Amplifier 31 amplifies the analog speech signal inputted from D/A converter 28, and inputs the amplified analog speech signal to output section 32. Output section 32 outputs the analog speech signal inputted from amplifier 31 as speech.
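The interplay between judging a playback speed from the stored amount and interpolating or decimating speech samples can be sketched as below. This is not the method of Patent Document 1, whose details are not given here: the function names, the 1% step, the rule used to pick a speed, and the use of linear interpolation are all assumptions. The sketch also adopts the convention that a speed above 1 shortens a block of samples (decimation) and a speed below 1 lengthens it (interpolation).

```python
def judge_playback_speed(prev_level: int, level: int, step: float = 0.01) -> float:
    """Hypothetical rule: play faster if the stored amount grew, slower if it shrank."""
    if level > prev_level:
        return 1.0 + step
    if level < prev_level:
        return 1.0 - step
    return 1.0


def adjust_samples(samples: list, speed: float) -> list:
    """Resample a block to about len(samples) / speed samples by linear interpolation."""
    n_in = len(samples)
    n_out = round(n_in / speed)
    if n_in < 2 or n_out < 2 or n_out == n_in:
        return list(samples)          # nothing to interpolate or decimate
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)   # position in the input block
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(round(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out


if __name__ == "__main__":
    block = list(range(0, 1000, 10))                 # 100 dummy samples
    faster = adjust_samples(block, judge_playback_speed(10, 12))
    slower = adjust_samples(block, judge_playback_speed(12, 10))
    print(len(block), len(faster), len(slower))
```

Keeping the first and last samples of each block fixed, as this sketch does, avoids a jump at block boundaries, although a practical implementation would need a more careful resampling method to preserve speech quality.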
Moreover, receiving apparatus 20 detects a difference in the sound volume component values between given digital speech data and the immediately preceding digital speech data, and adjusts the speech playback speed according to the increase and decrease in the stored amount of digital speech data in receiving buffer 22 only when the difference is small. That is, receiving apparatus 20 does not adjust the speech playback speed when the sound volume of speech to be played back is large, but adjusts the speech playback speed to suppress deterioration of speech quality only when the sound volume of speech to be played back is small.
Patent Document 1: Japanese Patent Application Laid-Open No. 2002-330180
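The gating of playback speed adjustment on the sound volume difference between consecutive pieces of digital speech data might look like the following sketch. The volume measure (mean absolute amplitude), the threshold value, and the function names are assumptions made for illustration and are not taken from Patent Document 1.

```python
def frame_volume(samples: list) -> float:
    """Hypothetical volume measure: mean absolute amplitude of a frame."""
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def select_speed(prev_frame: list, frame: list, requested_speed: float,
                 threshold: float = 100.0) -> float:
    """Apply the requested speed only when the volume difference is small."""
    diff = abs(frame_volume(frame) - frame_volume(prev_frame))
    return requested_speed if diff < threshold else 1.0


if __name__ == "__main__":
    quiet_prev, quiet_cur = [10, -10, 8], [12, -12, 9]
    loud_cur = [1000, -1000, 900]
    print(select_speed(quiet_prev, quiet_cur, 1.01))  # small difference: adjustment allowed
    print(select_speed(quiet_prev, loud_cur, 1.01))   # large difference: speed left unchanged
```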