Conventionally, when receiving and decoding content from a server on the encoder side, a content receiving apparatus separates and decodes the video packets and audio packets that compose the content, and outputs video frames and audio frames based on the video time stamps attached to the video packets and the audio time stamps attached to the audio packets, so that the video output timing and the audio output timing match (that is, lip-sync is achieved) (for example, refer to Patent Reference 1).
Patent Reference 1: Japanese Patent Laid-Open No. 8-280008
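The time-stamp-based output described above can be illustrated with a minimal sketch. It is not taken from Patent Reference 1; the names `Frame` and `schedule_output`, the 90 kHz tick unit, and the frame intervals (3000 ticks for video, 1920 ticks for audio) are hypothetical values chosen only for illustration. The sketch assumes an ideal decoder-side system time clock (STC) that is perfectly synchronized with the encoder-side reference clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    kind: str       # "video" or "audio"
    timestamp: int  # time stamp in ticks of the encoder-side reference clock


def schedule_output(frames, stc_start, stc_end, step):
    """Advance a hypothetical decoder-side STC from stc_start to stc_end
    (inclusive) in increments of `step`, and output each frame at the
    first STC value that reaches or passes its time stamp."""
    pending = sorted(frames, key=lambda f: f.timestamp)
    log = []
    stc = stc_start
    while stc <= stc_end:
        # Output every frame whose time stamp has been reached.
        while pending and pending[0].timestamp <= stc:
            log.append((stc, pending.pop(0)))
        stc += step
    return log


# Assumed example values: 90 kHz ticks, video every 3000 ticks,
# audio every 1920 ticks.
frames = [
    Frame("video", 0), Frame("audio", 0),
    Frame("audio", 1920), Frame("video", 3000), Frame("audio", 3840),
]
log = schedule_output(frames, 0, 4000, 10)
for stc, frame in log:
    print(stc, frame.kind, frame.timestamp)
```

In this idealized case each frame is output exactly when the STC equals its time stamp, so video and audio that carry the same time stamp are output together.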
However, in a content receiving apparatus with such a configuration, the system time clock of the decoder side and the reference clock of the encoder side may not be synchronized with each other. In addition, the system time clock of the decoder side and the reference clock of the encoder side may have slightly different clock frequencies due to clock jitter of the system time clock.
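The effect of a slight difference in clock frequency can be estimated with simple arithmetic. The following sketch assumes a constant frequency offset expressed in parts per million (ppm); this is a simplification, since jitter is not constant in practice, and the function name `drift_ms` and the 50 ppm figure are hypothetical.

```python
def drift_ms(offset_ppm, elapsed_s):
    """Timing error in milliseconds accumulated after elapsed_s seconds
    when the decoder-side clock runs offset_ppm parts per million faster
    (positive) or slower (negative) than the encoder-side clock.
    Assumes the offset stays constant over the whole interval."""
    return offset_ppm * 1e-6 * elapsed_s * 1000


# Assumed example: a 50 ppm offset sustained for one hour.
print(round(drift_ms(50, 3600), 3))
```

Under these assumptions the two clocks diverge by roughly 180 ms over one hour, which shows how even a small frequency difference accumulates into a noticeable timing error when it is left uncorrected.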
Further, in the content receiving apparatus, a video frame and an audio frame have different data lengths. Therefore, when the system time clock of the decoder side and the reference clock of the encoder side are not synchronized with each other, the video output timing and the audio output timing do not match even if video frames and audio frames are output based on the video time stamps and the audio time stamps, which causes lip-sync errors.