Recently, thanks to the development of digital technologies, data representing content such as video (moving pictures or still pictures) or music can be encoded and stored as an encoded data stream on a storage medium such as an optical disk or a hard disk. According to an MPEG standard such as ISO 11172 or ISO 13818, for example, audio is encoded as an audio encoded stream and video is encoded as a video encoded stream. Thereafter, data packets storing the respective encoded data are arranged time-sequentially and multiplexed together, thereby making up an encoded data stream. Such multiplexing processing to make an encoded stream is called “system encoding”. A system-encoded multiplexed data stream (i.e., a system stream) is transmitted along a single transmission line on a data packet basis, and then processed by a player. As a result, video and audio are played back.
Portions (a) through (d) of FIG. 1 show the data structure of a data stream 10. A player sequentially breaks down the data stream 10 shown in portion (a) of FIG. 1 into the data structures shown in portions (b) and (c) and outputs video and audio in the form shown in portion (d).
Portion (a) of FIG. 1 shows the data structure of the data stream 10, which may be an MPEG-2 transport stream, for example.
The data stream 10 is made up of video packets Vn (where n=1, 2, . . . ) and audio packets An (where n=1, 2, . . . ) that are multiplexed together. Each of those packets is comprised of a packet header and a payload that follows the packet header. Video-related data is stored in the payload of a video packet, while audio-related data is stored in the payload of an audio packet.
Portion (b) of FIG. 1 shows the data structure of a packetized elementary stream (PES) 11. The PES 11 is made by collecting the payload data of respective packets that form the data stream 10. The PES 11 is composed of a plurality of PES packets, each of which is comprised of a PES header and a PES payload.
Portion (c) of FIG. 1 shows the format of a video/audio elementary stream (ES). The video ES 12v includes a plurality of data units, each consisting of a picture header, picture data and a presentation time stamp VPTS defining the presentation time of the picture. Each set of picture data represents a single frame/field of picture either by itself or in combination with the picture data to be decoded earlier and/or later than itself. Likewise, the audio ES 12a also includes a plurality of data units, each consisting of a header, audio frame data and a presentation time stamp APTS defining the output timing of the audio frame. The presentation time stamp APTS, VPTS is data of 33 bits according to the MPEG-2 standard and stored in an area (Presentation_Time_Stamp) of the header (i.e., PES-H shown in portion (b) of FIG. 1) of the PES packet.
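The 33-bit time stamp described above can be illustrated with a minimal sketch. It assumes the usual MPEG-2 PES layout, in which the time stamp occupies five bytes of the PES header as groups of 3, 15 and 15 bits, each group followed by a marker bit, and counts ticks of a 90 kHz clock. The function names `decode_pts` and `encode_pts` are hypothetical and are not taken from this description.

```python
PTS_CLOCK_HZ = 90_000  # assumption: time stamps count ticks of a 90 kHz clock


def decode_pts(field: bytes) -> int:
    """Extract the 33-bit time stamp from a 5-byte PES header field."""
    if len(field) != 5:
        raise ValueError("time stamp field must be exactly 5 bytes")
    # Each of the three bit groups is terminated by a marker bit set to 1.
    if not (field[0] & 0x01 and field[2] & 0x01 and field[4] & 0x01):
        raise ValueError("marker bit missing")
    pts = (field[0] >> 1) & 0x07          # bits 32..30
    pts = (pts << 8) | field[1]           # bits 29..22
    pts = (pts << 7) | (field[2] >> 1)    # bits 21..15
    pts = (pts << 8) | field[3]           # bits 14..7
    pts = (pts << 7) | (field[4] >> 1)    # bits 6..0
    return pts


def encode_pts(pts: int) -> bytes:
    """Pack a 33-bit time stamp into the 5-byte field (prefix bits '0010')."""
    if not 0 <= pts < 1 << 33:
        raise ValueError("time stamp must fit in 33 bits")
    return bytes([
        0x20 | (((pts >> 30) & 0x07) << 1) | 0x01,
        (pts >> 22) & 0xFF,
        (((pts >> 15) & 0x7F) << 1) | 0x01,
        (pts >> 7) & 0xFF,
        ((pts & 0x7F) << 1) | 0x01,
    ])


if __name__ == "__main__":
    one_second = decode_pts(bytes([0x21, 0x00, 0x05, 0xBF, 0x21]))
    print(one_second / PTS_CLOCK_HZ, "seconds")
```

Because the field is only 33 bits wide, the value wraps around after roughly 26.5 hours at 90 kHz, so a time stamp value on its own does not identify which stream or interval it came from.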
Portion (d) of FIG. 1 shows the video pictures and audio frames to be output. Each of the video pictures 13-1 and 13-2 is a single picture and represented by the picture data stored in its associated video ES 12v. The presentation time of each picture is designated by the presentation time stamp VPTS in its associated video ES 12v. By switching pictures to present in accordance with that information, moving pictures are presented on the screen of a video player. The output timing of each audio frame is designated by the presentation time stamp APTS in its associated audio ES 12a. By outputting each audio frame in accordance with that information, audio is output through a loudspeaker.
FIG. 2 shows the arrangement of functional blocks in a conventional player 120 that can play back the data stream 10 shown in portion (a) of FIG. 1. The player 120 acquires respective packets of the data stream 10, decodes them into the video and audio elementary streams, and then outputs the reproduced video pictures and audio frames.
Now consider what processing needs to be done by the player 120 to read two data streams No. 1 and No. 2 back to back and to play back the video pictures and audio frames of each data stream. Each of these data streams has the data structure shown in portion (a) of FIG. 1. When a stream reading section 1201 reads these data streams back to back, a single data stream is transmitted into the player 120. Thus, in the following description, a data portion of this single data stream corresponding to Data Stream No. 1 will be referred to herein as a “first interval”, while another data portion thereof corresponding to Data Stream No. 2 will be referred to herein as a “second interval”. Also, the point where the stream to play switches from one to the other will be referred to herein as a “boundary”. The boundary is the end point of the first interval and the start point of the second interval.
In a data stream, audio and video packets are multiplexed together. The audio and video packets to play back at the same time are arranged in series and transmitted as a data stream. Accordingly, if reading a data stream is stopped, then just the audio or the video may be present even though the audio and video should be played back synchronously with each other. As a result, one of the audio and video may have relatively short playback duration and the other relatively long playback duration. This phenomenon will occur in a portion of the boundary near the end point of the first interval described above. If such a data stream is decoded, then the video may have been played fully but the audio may be partially missing, or the audio may have been reproduced fully but the video may be partially missing, in the vicinity of the end point of the first interval (e.g., one second before the presentation end time of the first interval). In addition, since reading is started halfway even at the start point of the second interval, audio may be missing for a while after the video has started being played back or video may be missing for a while after the audio has started being reproduced.
Particularly if the video and audio of the first and second intervals are played back continuously, then audio and video, which belong to mutually different intervals before and after the boundary and which should not be played back synchronously with each other, happen to be played back at the same time. That is why the player 120 inserts a dummy packet in switching the objects to read. FIG. 3(a) shows a dummy packet 1304 inserted between the first and second intervals. A dummy packet inserting section 1202 inserts the dummy packet 1304 into the end of a data stream 1302 and then combines a data stream 1303 with the data stream 1302. In this manner, a data stream 1301, which can be divided into the first and second intervals at the dummy packet 1304, can be obtained.
The data stream 1302 for the first interval, dummy packet 1304, and data stream 1303 for the second interval are continuously supplied to a stream splitting section 1203. On receiving the data stream 1302 for the first interval, the stream splitting section 1203 separates audio packets (such as A11) and video packets (such as V11, V12, V13) from the stream 1302 and then sequentially stores them in a first audio input buffer 1205 and a first video input buffer 1212 while decoding them to the audio ES and video ES (i.e., while performing system decoding).
When the stream splitting section 1203 detects the dummy packet 1304, a first switch 1204 is turned, thereby connecting the stream splitting section 1203 to a second audio input buffer 1206. At the same time, a second switch 1211 is also turned, thereby connecting the stream splitting section 1203 to a second video input buffer 1213.
Thereafter, the stream splitting section 1203 separates audio packets (such as A21) and video packets (such as V21, V22, V23) from the data stream 1303 for the second interval and then sequentially stores them in the second audio input buffer 1206 and the second video input buffer 1213 while decoding them to the audio ES and video ES (i.e., while performing system decoding).
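The buffer switching triggered by the dummy packet can be pictured with the following minimal sketch. It is a simplified model and not the actual implementation of the player 120: packets are represented as hypothetical `(kind, name)` tuples, system decoding is omitted, and a single index `selected` stands in for the first switch 1204 and the second switch 1211, which are described as turning at the same time.

```python
from dataclasses import dataclass, field


@dataclass
class StreamSplitter:
    """Toy model of the stream splitting section 1203 and its input buffers."""
    # Index 0 models buffers 1205/1212, index 1 models buffers 1206/1213.
    audio_buffers: list = field(default_factory=lambda: [[], []])
    video_buffers: list = field(default_factory=lambda: [[], []])
    selected: int = 0  # models switches 1204 and 1211 turning together

    def feed(self, packet):
        kind, name = packet
        if kind == "dummy":
            # A dummy packet marks a boundary: route later packets elsewhere.
            self.selected ^= 1
        elif kind == "audio":
            self.audio_buffers[self.selected].append(name)
        elif kind == "video":
            self.video_buffers[self.selected].append(name)
        else:
            raise ValueError(f"unknown packet kind: {kind!r}")


def split_stream(packets):
    """Feed every packet of a combined stream through a fresh splitter."""
    splitter = StreamSplitter()
    for packet in packets:
        splitter.feed(packet)
    return splitter


if __name__ == "__main__":
    combined = [
        ("video", "V11"), ("audio", "A11"), ("video", "V12"), ("video", "V13"),
        ("dummy", "1304"),
        ("video", "V21"), ("audio", "A21"), ("video", "V22"), ("video", "V23"),
    ]
    result = split_stream(combined)
    print(result.audio_buffers)
    print(result.video_buffers)
```

In this model, everything before the dummy packet lands in the first pair of buffers and everything after it lands in the second pair, so the decoders can later tell the two intervals apart by the buffer the data came from.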
An audio decoding section 1208 reads the audio ES from the first audio input buffer 1205 by way of a third switch 1207, decodes it (i.e., performs elementary decoding), and then sends out resultant audio frame data to an audio output buffer 1209. An audio output section 1210 reads out the decoded audio frame data from the audio output buffer 1209 and outputs it.
Meanwhile, a video decoding section 1215 reads the video stream from the first video input buffer 1212 by way of a fourth switch 1214, decodes it (i.e., performs elementary decoding), and then sends out resultant video picture data to a video output buffer 1216. A video output section 1217 reads out the decoded video picture data from the video output buffer 1216 and outputs it.
The audio decoding section 1208 and video decoding section 1215 are controlled by an AV synchronization control section 1218 so as to start and stop decoding at designated timings. The audio output section 1210 and video output section 1217 are also controlled by the AV synchronization control section 1218 so as to start and stop outputting at designated timings.
When the respective video and audio packets have been read from the first interval, the third and fourth switches 1207 and 1214 are turned so as to connect the second audio input buffer 1206 to the audio decoding section 1208 and the second video input buffer 1213 to the video decoding section 1215, respectively. Thereafter, the same decoding and output processing is carried out just as described above.
FIG. 3(b) shows the timing relation between the respective presentation times of the audio and video streams 1305 and 1306 for the first interval and the audio and video streams 1307 and 1308 for the second interval. Each of these streams is supposed to be the elementary stream (ES) shown in portion (c) of FIG. 1. The presentation times of the audio frames and video pictures forming these streams are defined by the presentation time stamps APTS and VPTS as shown in portion (c) of FIG. 1.
As can be seen from FIG. 3(b), the presentation end time Ta of the audio stream 1305 does not agree with the presentation end time Tb of the video stream 1306 in the first interval. It can also be seen that the presentation start time Tc of the audio stream 1307 does not agree with the presentation start time Td of the video stream 1308 in the second interval, either.
A player that can play back a moving picture continuously before and after a skip point is disclosed in Japanese Patent Application Laid-Open Publication No. 2000-36941, for example. This player will be referred to herein as a “first conventional example”. Hereinafter, it will be described how to play back the video streams 1306 and 1308 shown in FIG. 3(b) continuously by using such a player.
As shown in FIG. 3(b), in the interval between the times Ta and Tb just before the boundary, the audio stream 1305 is missing. That is why the audio decoding section 1208 once stops decoding after having decoded the audio stream for the first interval. Next, the audio stream 1307 for the second interval is input from the second audio input buffer 1206 to the audio decoding section 1208.
In a part of the second interval between the times Tc and Td, the video stream 1308 is missing. That is why the portion of the audio stream between the times Tc and Td is not decoded but discarded. This discarding processing is carried out by the audio decoding section 1208, which shifts the reading address on the second audio input buffer 1206 to an address where a portion of data corresponding to the interval between the times Tc and Td is stored. This discarding processing can be done in a much shorter time than the processing of decoding the audio stream. Thus, the audio decoding section 1208 waits for the AV synchronization control section 1218 to instruct it to restart decoding the audio stream from the time Td on. Meanwhile, before the audio decoding section 1208 enters the standby state of waiting for the instruction to restart decoding from the time Td on, the video decoding section 1215 decodes and outputs the video stream up to the time Tb of the first interval.
Suppose the rest of the video stream from the post-boundary time Td on has been stored in the second video input buffer 1213 when the video stream has been decoded up to the time Tb. In that case, the video decoding section 1215 starts decoding the rest of the video stream from the time Td on immediately after having decoded the video stream up to the time Tb. As a result, the video up to the time Tb and the video from the time Td on are played back continuously. When the video stream restarts being decoded at the time Td, the AV synchronization control section 1218 activates the audio decoding section 1208 that has been in the standby mode, thereby making the decoding section 1208 start to decode the audio stream 1307 at the time Td. In this manner, the video streams can be played back continuously and the audio and video can be output synchronously with each other across the boundary.
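The discarding step of the first conventional example can be sketched as a search for the first audio frame that may be decoded. This is only an illustration under stated assumptions: the audio frames of the second interval are assumed to be held in presentation order, their times are assumed to lie on the same time axis as Td in FIG. 3(b), and the function name `first_audio_index_to_decode` is hypothetical.

```python
from bisect import bisect_left


def first_audio_index_to_decode(audio_times, td):
    """Return the index of the first audio frame presented at or after td.

    Frames before that index correspond to the interval between Tc and Td,
    where no video exists, and are skipped without being decoded.
    """
    return bisect_left(audio_times, td)


if __name__ == "__main__":
    # Hypothetical presentation times of the audio frames of the second interval.
    audio_times = [100, 124, 148, 172, 196]
    td = 150  # hypothetical presentation start time of the video stream 1308
    start = first_audio_index_to_decode(audio_times, td)
    print("discarded frames:", audio_times[:start])
    print("decoding restarts with the frame at:", audio_times[start])
```

Moving a read position in this way involves no decoding work, which is consistent with the statement above that discarding takes much less time than decoding.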
Optionally, video can also be played back continuously across the boundary even by the technique disclosed in Japanese Patent Application Laid-Open Publication No. 2002-281458 or Japanese Patent Application Laid-Open Publication No. 10-164512. For example, according to Japanese Patent Application Laid-Open Publication No. 2002-281458, a portion of the audio stream in the interval between the times Tc and Td shown in FIG. 3(b) is discarded by using presentation time stamps added to the audio streams 1305 and 1307, thereby realizing continuous playback across the boundary. It should be noted that by using the presentation time stamps, if a video stream is missing with respect to an audio stream, then a portion of the audio stream may be discarded. As a result, the load of unnecessarily processing the audio stream can be avoided and the streams of the second interval can be read quickly. Consequently, the video can be played back continuously before and after the boundary.
According to the conventional technique, video can be played back continuously across the boundary but the video may sometimes be out of sync with audio. This problem will be described in detail with reference to FIGS. 4(a) and 4(b).
FIG. 4(a) shows a data stream 1401 for which three intervals are defined by two boundaries. The data stream 1401 includes two dummy packets No. 1 and No. 2. Dummy Packet No. 1 is inserted after an audio packet A11 of the data stream 1402 for the first interval. Thereafter, the data stream 1403 for the second interval is read out. Subsequently, Dummy Packet No. 2 is inserted after the last video packet V22 of the data stream 1403. And then the data stream 1404 for the third interval is read out.
It should be noted that only video packets V21 and V22 are included in the second interval and there are no audio packets there in this case. This means that a short interval corresponding to just several video frames at most is defined as the second interval and that there are no audio packets, which are long enough to be a decodable audio frame, within the data stream 1403 for that interval. Such an interval is generated when a data stream recorded in compliance with the MPEG-2 standard is edited with temporally very short intervals specified.
FIG. 4(b) shows the timing relation between the respective presentation times of audio and video streams 1405 and 1406 for the first interval, a video stream 1407 for the second interval, and audio and video streams 1408 and 1409 for the third interval. In FIG. 4(b), each stream is also supposed to be a stream that has been decoded down to the level of elementary stream (ES) shown in portion (c) of FIG. 1.
First, it will be described how to play back video. Before and after Boundary No. 1, picture data up to the video packet V11 of the first interval is stored in the first video input buffer 1212 and picture data of the video packets V21 and V22 of the second interval is stored in the second video input buffer 1213. All of this data will be decoded sequentially after that to play back video continuously. Subsequently, after Boundary No. 2, the storage location of the video stream for the third interval is switched back to the first video input buffer 1212. Data is decoded under a control similar to that at Boundary No. 1 and video is output continuously.
Next, audio reproducing processing will be described. First, at a time Ta, the audio decoding section 1208 once stops decoding and the storage location of the audio stream is changed from the first audio input buffer 1205 into the second audio input buffer 1206. Next, the data stream of the third interval is read out from the storage medium 121 and the audio stream of the third interval is stored in the second audio input buffer 1206.
The conventional player uses presentation time stamps to decode an audio stream and reproduce audio. If the presentation time stamp provided for the video stream 1407 for the second interval and the presentation time stamp provided for the video stream 1409 for the third interval simply increase (particularly when the values of the presentation time stamps increase monotonically in the interval between the times Tc and Tf), then the processing can be advanced smoothly. The audio decoding section 1208 and audio output section 1210 may stand by until the video decoding section 1215 and video output section 1217 finish their processing at the time Tf. Then, the audio decoding section 1208 and audio output section 1210 may start processing at the time Tf and output audio synchronously with the video.
However, the presentation time stamps provided for the data streams of the respective intervals are not regulated among the streams. That is why it is impossible to determine in advance, or predict, the magnitudes of the presentation time stamp values of the respective intervals. Accordingly, if the playback is controlled in accordance with the presentation time stamps, data that should not be discarded may be lost by mistake and other inconveniences may be caused, thus interfering with desired continuous playback. For example, supposing the value APTS_f of the presentation time stamp of the audio frame to be output at the time Tf is smaller than the value VPTS_c of the presentation time stamp of the video picture to be output at the time Tc, then the conventional player discards the audio stream for the third interval before or while the video of the second interval is played back. Particularly when APTS_f is much smaller than VPTS_c, a huge quantity of data of the audio stream of the third interval is discarded. In that case, even after the video of the third interval has started to be played back, no audio will be output at all.
Also, if the value APTS_f of the presentation time stamp at the time Tf is equal to or greater than the value VPTS_c of the presentation time stamp of the top video picture of the second interval and equal to or less than the value VPTS_d of the presentation time stamp of the last video picture, then the audio of the third interval, which should start being reproduced at the time Tf, starts being reproduced while the video of the second interval is being played back.
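The three situations described above can be summarized in a short sketch. It only classifies the outcomes stated in this description by comparing time stamp values; it is not the control logic of any actual player, and the function name and the returned labels are hypothetical.

```python
def classify_audio_start(apts_f: int, vpts_c: int, vpts_d: int) -> str:
    """Classify what a purely time-stamp-driven player does with the audio
    of the third interval while the video of the second interval is played.

    apts_f: time stamp of the audio frame that should be output at Tf
    vpts_c: time stamp of the top video picture of the second interval
    vpts_d: time stamp of the last video picture of the second interval
    """
    if vpts_c > vpts_d:
        raise ValueError("vpts_c must not exceed vpts_d")
    if apts_f < vpts_c:
        # The audio looks as if it were already in the past, so it is discarded.
        return "audio discarded"
    if apts_f <= vpts_d:
        # The audio looks due during the second interval, so it starts early.
        return "audio starts during second interval"
    # Time stamps keep increasing: the audio can stand by until Tf.
    return "audio waits until Tf"


if __name__ == "__main__":
    # Hypothetical time stamp values chosen only to exercise each branch.
    for apts_f in (1_000, 6_000, 9_000):
        print(apts_f, "->", classify_audio_start(apts_f, 5_000, 8_000))
```

Only the last case yields synchronized output, and which case occurs depends on time stamp values that are not coordinated between the intervals.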
An object of the present invention is to play back audio and video synchronously with each other, with no time lag allowed between them, in playing a plurality of data streams continuously.