(1) Field of the Invention
This invention relates to a text editing and reproduction apparatus, a content editing and reproduction apparatus, and a text editing and reproduction method, and, more particularly, to a text editing and reproduction apparatus for editing and reproducing text data, a content editing and reproduction apparatus for editing and reproducing content composed of video, audio, and text, and a text editing and reproduction method for editing and reproducing text data.
(2) Description of the Related Art
In recent years, content delivery services that deliver various content to terminal units, such as cellular phones, have come into wide use. Stream transmission based on the Moving Picture Experts Group 4 (MPEG4) standard is widely used to provide such content delivery services, and the number of types of products that use MPEG4 has increased.
MPEG4 is an animation format standard for delivering high-quality animation data even over low-speed lines, such as cellular phone or ordinary telephone lines. MPEG4 is expected to be widely used for, for example, digital television (video conferences, video telephones, and the like), delivery of video or music via the Internet or to cellular phones, and interactive media (online games and the like).
A basic media file format prescribed by MPEG4 is called MP4. Content in the MP4 file format includes a header section, where header information such as the conditions under which the media data is reproduced is stored, and a media data section, where the media data itself is stored. When such content is edited by separation and extraction, the video data is usually used as the reference.
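For illustration, the two-part layout of MP4 content described above can be sketched as follows; the class and field names are assumptions made for this sketch and are not taken from the MP4 specification.

```python
from dataclasses import dataclass

@dataclass
class HeaderSection:
    # Header information such as the conditions under which
    # the media data is reproduced (illustrative representation).
    reproduction_conditions: dict

@dataclass
class MediaDataSection:
    # The media data itself: the interleaved coded video, audio,
    # and text (the elementary stream).
    elementary_stream: bytes

@dataclass
class Mp4Content:
    # MP4-format content consists of a header section
    # followed by a media data section.
    header: HeaderSection
    media_data: MediaDataSection
```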
FIG. 12 shows how media data is separated and extracted. Media data includes coded video, audio, and text. That is to say, media data includes video data V, audio data A, and text data T, and is also referred to as an elementary stream (ES).
When the data included in an arbitrary time interval is separated and extracted from the media data, the leading frame of the separated interval should be an intra-coded (I) frame. This is because, when the separated and extracted video data V is reproduced, the leading frame must be reproducible by itself. Accordingly, when media data is separated and extracted, the separation is performed so that an intra-coded (I) frame, which is not coded on the basis of a correlation between frames, becomes the leading frame.
As shown in FIG. 12, for example, it is assumed that the interval between 10 and 20 seconds (interval [10 s, 20 s]) is designated as the interval to be extracted. To meet the above condition (that the leading frame of the video data V included in a separated interval should be an I frame), the interval [9.8 s, 20.3 s] is actually extracted so that the data at 10 seconds and the data at 20 seconds are both included.
If the video data V included in the interval [9.8 s, 20.3 s] is separated and extracted, then the audio data A and the text data T included in the interval [9.8 s, 20.3 s] are also separated and extracted. Accordingly, if the interval [10 s, 20 s] is designated, it is determined that the video data V, the audio data A, and the text data T included in the interval [9.8 s, 20.3 s] should be separated.
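The widening of a designated interval described above can be sketched as follows; the choice of the next I frame as the actual end point (yielding 20.3 s in the example) is an assumption made for this sketch.

```python
import bisect

def extraction_interval(i_frame_times, start, end):
    """Widen the requested interval [start, end] so that it begins at an
    I frame and still contains all the data up to `end` (cf. FIG. 12).
    i_frame_times: sorted timestamps, in seconds, of intra-coded frames."""
    # Move the start back to the latest I frame at or before `start`.
    i = bisect.bisect_right(i_frame_times, start) - 1
    actual_start = i_frame_times[max(i, 0)]
    # Extend the end forward to the next I frame after `end`, if any,
    # so that the data at `end` itself is included (an assumption here).
    j = bisect.bisect_right(i_frame_times, end)
    actual_end = i_frame_times[j] if j < len(i_frame_times) else end
    return actual_start, actual_end

# With I frames at 0, 4.9, 9.8, 14.7, and 20.3 seconds, requesting
# [10 s, 20 s] yields the wider interval [9.8 s, 20.3 s] of the example.
```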
A technique for changing the structure of text data to suit it to streaming has conventionally been proposed (see, for example, Japanese Patent Laid-Open Publication No. 2004-254281, paragraphs [0085]-[0108] and FIG. 1).
As stated above, when media data is separated and extracted, the intervals of the audio data and text data to be separated are determined with the video data as the reference. In many cases, however, the time length of one sample of text data is several seconds, so a separation point may fall within a sample.
The structure of text data will now be described. FIG. 13 shows syntax for text data. The TimedText syntax is shown as an example of the syntax for one sample of text data (a text sample). (Text data in which time information is included in the ornament information is referred to as TimedText.)
TimedText is included in an ES and includes, in order, 4-byte text length information, a text character string, and ornament information. Data size information regarding the text sample, time information which specifies when to display the text sample on a screen, display information which specifies how to display the text sample, and the like are stored in a header section (not shown).
Syntaxes for ornament information differ among ornament methods. Karaoke and scroll delay are given here as examples of ornament information. Karaoke is ornament by which characters are highlighted at designated times (for example, the portion of the lyrics to be sung to the music is displayed in color). In the ornament information for karaoke, the highlight start time is designated first by using four bytes, then the number of entries, that is to say, of highlight portions in the text sample, is designated, and then a set consisting of a highlight end time, a highlight start character, and a highlight end character is repeated as many times as the number of entries.
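The karaoke ornament layout described above can be sketched as a parser. The 4-byte highlight times follow the description above, while the 2-byte widths assumed here for the entry count and the character indices are choices made for this sketch, not taken from the specification.

```python
import struct

def parse_karaoke(data):
    """Parse karaoke ornament information laid out as: a 4-byte highlight
    start time, an entry count (assumed 2 bytes), and, per entry, a 4-byte
    highlight end time plus highlight start/end character indices
    (assumed 2 bytes each). All fields are big-endian."""
    start_time, entry_count = struct.unpack_from(">IH", data, 0)
    entries, offset = [], 6
    for _ in range(entry_count):
        # One entry: (highlight end time, start character, end character).
        end_time, start_char, end_char = struct.unpack_from(">IHH", data, offset)
        entries.append((end_time, start_char, end_char))
        offset += 8
    return start_time, entries
```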
It is assumed that the text data shown in FIG. 12 is a sample displayed for 15 seconds. The text data is then separated into 9.8-second text data and 5.2-second text data. When the video/audio data is edited, the time information in the header section for the text data is edited so that these pieces of text data are displayed for 9.8 seconds and 5.2 seconds respectively. By doing so, the correspondence between the text data and the separated video/audio data is maintained.
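The rewriting of the display durations described above amounts to splitting the sample duration at the separation point; a minimal sketch:

```python
def split_sample_durations(sample_seconds, separation_point):
    """When a text sample is cut at `separation_point` seconds from its
    start, the header time information of the two resulting pieces is
    rewritten to these display durations (cf. the 15-second sample
    separated at 9.8 seconds)."""
    first = separation_point
    second = sample_seconds - separation_point
    return first, second
```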
Traditionally, however, when text data that has been separated and time-modified is packed in a file and then reproduced, continuity is not maintained and the text data is displayed very unnaturally.
This problem will now be described with reference to FIGS. 14 through 16. FIG. 14 shows the operation of displaying the text data before separation. It is assumed that a text sample T0 (Text0) is the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” (a Japanese text corresponding to the English text “the time is just half past ten”), that the text sample T0 is horizontally scrolled from the right to the left of a screen, and that the text sample T0 is displayed for 15 seconds. As shown in FIG. 14, the displayed text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” is normally scrolled from the right to the left of the screen over 15 seconds.
On the other hand, if the text data T is separated at 9.8 seconds with a video I frame as reference, then the text data T is separated into 9.8-second text data and 5.2-second text data.
FIG. 15 shows the operation of displaying the 9.8-second text data. When the video/audio data included in the interval [0 s, 9.8 s] is edited, the text data before the separation point is time-modified to 9.8 seconds. As a result, the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” becomes a sample which is horizontally scrolled from the right to the left of the screen and which is displayed for 9.8 seconds.
In this case, the scroll speed can be calculated in the following way. It is assumed that one row on the screen of the cellular phone is 136 pixels wide and that one character is 12 dots wide. The text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” is made up of 32 characters, so the scroll speed is (136+12×32)/9.8≈53.1 dots/s (1 pixel=1 dot).
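The calculation above can be expressed as a small function; the formula (row width plus text width, divided by display time, with 1 pixel = 1 dot) follows the example given above.

```python
def scroll_speed(row_pixels, char_dots, num_chars, display_seconds):
    """Horizontal scroll speed in dots per second: the text must travel
    the row width plus its own width within the display time."""
    return (row_pixels + char_dots * num_chars) / display_seconds

# The 9.8-second case from the text: a 136-pixel row, 12-dot characters,
# a 32-character text, displayed for 9.8 seconds.
speed = scroll_speed(136, 12, 32, 9.8)  # about 53.1 dots/s
```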
FIG. 16 shows the operation of displaying the 5.2-second text data. When the video/audio data included in the interval [9.8 s, 20.3 s] is edited, the text data following the separation point is time-modified to 5.2 seconds. As a result, the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” becomes a sample which is horizontally scrolled from the right to the left of the screen and which is displayed for 5.2 seconds. In this case, the scroll speed is (136+12×32)/5.2=100 dots/s.
The media data included in the interval [0 s, 9.8 s] is packed in a file (file f1) and the media data included in the interval [9.8 s, 20.3 s] is packed in a file (file f2). When the two files f1 and f2 are reproduced in succession, the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” is displayed twice at different scroll speeds (the first text is displayed and scrolled for 9.8 seconds, and the second text is displayed and scrolled for 5.2 seconds). Accordingly, unnatural reproduction is performed. (If the text data is separated into 14-second text data and 1-second text data, the second text is displayed and scrolled for only 1 second. In this case, the second text disappears from the screen in a short time, so the user finds the display deeply unnatural.)
The most natural method for displaying the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” is as follows. When the two files f1 and f2 are reproduced in succession, the part of the text packed in the file f1 should be displayed for 9.8 seconds and the rest of the text packed in the file f2 should be displayed for 5.2 seconds. That is to say, the text “GENZAIJIKOKU WA 10JI30PUN CHODO DESU” packed in the files f1 and f2 should be displayed once and scrolled at a speed of 34.7 (=(136+12×32)/15) dots/s for a total of 15 seconds.
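The three scroll speeds discussed above can be compared with a small sketch of the same formula, making the discontinuity between the separate and continuous cases explicit.

```python
def scroll_speed(row_pixels, char_dots, num_chars, seconds):
    """Scroll speed in dots/s (1 pixel = 1 dot, as in the example)."""
    return (row_pixels + char_dots * num_chars) / seconds

# Separate reproduction of files f1 and f2: two different speeds.
v_f1 = scroll_speed(136, 12, 32, 9.8)  # about 53.1 dots/s
v_f2 = scroll_speed(136, 12, 32, 5.2)  # about 100 dots/s
# Natural, continuous display across both files: one speed throughout.
v_all = scroll_speed(136, 12, 32, 15)  # about 34.7 dots/s
```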
Traditionally, however, when media data is edited by performing separation and extraction with the video data as the reference, the text data is simply time-modified with reference to the video/audio data. As a result, when the text data is reproduced, continuity is not maintained and the text data is displayed unnaturally. Moreover, the display of the text data is not synchronized with the video and audio. These problems are not taken into consideration at all in the conventional technique (Japanese Patent Laid-Open Publication No. 2004-254281).