The SA (Services and System Aspects) WG4 group of 3GPP (Third Generation Partnership Project), an organization that develops global standards for third-generation mobile communications (W-CDMA), has developed the multimedia distribution standard TS26.234. Version 5.2.0 of TS26.234 extends the MP4 (ISO/IEC 14496-1:2001) file format used in download-type multimedia distribution and defines a data structure for text data (timed text). This makes it possible to play not only video and audio but also text in a service that downloads and plays the MP4 file.
Text is a very important means of conveying information, because the information can be delivered to the user directly and the amount of data can be extremely small compared with video. In the aforementioned service that downloads and plays the MP4 file, the text is transmitted as an independent track rather than being combined with the video, coded, and transmitted together. This reduces cases in which the text becomes illegible because it is degraded, and makes it possible to convey information efficiently.
Moreover, in the timed text defined by 3GPP, a part of the text can be modified or moved, and a link to a URL can be attached to a character string (style, highlight, karaoke, text box, blink, scroll, hyperlink, and the like). This allows the information to be presented in various forms of expression.
Here, the data structure of timed text defined by 3GPP is explained using FIG. 1.
An MP4 file 10 includes a header section 20 and a data section 30. The header section 20 includes a track header 40, a sample description 50, and a sample table 60. The data section 30 includes text samples 70, 71 . . . .
The track header 40 is information relating to playback of the timed text, and includes information of the layout (size of display region, relative position with video), layer (hierarchical relationship with other media such as video and the like), playback time of the timed text, file playback time and date, and a time scale of Time-to-Sample-box 61 to be described later, and the like.
The sample description 50 includes multiple sample entries 51, 52 . . . .
The sample entries 51, 52 . . . are information relating to a default format of the text samples 70, 71 . . . including the presence or absence of a scroll and its direction, horizontal and vertical justification positions, background color, font name, font size, and the like.
The sample table 60 includes a Time-to-Sample-box 61, a sample-size-box 62, and a sample-to-chunk-box 63. The Time-to-Sample-box 61 includes information 65, 66 . . . relating to the playback time of the text samples 70, 71 . . . in the order of arrangement of the text samples 70, 71 . . . . The time scale of the values stored as information 65, 66 . . . is designated by the track header 40. More specifically, the track header 40 stores, as the time scale, the number of time units per second. For example, when the value of the time scale stored by the track header 40 is [1000], a resolution of 1/1000 second is obtained. Accordingly, the playback times of the text samples 70, 71 . . . in seconds are obtained by dividing information 65, 66 . . . by the value of the time scale stored by the track header 40. For example, when the value of the time scale is [1000], a value [3400] indicated by information 66 means that the text sample 71 is played for 3.4 seconds. The following explanation assumes that the value of the time scale is [1000]. The sample-size-box 62 includes information 67, 68 . . . relating to the data lengths of the text samples 70, 71 . . . in the order of arrangement of the text samples 70, 71 . . . . This makes it possible for the playing side to detect the boundaries between the respective text samples 70, 71 . . . . The sample-to-chunk-box 63 includes information that associates the text samples 70, 71 . . . with the sample entries 51, 52 . . . .
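The time-scale conversion and the boundary detection described for the sample table can be illustrated with a minimal sketch. The function names are hypothetical and are not taken from TS26.234, the sample sizes are invented for illustration, and the samples are assumed to be stored contiguously in order.

```python
from fractions import Fraction
from itertools import accumulate


def duration_seconds(duration_units: int, time_scale: int) -> Fraction:
    """Convert a Time-to-Sample value to seconds by dividing by the time scale."""
    if time_scale <= 0:
        raise ValueError("time scale must be positive")
    return Fraction(duration_units, time_scale)


def sample_boundaries(sample_sizes):
    """Return (start, end) byte offsets of each sample relative to the first one.

    Assumes (for illustration) that samples are laid out back to back.
    """
    ends = list(accumulate(sample_sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


TIME_SCALE = 1000            # value assumed in the explanation above
durations = [1000, 3400]     # information 65 and 66
sizes = [24, 40]             # invented data lengths (information 67 and 68)

for units in durations:
    print(float(duration_seconds(units, TIME_SCALE)), "s")
print(sample_boundaries(sizes))
```

Exact fractions are used for the division so that a value such as 3400/1000 is not affected by floating-point rounding until it is printed.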
The text sample 70 includes a text 75, a data length 76 of the text 75, and a modifier 77. The modifier 77 is information on an optional format of the text 75, and information for playing the text 75 by highlight, karaoke, blink, hyperlink, and the like. Since the other text samples 71 . . . have the same data structure as that of the text sample 70, the explanation is omitted.
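The relationship among the elements of FIG. 1 can be summarized in a rough sketch. All class and field names below are hypothetical, only a few representative fields are shown, and UTF-8 text encoding is an assumption made for illustration; this is not the binary layout defined by the standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleEntry:
    """Default format of a text sample (sample entries 51, 52 ...)."""
    font_name: str = "Serif"
    font_size: int = 12


@dataclass
class TextSample:
    """Text sample 70: text 75, its data length 76, and modifier 77."""
    text: str
    modifier: bytes = b""

    @property
    def text_length(self) -> int:
        # Data length of the text, assuming UTF-8 encoding.
        return len(self.text.encode("utf-8"))


@dataclass
class SampleTable:
    """Sample table 60, holding one value per text sample in each list."""
    time_to_sample: List[int] = field(default_factory=list)   # box 61
    sample_sizes: List[int] = field(default_factory=list)     # box 62
    sample_to_entry: List[int] = field(default_factory=list)  # box 63


@dataclass
class Header:
    """Header section 20."""
    time_scale: int                       # carried by track header 40
    sample_entries: List[SampleEntry]     # sample description 50
    sample_table: SampleTable


@dataclass
class Mp4File:
    """MP4 file 10: header section 20 plus data section 30."""
    header: Header
    samples: List[TextSample]

    def default_entry_for(self, index: int) -> SampleEntry:
        """Look up the default format associated with a text sample."""
        entry_index = self.header.sample_table.sample_to_entry[index]
        return self.header.sample_entries[entry_index]
```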
A specific explanation is next given of playback of the timed text using FIG. 2.
First of all, a specific structure of the sample entry 51 is explained with reference to FIG. 2A. The other sample entries 52 . . . have the same structure, and their explanation is omitted. The sample entry 51 includes the presence or absence of the scroll and its direction (“Display Flags”), horizontal and vertical justification positions (“Horizontal justification,” “Vertical justification”) in a display region, a background color (“bgColor”) designated by RGB values and transparency, a display region (“TextBox”), a font name (“fontTable,” “font ID”), a font size (“fontSize”), a style (“faceStyle”) such as bold, italic, underline, etc., and a font color (“textColor”) designated by RGB values and transparency. Additionally, the data (“startChar,” “EndChar”) that designates the range to which this format is applied always takes a value of [0], which shows that the format designated by the sample entry 51 is applied to the whole range of text in the text sample. The values of the sample entry 51 shown in FIG. 2A mean that the default format of the text 75 is designated so that the background color is white, the font color is black, and the style is normal.
An explanation is next given of the specific structure of the modifier 77 with reference to FIG. 2B. The modifier 77 includes a data length (“modifierSize”) of the modifier 77, a designation (“modifierType,” “entryCount”) of an optional format of the text 75, a designation (“startChar,” “EndChar”) of the range of the text 75 to which the optional format is applied, a font name (“font ID”), a font size (“fontSize”), a style (“faceStyle”) such as boldface, italic, underline, etc., and a font color (“textColor”) designated by RGB values and transparency. The designation of this optional format takes priority over the format designated by any one of the sample entries 51, 52 . . . . The respective values of the modifier 77 shown in FIG. 2B mean that the fifth to eighth characters of the text 75 are expressed in boldface type.
FIG. 2C illustrates a playback state of the text sample 70 to which the aforementioned format is applied. For example, when the content indicated by the text 75 is [It's fine today], [fine], which corresponds to the fifth to eighth characters, is played in boldface type. Moreover, the value [1000] of information 65, which is arranged first in the Time-to-Sample-Box 61, shows that the playback time is 1000 [milliseconds] (FIG. 1).
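The priority of the optional format over the default format can be sketched as follows. The function and class names are hypothetical. The range values 5 and 9 are an assumption: they treat the start of the range as a zero-based character offset and the end as the offset just past the last styled character, which is one reading under which [fine] is selected from [It's fine today].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    """A very small stand-in for the format fields (faceStyle only)."""
    bold: bool = False
    italic: bool = False


def styled_runs(text, default, override, start_char, end_char):
    """Split text into (substring, style) runs.

    The override style applies to text[start_char:end_char] and takes
    priority over the default style; the rest keeps the default style.
    """
    if not 0 <= start_char <= end_char <= len(text):
        raise ValueError("range lies outside the text")
    runs = [
        (text[:start_char], default),
        (text[start_char:end_char], override),
        (text[end_char:], default),
    ]
    # Drop empty runs so that a range at either edge yields no blank pieces.
    return [(chunk, style) for chunk, style in runs if chunk]


for chunk, style in styled_runs("It's fine today", Style(), Style(bold=True), 5, 9):
    print(repr(chunk), "bold" if style.bold else "normal")
```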
At the time of playing the MP4 file having the aforementioned structure, the MP4 file is downloaded in advance by a receiving terminal, and the MP4 file is played by the receiving terminal after completion of the download. TCP, which is a reliable transmission protocol, is normally used in downloading the MP4 file, and it is guaranteed that the MP4 file is received in a complete form by the receiving terminal.
Meanwhile, in services that distribute media data including video and audio, streaming distribution is increasingly adopted in place of the download type. In streaming distribution, the process of receiving media data at the receiving terminal and the process of playing the received media data are performed in parallel. This has the advantage that the waiting time from when the media data is requested until playback starts is reduced even when long media data is played. Moreover, this distribution format is suitable for distributing media data that is broadcast live.
In such streaming distribution, RTP/UDP is used as the transmission protocol for transmitting media data in place of TCP. TCP is a reliable protocol that ensures transmission of data, while RTP/UDP is an unreliable protocol that excels in real-time performance and is suitable for streaming distribution.
As a scheme for transmitting static media such as text and static images using RTP, there is the Generic RTP Payload Format for Time-lined static Media. In this scheme, a duration header is provided to express the playback time (duration), so that the playback time is sent to the receiving side. Moreover, the use of RTP instead of TCP makes real-time transmission of the static media possible.
However, in stream-type distribution using RTP/UDP, a packet including media data is sometimes lost on a wired network or a radio transmission path, so that the text to be played cannot be displayed. Since the receiving terminal receives no data both when a packet is lost and when no media data to be played next has been transmitted, there is a problem that the receiving terminal cannot determine whether there is no media data to be displayed next or whether media data was lost in transmission and therefore cannot be displayed. For this reason, it is impossible to notify the user of the loss of media data by displaying a message such as “data cannot be received now.”
Meanwhile, in streaming using RTP, packet loss can occur depending on the condition of the transmission path. In packet transmission using RTP, a packet loss is detected from the sequence number (SN) assigned to each RTP packet. Namely, when a packet whose SN is 5 is received while the packet whose SN is 4 has not been received, it is determined that the RTP packet whose SN is 4 has been lost. In the case of continuous media such as speech and video data, the transmission interval between RTP packets is short, about several tens of milliseconds to 100 milliseconds, so this packet loss determination method is practical. Where packet loss has a large influence on quality, a retransmission request is issued after the packet loss is determined, thereby making it possible to prevent quality deterioration. In this case, in order to absorb the delay due to retransmission, a pre-buffering time for obtaining 2 to 3 seconds of data in advance is generally provided before playback of the media starts.
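The sequence-number check described above can be sketched as follows. The function name is hypothetical. RTP sequence numbers are 16 bits wide and wrap around, so the gap is computed modulo 65536; treating a very large gap as a duplicate or reordered packet rather than as a loss is a simplifying assumption of this sketch.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16-bit and wrap to 0 after 65535


def missing_sequence_numbers(last_sn: int, new_sn: int):
    """Return the sequence numbers skipped between two received packets.

    last_sn is the SN of the previously received packet and new_sn the SN
    of the packet that has just arrived.
    """
    gap = (new_sn - last_sn) % SEQ_MOD
    if gap == 0 or gap >= SEQ_MOD // 2:
        # Duplicate or late (reordered) packet: report no new loss.
        return []
    return [(last_sn + step) % SEQ_MOD for step in range(1, gap)]


print(missing_sequence_numbers(3, 5))      # packet 4 was skipped
print(missing_sequence_numbers(65534, 1))  # gap across the wrap-around
```

Note that the loss of a packet becomes visible only when a later packet arrives, which is why the detection time depends on the transmission interval.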
However, when streaming using RTP is applied to text media such as timed text and to static media including JPEG data, the following problems occur. Since the playback time of static media, that is, the time for which the same text or the same static image is displayed, is generally a few seconds to a dozen or so seconds, the RTP packet transmission interval also becomes a few seconds to a dozen or so seconds. The time required for packet loss detection is equal to the RTP packet transmission interval and is therefore longer than the general pre-buffering time. Accordingly, it is difficult to absorb the time required for packet loss detection within the pre-buffering time. Moreover, if the pre-buffering time is increased to, for example, about 10 to 20 seconds, there is a problem that user comfort is severely impaired.
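The comparison between the loss detection time and the pre-buffering time can be illustrated with simple arithmetic. The function names and the retransmission delay parameter are hypothetical, and the figures are example values taken from the ranges mentioned above rather than measurements.

```python
def loss_detection_delay(transmission_interval_s: float) -> float:
    """Time until a loss is noticed.

    A gap in sequence numbers is seen only when the next packet arrives,
    so the delay is taken to be one transmission interval.
    """
    return transmission_interval_s


def prebuffer_absorbs_loss(transmission_interval_s: float,
                           prebuffer_s: float,
                           retransmission_delay_s: float = 0.0) -> bool:
    """Check whether detection plus retransmission fits in the pre-buffer."""
    needed = loss_detection_delay(transmission_interval_s) + retransmission_delay_s
    return needed <= prebuffer_s


# Continuous media: packets about every 100 ms, 3 s of pre-buffering.
print(prebuffer_absorbs_loss(0.1, 3.0))
# Timed text: one packet about every 10 s, the same 3 s of pre-buffering.
print(prebuffer_absorbs_loss(10.0, 3.0))
```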