The invention relates to storing and/or reading transport protocol data packets and associated side information to and/or from a file having a media data container and a metadata container, such as a file based on the ISO (International Organization for Standardization) base media file format.
Various electronic devices are enabled to receive and present media data streams. Such media data streams can e.g. be received from a digital video broadcasting network that broadcasts media streams in accordance with e.g. the DVB-H Standard (Digital Video Broadcasting—Handhelds) or the DVB-T Standard (Digital Video Broadcasting—Terrestrial).
DVB-T uses a self-contained MPEG-2 (MPEG=Moving Picture Experts Group) transport stream containing elementary MPEG-2 video and audio streams according to the international standard ISO/IEC 13818 (IEC=International Electrotechnical Commission). The MPEG-2 transport stream is a multiplex used in many of today's broadcast systems. It is a stream multiplex of one or more media programs, typically audio and video but also other data. MPEG-2 transport streams share a common clock and use time-stamped media samples (Access Units, AUs) in all media streams. This enables synchronization of sender and receiver clocks and lip synchronization of audio and video streams.
For DVB-H, elementary audio and video streams are encapsulated in RTP (Real-Time Transport Protocol), UDP (User Datagram Protocol), IP (Internet Protocol), and MPE (Multi-Protocol Encapsulation) for IP data casting. RTP is used for effective real-time delivery of multimedia data over IP networks. Multiplexing is typically done by associating a different network port with each distinct media stream, e.g. one network port for video and another one for audio. Different media usually stem from different sources having different clocks or clock rates. For example, audio samples have a sample rate that depends on the clock rate of an audio sampling device, whereas the frame rate of video frames depends on the clock rate of a video frame grabbing device. Such clocks can have inherent frequency errors greater than a few hundred parts-per-million, resulting in accumulated errors of tens of seconds per day. The term “clock skew” is defined as this difference between a clock's actual oscillator frequency and its nominal frequency. If a sender's clock operates faster than a receiver's clock, this can lead to packet accumulation at the receiver. If the sender's clock operates slower than the receiver's clock, the receiver buffers will underfill. Thus, if the receiver clock rate differs from the sender clock rate, the receiver buffer(s) will either gradually fill or gradually empty. Further, clock skew may lead to a de-synchronization between related audio and video samples at the receiver.
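The arithmetic behind these figures can be sketched as follows. This is an illustrative calculation only; the function names and the example values (300 ppm, a 48 kHz audio clock) are assumptions chosen for the sketch and do not come from any standard.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_seconds_per_day(skew_ppm: float) -> float:
    """Timing error accumulated over one day by a clock that is off by skew_ppm."""
    return skew_ppm * 1e-6 * SECONDS_PER_DAY


def buffer_drift_samples(nominal_rate_hz: float, sender_ppm: float,
                         receiver_ppm: float, duration_s: float) -> float:
    """Net number of samples gained (positive) or lost (negative) by the
    receiver buffer when sender and receiver clocks deviate from nominal."""
    sender_rate = nominal_rate_hz * (1 + sender_ppm * 1e-6)
    receiver_rate = nominal_rate_hz * (1 + receiver_ppm * 1e-6)
    return (sender_rate - receiver_rate) * duration_s


# A 300 ppm frequency error accumulates to roughly 26 seconds per day.
print(drift_seconds_per_day(300))
# A fast sender (+100 ppm) and a slow receiver (-100 ppm) at 48 kHz:
# the receiver buffer grows by about 34560 samples per hour.
print(buffer_drift_samples(48_000, 100, -100, 3600))
```

A positive result of `buffer_drift_samples` corresponds to the packet accumulation case described above, a negative result to the underfill case.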
RTCP (Real-Time Transport Control Protocol) allows clock recovery and synchronization for RTP streams. An RTCP channel is associated with each RTP stream and comprises control information from sender to receiver in the form of sender reports (SR), and vice versa. Each RTCP SR includes two timestamps: an NTP (Network Time Protocol) timestamp of the sender's system clock (reference time) and a corresponding media timestamp of the associated RTP stream. These RTCP SRs are sent for both audio and video. From the values of the RTP and NTP times, the RTP packets may be placed on a common time line and the media may be perfectly synchronized.
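The mapping from an RTP timestamp onto the common time line can be sketched as below. This is a minimal sketch under stated assumptions: the `SenderReport` class and `rtp_to_wallclock` function are hypothetical names, the NTP time is represented as plain floating-point seconds rather than the 64-bit wire format, and a single sender report per stream is assumed to be sufficient (no clock drift).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    ntp_seconds: float   # sender wallclock at the report instant (assumed float seconds)
    rtp_timestamp: int   # 32-bit RTP timestamp sampled at the same instant


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport, clock_rate_hz: int) -> float:
    """Map a 32-bit RTP timestamp to sender wallclock time using one sender report."""
    # Interpret the difference as a signed 32-bit value so that timestamp
    # wraparound between the report and the packet is handled.
    diff = (rtp_ts - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr.ntp_seconds + diff / clock_rate_hz


# Video (90 kHz clock) and audio (48 kHz clock) start at unrelated RTP offsets,
# yet both packets below map to the same instant on the sender's time line.
video_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=3_000_000)
audio_sr = SenderReport(ntp_seconds=1000.25, rtp_timestamp=5_000)
video_time = rtp_to_wallclock(3_000_000 + 45_000, video_sr, 90_000)
audio_time = rtp_to_wallclock(5_000 + 12_000, audio_sr, 48_000)
print(video_time, audio_time)
```

Once both streams are expressed on the sender's time line, audio and video samples with equal wallclock times can be presented together.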
A streaming service is defined as a set of synchronized media streams delivered in a time-constrained or unconstrained manner for immediate consumption during reception. Each streaming session may comprise audio, video and/or real-time media data such as timed text. A user receiving media data for a movie by means of a mobile television, for instance, can watch the movie and/or record it to a file. Commonly, for this purpose the data packets of the received media stream are de-packetized in order to store raw media data to the file. That is, received RTP packets or MPEG-2 packets are first de-packetized to obtain their payload in the form of media data samples. After de-packetizing, the obtained media data samples are replayed or stored to the file. The obtained media samples are commonly compressed with formats such as the H.264/AVC (AVC=Advanced Video Coding) video format and/or the MPEG-4 HE-AACv2 (HE-AACv2=High-Efficiency Advanced Audio Coding version 2) audio format. When media data samples having such video and/or audio formats are to be stored, they may be stored in the so-called 3GP file format, also known as the 3GPP (3rd Generation Partnership Project) file format, or in the MP4 (MPEG-4) file format. Both 3GP and MP4 are derived from the ISO base media file format, which is specified in the ISO/IEC international standard 14496-12:2005 “Information technology - Coding of audio-visual objects - Part 12: ISO base media file format”. A file of this format comprises media data and metadata. For such a file to be operable, both of these data must be present. The media data is stored in a media data container (mdat) related to the file and the metadata is stored in a metadata container (moov) of the file. Conventionally, the media data container comprises the actual media samples, i.e. it may comprise, for example, interleaved, time-ordered video and/or audio frames.
Each medium has its own metadata track (trak) in the metadata container moov that describes the properties of the media content. Additional containers (also called boxes) in the metadata container moov may comprise information about file properties, file content, etc.
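The container layout described above can be illustrated with a small box walker. This is a minimal sketch, assuming the common box header of the ISO base media file format (a 32-bit big-endian size followed by a four-character type, with size 1 signalling a 64-bit size and size 0 meaning "extends to the end"). The helper names `iter_boxes` and `make_box` are hypothetical, and the sample data is synthetic rather than a valid playable file.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, offset: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:                      # 64-bit size follows the type field
            if offset + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:                    # box extends to the end of the data
            size = end - offset
        if size < header or offset + size > end:
            break                          # malformed box, stop walking
        yield box_type.decode("ascii", "replace"), offset + header, offset + size
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box with a 32-bit size header (helper for the synthetic example)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file: one metadata container with two tracks, one media data container.
sample = (make_box(b"ftyp", b"isom\x00\x00\x00\x00")
          + make_box(b"moov", make_box(b"trak", b"") + make_box(b"trak", b""))
          + make_box(b"mdat", b"\x00" * 10))

top_level = list(iter_boxes(sample))
for name, start, stop in top_level:
    print(name, stop - start)
    if name == "moov":
        for child, _, _ in iter_boxes(sample, start, stop):
            print("  ", child)
```

The same walker can be applied recursively to the payload of any container box, which is how the trak boxes inside moov are listed in the example.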
Recently, so-called reception hint tracks for files based on the ISO base media file format have been defined by international standardization groups. These reception hint tracks may be used to store multiplexed and/or packetized streams, such as a received MPEG-2 transport stream or RTP packets. Reception hint tracks may be used for client-side storage and playback of received data packets. Received MPEG-2 TS or RTP packets of one stream are stored directly in reception hint tracks, e.g. as pre-computed samples or constructors.
There are three advantages of this approach, compared to de-multiplexing and/or de-packetizing data packets and then writing separate media tracks for every elementary media stream (audio and/or video). Firstly, it lowers the required complexity of a receiving device during storage, because no de-multiplexing or other processing of the received data packets is necessary. Only file storage of the received data packets in unmodified form is performed. Secondly, in some cases it is not possible at all to de-multiplex the received data packets to separate media tracks, especially if the media is encrypted at the transport/multiplex level or the packetization scheme is unknown. Thirdly, time-shifting, i.e. writing to the file and immediately reading from the same file with a variable time offset, in a PVR (PVR=Personal Video Recorder) application is made easier because of the first two points.
Playback from reception hint tracks may be done by emulating normal stream reception, i.e. by reading the stored data packets from the reception hint track as if they were being received over IP. Reception hint tracks, like all hint tracks, have transport timing, in contrast to media tracks, which have media playback timing. Therefore, a reception timestamp of the receiving device is associated with each data packet stored in a reception hint track.
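Emulated reception from stored packets can be sketched as a simple re-timing step. This is an illustrative sketch only: `StoredPacket` and `replay_schedule` are hypothetical names, reception timestamps are assumed to be in seconds on the receiving device's clock, and the actual hint track sample format is not modelled.

```python
from typing import Iterable, List, NamedTuple, Tuple


class StoredPacket(NamedTuple):
    reception_time: float  # receiver clock time at which the packet arrived (seconds)
    payload: bytes         # unmodified packet as received


def replay_schedule(packets: Iterable[StoredPacket],
                    playback_start: float) -> List[Tuple[float, bytes]]:
    """Return (delivery_time, payload) pairs that reproduce the original
    inter-arrival spacing, shifted so the first packet is delivered at playback_start."""
    ordered = sorted(packets, key=lambda p: p.reception_time)
    if not ordered:
        return []
    first = ordered[0].reception_time
    return [(playback_start + (p.reception_time - first), p.payload) for p in ordered]


recorded = [
    StoredPacket(10.0, b"pkt-0"),
    StoredPacket(10.25, b"pkt-1"),
    StoredPacket(10.5, b"pkt-2"),
]
for when, payload in replay_schedule(recorded, playback_start=100.0):
    print(when, payload)
```

A player would hand each payload to its normal reception path at the computed delivery time, so the stored jitter and spacing are reproduced as they occurred during recording.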
RTP hint tracks in server-side files store only RTP media data packets from one stream and do not contain corresponding side or control information, such as RTCP information or key messages. RTCP information is generated on the fly by a streaming server, because it describes the current state of the streaming situation, e.g. the timing.
Streaming receivers may recover the sender's system clock from reception times and align the receiver's system clock to the sender's system clock in order to avoid buffer overflow or under-run during direct playback. Due to jitter in the arrival time (network jitter) of RTP packets or RTCP sender report packets, whichever of these is used for clock recovery, instant clock recovery is not possible. Independent audio and video capture units with unsynchronized sampling clocks may lead to drifting RTP clocks even though the media timestamps increase constantly at a fixed rate. RTCP SRs carry the NTP and RTP timestamps for each of the streams and can therefore be used to extract the drift of the involved devices. In many systems there is jitter involved in the creation of RTCP SRs, specifically in the relationship between NTP and RTP clocks. It is therefore common that streaming clients do not achieve perfect lip-synchronization instantly after startup, but need to take a certain number of RTCP SRs into account before lip-synchronization between video and audio streams is accurate. If the sender's system clock needs to be recovered and there is high network jitter, then a certain number of RTP packets or RTCP sender report packets, whichever of these is used for clock recovery, is needed as well. Network jitter and clock drift may be recalculated during real-time stream reception using the information of multiple RTCP SRs as described above, in addition to the RTP timestamps of the related data packets.
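Why several sender reports are needed can be illustrated by estimating the effective RTP clock rate from a series of (NTP, RTP) pairs. This is a minimal sketch, not the method of any particular receiver: an ordinary least-squares line fit is assumed as the estimator, the function names are hypothetical, and the RTP timestamps are assumed to be already unwrapped to monotonically increasing integers.

```python
from typing import Sequence, Tuple


def estimate_clock_rate(reports: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of rtp = rate * ntp + offset over several sender reports.

    reports holds (ntp_seconds, unwrapped_rtp_ticks) pairs.
    Returns (rate_in_ticks_per_second, offset_in_ticks).
    """
    n = len(reports)
    if n < 2:
        raise ValueError("at least two sender reports are needed")
    mean_x = sum(x for x, _ in reports) / n
    mean_y = sum(y for _, y in reports) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in reports)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in reports)
    if sxx == 0:
        raise ValueError("sender reports must have distinct NTP times")
    rate = sxy / sxx
    return rate, mean_y - rate * mean_x


def skew_ppm(estimated_rate_hz: float, nominal_rate_hz: float) -> float:
    """Deviation of the estimated clock rate from nominal, in parts-per-million."""
    return (estimated_rate_hz / nominal_rate_hz - 1.0) * 1e6


# Four reports, five seconds apart, from a nominally 90 kHz clock running 200 ppm fast.
reports = [(0.0, 0), (5.0, 450_090), (10.0, 900_180), (15.0, 1_350_270)]
rate, offset = estimate_clock_rate(reports)
print(rate, skew_ppm(rate, 90_000))
```

With jittered reports, the fit averages out the per-report error, which is why the estimate improves as more sender reports become available.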
Currently, RTP reception hint tracks are specified to store only the received data packets of a media stream and do not contain the corresponding RTCP SRs or the timing information from the sender reports. The RTP timestamp of a received RTP packet alone is insufficient to synchronize media data received from different streams. This is because each media stream generally assigns random values to its initial timestamp and initial sequence number, and the timestamp's clock frequency depends on the format of the media data carried. The arrival or reception time of the RTP packets may be used to synchronize between streams. The problem with this approach, however, is that RTP neither guarantees packet delivery nor prevents out-of-order delivery. As a result, synchronization based on the reception time alone cannot guarantee accuracy.
As described above, the most accurate method of synchronization between different RTP streams requires waiting for the associated RTCP SRs, which contain information enabling conversion between an RTP timestamp and a timestamp in the NTP format that is common to all streams. These RTCP sender reports are usually sent about every five seconds for each stream at a given bit-rate, with the time interval between two RTCP SRs depending on the bit-rate.
Hence, playback of RTP reception hint tracks with accurate timing and lip synchronization is only possible in the following two cases: Firstly, there is no clock drift between the different media clocks and RTCP sender report interstream synchronization data are available for each received RTP packet. This, however, corresponds to an ideal situation which is very unlikely to occur in real environments. Secondly, the receiving device has to take the timing information of the RTCP SRs into account during storage by adjusting the RTP timestamps of the received RTP packets before storing them.
The first case is only theoretical and does not occur in practice. The second case puts a high burden on the receiver, since e.g. buffering of the received streams for some seconds would be needed in order to take several RTCP SRs into account for the timing adjustment. This would also affect the ability to read instantly from the same file in time-shifting applications. Furthermore, the original reception situation cannot be recreated after storage, i.e. long-term jitter cannot be removed in a processing stage after the complete stream has been received and recorded.
Current broadcast systems use key streams (either in-band or out-of-band) for transporting protected keys as side-information that are used for decrypting media data of the related data packets. Typically there is only a loose coupling between a key stream and an encrypted media data stream and not a timing relation.
In DVB-H and OMA-BCAST (Open Mobile Alliance Mobile Broadcast Services), a key stream is defined as a separate stream of key messages sent on a different UDP port than the associated media stream. Every key message is sent as a single UDP packet. OMA-BCAST calls these messages short-term key messages (STKM); DVB-H calls them key stream messages (KSM). Storing key messages does not harm the security of a streaming system, because every key message is bound to the subscription of a streamed service and can therefore only be accessed by authorized subscribers/devices. The actual cryptographic key inside the key message is protected with the service or program key.
Each key has an associated key indicator (key ID), which is also indicated at the associated encrypted media access unit. A decryptor checks whether the key associated with the key ID indicated in the encrypted access unit is available.
Synchronization of encrypted media access units and associated key messages is handled by frequently sending the keys with overlapping validity periods. The key is sent prior to the encrypted video packet, marked with the corresponding key indicator. The key is then valid at least as long as the media data is using this particular key.
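The loose coupling through key IDs can be sketched as a lookup over an interleaved sequence of key messages and encrypted access units. This is an illustrative sketch only: the event tuple layout and the name `match_keys` are hypothetical, no real STKM/KSM parsing or decryption is performed, and keys are represented as opaque byte strings.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (kind, key_id, data): kind is "key" for a key message or "au" for an
# encrypted access unit; data is the protected key or the encrypted payload.
Event = Tuple[str, int, bytes]


def match_keys(events: Iterable[Event]) -> List[Tuple[bytes, Optional[bytes]]]:
    """Pair every access unit with the key announced for its key ID.

    Returns (payload, key) pairs in arrival order; key is None when no key
    message with the matching key ID has been received yet.
    """
    known: Dict[int, bytes] = {}
    result: List[Tuple[bytes, Optional[bytes]]] = []
    for kind, key_id, data in events:
        if kind == "key":
            known[key_id] = data          # repeated key messages simply refresh the entry
        elif kind == "au":
            result.append((data, known.get(key_id)))
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return result


events = [
    ("key", 1, b"K1"),
    ("au", 1, b"frame-a"),
    ("key", 2, b"K2"),       # next key sent early, validity overlaps with key 1
    ("au", 1, b"frame-b"),
    ("key", 1, b"K1"),       # keys are repeated frequently
    ("au", 2, b"frame-c"),
    ("au", 3, b"frame-d"),   # key 3 was never received
]
pairs = match_keys(events)
for payload, key in pairs:
    print(payload, key)
```

Because the next key is announced before the first access unit that uses it, the lookup succeeds without any shared time line between the key stream and the media stream.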
Storage of the keys as a media track during recording of the file is not practicable, since no media timing is associated with the key messages in the stream. A media timing association between the keys and the corresponding encrypted access units can only be made after processing and analyzing the key IDs, which provide the coupling between the key stream and the media stream. Only after this analysis is it clear which key is used for which access unit or video frame.