The general principle of Media streaming over HTTP is illustrated in FIG. 3. Most of the new protocols and standards for adaptive media streaming over HTTP are based on this principle.
A media server 300 streams data to a client 310. The media server stores media presentations. For example, media presentation 301 contains audio and video data. Audio and video may be interleaved in the same file. The way the media presentation is built is described in what follows with reference to FIG. 4a. The media presentation is temporally split into small independent and consecutive temporal segments 302a, 302b and 302c, such as MP4 segments, that can be addressed and downloaded independently. The downloading addresses (HTTP URLs) of the media content for each of these temporal segments are set by the server and communicated to the client. Each temporal segment of the audio/video media content is associated with one HTTP address.
The media server also stores a manifest file document 304 (described in what follows with reference to FIG. 5) that describes the content of the media presentation including the media content characteristics (e.g. the type of media: audio, video, audio-video, text etc.), the encoding format (e.g. the bitrate, the timing information etc.), the list of temporal media segments and associated URLs. Alternatively, the document contains template information that makes it possible to rebuild the explicit list of the temporal media segments and associated URLs. This document may be written using the eXtensible Markup Language (XML).
The manifest file is sent to the client. Upon receipt of the manifest file during a step 305, the client is informed of the association between temporal segments of the media contents and HTTP addresses. Also, the manifest file provides the client with the information concerning the content of the media presentation (interleaved audio/video in the present example). The information may include the resolution, the bit-rate etc.
Based on the information received, the HTTP client module 311 of client can emit HTTP requests 306 for downloading temporal segments of the media content described in the manifest file. The server's HTTP responses 307 convey the requested temporal segments. The HTTP client module 311 extracts from the responses the temporal media segments and provides them to the input buffer 307 of the media engine 312. Finally, the media segments can be decoded and displayed during respective steps 308 and 309.
The media engine 312 interacts with the DASH control engine 313 in order to have the requests for next temporal segments to be issued at the appropriate time. The next segment is identified from the manifest file. The time at which the request is issued depends on whether or not the reception buffer 307 is full. The DASH control engine 313 controls the buffer in order to prevent it from being overloaded or completely empty.
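The buffer-driven timing described above can be sketched as follows; the class name, the target value and the exact policy are illustrative assumptions, not part of the DASH standard:

```python
# Sketch of a DASH control engine deciding when the next segment request
# should be issued, based on the fullness of the reception buffer.
# BUFFER_TARGET and the policy below are assumed values for illustration.

BUFFER_TARGET = 30.0  # seconds of media to keep buffered (assumption)

class DashController:
    """Issues the next segment request only when the buffer has room,
    so the buffer is neither overloaded nor allowed to run empty."""

    def __init__(self, segment_duration):
        self.segment_duration = segment_duration
        self.buffered = 0.0  # seconds of media currently buffered

    def on_segment_received(self):
        self.buffered += self.segment_duration

    def on_playback_tick(self, elapsed):
        # Playback consumes buffered media over time.
        self.buffered = max(0.0, self.buffered - elapsed)

    def should_request_next(self):
        # Request while the next segment still fits under the target.
        return self.buffered + self.segment_duration <= BUFFER_TARGET
```

A client loop would call `should_request_next()` each tick and emit the HTTP request for the next segment identified in the manifest whenever it returns true.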
The generation of the media presentation and the manifest file is described with reference to FIG. 4a. During steps 400 and 401, audio and video data are acquired. Next, the audio data are compressed during a step 402. For example, the MP3 standard can be used. Also, the video data are compressed in parallel during a step 403. Video compression algorithms such as MPEG4, MPEG/AVC, SVC, HEVC or scalable HEVC can be used. Once compression of audio and video data has been performed, audio and video elementary streams 404, 405 are available. The elementary streams are encapsulated during a step 406 into a global media presentation. For example, the ISO BMFF standard (or the extension of the ISO BMFF standard to AVC, SVC, HEVC, scalable extension of HEVC etc.) can be used for describing the content of the encoded audio and video elementary streams as a global media presentation. The encapsulated media presentation 407 thereby obtained is used for generating, during a step 408, an XML manifest file 409. Several representations of video data 401 and audio data 400 can be acquired, compressed, encapsulated and described in the media presentation 407.
For the specific case of MPEG/DASH streaming protocol illustrated in FIG. 4b, the manifest file is called “Media Presentation Description” (or “MPD” file). The root element of the file is the MPD element that contains attributes applying to all the presentation plus DASH information like profile or schema. The media presentation is split into temporal periods represented by a Period element. The MPD file 410 contains all the data related to each temporal period. By receiving this information, the client is aware of the content for each period of time. For each Period 411, AdaptationSet elements are defined.
A possible organization is to have one or more AdaptationSet elements per media type contained in the presentation. An AdaptationSet 412 related to video contains information about the different possible representations of the encoded videos available at the server. Each representation is described in a Representation element. For example, a first representation can be a video encoded with a spatial resolution of 640×480 and compressed with a bit rate of 500 kbits/s. A second representation can be the same video but compressed with a bit rate of 250 kbits/s.
Each video can then be downloaded by HTTP requests if the client knows the HTTP addresses related to the video. The association between the content of each representation and the HTTP addresses is done by using an additional level of description: the temporal segments. Each video representation is split into temporal segments 413 (typically lasting a few seconds). Each temporal segment comprises content stored at the server that is accessible via an HTTP address (a URL or a URL with one byte range). Several elements can be used for describing the temporal segments in the MPD file: SegmentList, SegmentBase or SegmentTemplate.
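Where the manifest carries template information rather than an explicit segment list, the client rebuilds the list itself. A minimal sketch, assuming the standard $RepresentationID$ and $Number$ placeholders of MPEG-DASH templates (the helper function itself is illustrative):

```python
# Hedged sketch: expand a SegmentTemplate-style pattern into an explicit
# list of segment URLs. Placeholder names follow MPEG-DASH; the helper
# and its arguments are assumptions for illustration.

def expand_template(template, rep_id, start_number, count):
    """Rebuild the explicit list of temporal segment URLs."""
    urls = []
    for n in range(start_number, start_number + count):
        url = template.replace("$RepresentationID$", rep_id)
        url = url.replace("$Number$", str(n))
        urls.append(url)
    return urls
```

For example, `expand_template("video/$RepresentationID$/seg-$Number$.mp4", "R1", 1, 3)` yields the URLs of the first three segments of representation R1.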
In addition, a specific segment is available: the initialization segment. The initialization segment contains MP4 initialization information (if the video has been encapsulated using the ISO BMFF or extensions thereof) that describes the encapsulated video stream. For example, it helps the client to instantiate the decoding algorithms related to the video.
The HTTP addresses of the initialization segment and the media segments are indicated in the MPD file.
In FIG. 5, there is shown an exemplary MPD file. Two media are described in the MPD file shown. The first one is an English audio stream and the second one is a video stream. The English audio stream is introduced using the AdaptationSet tag 500. Two alternative representations are available for this audio stream:
- the first representation 501 is an MP4 encapsulated elementary audio stream with a bit-rate of 64000 bits/sec. The codec to be used for handling this elementary stream (after MP4 parsing) is defined in the standard by the attribute codecs having the value ‘mp4a.0x40’. The stream is accessible via a request at the address formed by the concatenation of the BaseURL elements in the segment hierarchy: <BaseURL>7657412348.mp4</BaseURL> is a relative URI, and the BaseURL defined at the top level in the MPD element, ‘http://cdn1.example.com/’ or ‘http://cdn2.example.com/’ (two servers are available for streaming the same content), provides the absolute part of the URI. The client can thus request the English audio stream at the address ‘http://cdn1.example.com/7657412348.mp4’ or at the address ‘http://cdn2.example.com/7657412348.mp4’;
- the second representation 502 is an MP4 encapsulated elementary audio stream with a bit-rate of 32000 bits/sec. The same explanations as for the first representation 501 apply, and the client device can thus request this second representation 502 at either one of the following addresses: ‘http://cdn1.example.com/3463646346.mp4’ or ‘http://cdn2.example.com/3463646346.mp4’.
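The BaseURL concatenation described above amounts to standard relative-URI resolution; a short sketch using Python's urllib.parse (the helper name is an assumption, the server names come from the example):

```python
# Sketch: combine each top-level BaseURL with the relative segment URI,
# as a DASH client would before issuing its HTTP requests.
from urllib.parse import urljoin

def resolve_segment_url(base_urls, relative):
    """Return the absolute URL candidates for one segment."""
    return [urljoin(base, relative) for base in base_urls]
```

With the two servers of the example, `resolve_segment_url(["http://cdn1.example.com/", "http://cdn2.example.com/"], "7657412348.mp4")` produces the two absolute addresses the client may request.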
The adaptation set 503 related to the video contains six representations. These representations contain videos with different spatial resolutions (320×240, 640×480, 1280×720) and with different bit rates (from 256000 to 2048000 bits per second). For each of these representations, a respective URL is associated through a BaseURL element. The client can therefore choose between these alternative representations of the same video according to different criteria like estimated bandwidth, screen resolution etc. (Note that, in FIG. 5, the decomposition of the Representation into temporal segments is not illustrated, for the sake of clarity.)
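The choice between these alternative representations can be sketched as follows; the bitrates and resolutions mirror the example above, while the selection policy itself is an assumed simple heuristic, not mandated by DASH:

```python
# Illustrative representation selection for the adaptation set 503.
# The list mirrors three of the six example representations; the policy
# (highest bitrate that fits the estimated bandwidth) is an assumption.

REPRESENTATIONS = [
    {"width": 320,  "height": 240, "bandwidth": 256000},
    {"width": 640,  "height": 480, "bandwidth": 512000},
    {"width": 1280, "height": 720, "bandwidth": 2048000},
]

def select_representation(available_bps, representations=REPRESENTATIONS):
    """Pick the highest-bitrate representation fitting the estimated
    bandwidth, falling back to the lowest-bitrate one otherwise."""
    fitting = [r for r in representations if r["bandwidth"] <= available_bps]
    if not fitting:
        return min(representations, key=lambda r: r["bandwidth"])
    return max(fitting, key=lambda r: r["bandwidth"])
```

A real client would combine this with other criteria from the MPD, such as screen resolution.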
FIG. 5a shows the standard behavior of a DASH client. FIG. 5b shows a tree representation of an exemplary manifest file (description file or MPD) used in the method shown in FIG. 4a. 
When starting a streaming session, a DASH client starts by requesting the manifest file (step 550). After waiting for the server's response and receiving the manifest file (step 551), the client analyzes the manifest file (step 552), selects a set ASij of AdaptationSets suitable for its environment (step 553), then selects, within each AdaptationSet ASij, a Representation in the MPD suitable for example for its bandwidth, decoding and rendering capabilities (step 554).
The DASH client can then build in advance the list of segments to request, starting with initialization information for the media decoders. This initialization segment has to be identified in the MPD (step 555) since it can be common to multiple representations, adaptation sets and periods or specific to each Representation or even contained in the first media segment.
The client then requests the initialization segment (step 556). Once the initialization segment is received (step 557), the decoders are initialized (step 558).
The client then requests first media data on a segment basis (step 560) and buffers a minimum data amount (thanks to the condition at step 559) before actually starting decoding and displaying (step 563). These multiple requests/responses between the MPD download and the first displayed frames introduce a startup delay in the streaming session. After these initial steps, the DASH streaming session continues in a standard way, i.e. the DASH client adapts and requests the media segments one after the other.
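The startup sequence (steps 550 to 563) can be compressed into the following sketch; fetch, parse_mpd, init_decoder and decode are hypothetical stand-ins for the client's HTTP and media modules, and the minimum buffer size is an assumed value:

```python
# Hedged sketch of the DASH client startup sequence of FIG. 5a.
# All callables are injected stand-ins, not a real DASH API.

MIN_BUFFER_SEGMENTS = 3  # assumed minimum buffered before playback

def start_session(fetch, mpd_url, parse_mpd, init_decoder, decode):
    mpd = parse_mpd(fetch(mpd_url))               # steps 550-552
    rep = mpd["selected_representation"]          # steps 553-554
    init_decoder(fetch(rep["init_segment_url"]))  # steps 555-558
    buffered = []
    for url in rep["segment_urls"]:               # steps 559-560
        buffered.append(fetch(url))
        if len(buffered) >= MIN_BUFFER_SEGMENTS:
            break
    for seg in buffered:                          # step 563
        decode(seg)
    return len(buffered)
```

Each `fetch` call here stands for one HTTP request/response round-trip, which makes the startup delay discussed below visible in the structure of the code.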
The current DASH version does not provide description of Region-Of-Interest within the manifest files. Several approaches have been proposed for such description.
In particular, components of media contents can be described using SubRepresentation elements. These elements describe the properties of one or several components that are embedded in a Representation. In FIG. 6, there is shown an example of a DASH manifest file describing tile tracks as components of a video. For the sake of conciseness and clarity, only one Period 600 is represented. However, subsequent Period elements would be organized in the same fashion. In part 601, a first adaptation set element is used for describing a base layer of the scalable video. For example, the video is encoded according to SVC or scalable HEVC. In part 602, a second adaptation set is used for describing the highest resolution layer of the scalable video. For non-scalable video, only the second adaptation set 602 would be present, without dependency to the base layer, i.e. without the dependencyId attribute. In this second adaptation set 602, a single representation 603 is described, namely the one that corresponds to the displayable video. The representation is described as a list of segments 610 with respective URLs for client requests.
Thus, the representation depends on another representation identified by ‘R1’ (dependencyId attribute), actually the base layer representation from the first adaptation set 601. The dependency forces the streaming client to first request the current segment for the base layer before getting the current segment for the enhancement layer. This mechanism cannot be used to express dependencies with respect to tile tracks because the tracks that would be referenced this way would be automatically loaded by the client. This is something to be avoided, since it is up to the user to select the tiles of interest at any time during the media presentation. Therefore, in order to indicate the dependencies between the composite track and the tile tracks, the SubRepresentation element is used. The displayable video is described as a list of sub-representations 604 to 608. Each sub-representation actually represents a track in the encapsulated MP4 file. Thus, there is one sub-representation per tile (four tiles in the present example) plus one sub-representation for the composite track 608. Each sub-representation is described by a content component element 614 to 618 in order to indicate whether it corresponds to a tile track 614, 615, 616 and 617 or to the composite track 618. The Role descriptor type available in DASH/MPD is used with a specific scheme for tiling. The Role descriptor also indicates the position of the tile in the full-frame video. For example, the component 614 describes the tile located at the top left of the video (1:1 for first in row and first in column). The dimensions of the tiles, width and height, are specified as attributes of the sub-representation as made possible by MPD. Bandwidth information can also be put here for helping the DASH client in the determination of the number of tiles and the selection of the tiles, according to its bandwidth.
Concerning the composite track, it has to be signalled differently from the tile tracks since, at the end of the download, it must be possible to build a video stream that can be decoded. To that purpose, two elements are added into the description. Firstly, the descriptor in the related content component 618 indicates that it is the main component among all the components. Secondly, in the sub-representation, a new attribute ‘required’ is added in order to indicate to the client that the corresponding data have to be requested. All requests for the composite track or for one or more of the tile tracks are computed from the URL provided in the segment list 610 (one per time interval). In the example, “URL_X” combined with “BaseURL” at the beginning of the MPD provides a complete URL which the client can use for performing an HTTP GET request. With this request, the client would get the data for the composite track and all the data for all the tile tracks. In order to optimize the transmission, instead of that request, the client can first request the segment index information (typically the “ssix” and/or “sidx” information in ISO BMFF, well known to the person skilled in the art), using the data available from the index_range attribute 620. This index information makes it possible to determine the byte ranges for each of the components. The DASH client can then send as many HTTP GET requests with appropriate byte ranges as there are selected tracks (including the required composite track).
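The index-based optimization can be sketched as follows; the index layout (one byte range per component) is a simplification of the real ‘sidx’/‘ssix’ boxes, and all names are illustrative:

```python
# Hedged sketch: turn segment index information into one byte-range
# request per selected track. The index mapping is a simplified stand-in
# for the information recovered from 'sidx'/'ssix' boxes.

def build_range_requests(url, index, selected_tiles, composite="composite"):
    """index maps component name -> (first_byte, last_byte).
    The composite track is marked 'required', so it is always included."""
    wanted = set(selected_tiles) | {composite}
    requests = []
    for name in sorted(wanted):
        first, last = index[name]
        # One HTTP GET with a Range header per selected component.
        requests.append((url, "bytes=%d-%d" % (first, last)))
    return requests
```

Each returned pair stands for one HTTP GET on the segment URL with the given Range header, so the client downloads only the composite track and the tiles the user selected.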
When starting a streaming session, a DASH client requests the manifest file. Once received, the client analyzes the manifest file, selects a set of AdaptationSets suitable for its environment. Next, the client selects in the MPD, within each AdaptationSet, a Representation compatible with its bandwidth, decoding and rendering capabilities. Next, it builds in advance the list of segments to be requested, starting with initialization information for the media decoders. When initialization information is received by the decoders, they are initialized and the client requests first media data and buffers a minimum data amount before actually starting the display.
These multiple requests/responses may introduce delay in the startup of the streaming session. The risk is for service providers to see their clients leaving the service without starting to watch the video. The time between the initial HTTP request for the first media data chunk, performed by the client, and the time when the media data chunk actually starts playing is commonly called the start-up delay. It depends on the network round-trip time but also on the size of the media segments.
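A rough model of this start-up delay, assuming one round-trip per sequential request (manifest, initialization segment, first media segments) plus the transfer time of the initially buffered data, purely for illustration:

```python
# Illustrative start-up delay model: the formula and its inputs are
# assumptions, not measurements from any particular deployment.

def startup_delay(rtt, n_requests, buffered_bytes, bandwidth_bps):
    """Delay = sequential round-trips + transfer time of buffered data."""
    transfer = 8.0 * buffered_bytes / bandwidth_bps  # bytes -> bits
    return n_requests * rtt + transfer
```

For instance, five sequential requests over a 100 ms round-trip link, buffering 125 kB at 1 Mbit/s, gives 0.5 s of round-trips plus 1.0 s of transfer, i.e. a 1.5 s start-up delay. This is why reducing the number of round-trips, e.g. with server push, matters on high-latency networks.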
Server Push is a useful feature for decreasing web resource loading time. Such servers are discussed with reference to FIGS. 1a to 1e. 
In FIG. 1b, there is shown that in HTTP/2 exchanges, a request must be sent for every resource needed: resources R1 to R4 and sub-resources A to I (as shown in FIG. 1a). However, when using the push feature by servers, as illustrated in FIG. 1c, the number of requests is limited to elements R1 to R4. Elements A to I are “pushed” by the server to the client based on the dependencies shown in FIG. 1a, thereby making the associated requests unnecessary.
Thus, as illustrated in FIGS. 1b and 1c, when servers use the push feature, the number of HTTP round-trips (request+response) necessary for loading a resource with its sub-resources is reduced. This is particularly interesting for high-latency networks such as mobile networks.
HTTP is the protocol used for sending web resources, typically web pages. HTTP implies a client and a server:
- the client sends a request to the server;
- the server replies to the client's request with a response that contains a representation of the web resource.
Requests and responses are messages comprising various parts, notably the HTTP headers. An HTTP header comprises a name along with a value. For instance, “Host: en.wikipedia.org” is the “Host” header, and its value is “en.wikipedia.org”. It is used for indicating the host of the resource queried (for instance, the Wikipedia page describing HTTP is available at http://en.wikipedia.org/wiki/HTTP). HTTP headers appear on client requests and server responses.
HTTP/2 makes it possible to exchange requests/responses through streams. A stream is created inside an HTTP/2 connection for every HTTP request and response. Frames are exchanged within a stream in order to convey the content and headers of the requests and responses.
HTTP/2 defines a limited set of frames with different meanings, such as:
- HEADERS: provided for transmission of HTTP headers;
- DATA: provided for transmission of HTTP message content;
- PUSH_PROMISE: provided for announcing pushed content;
- PRIORITY: provided for setting the priority of a stream;
- WINDOW_UPDATE: provided for updating the value of the flow control window;
- SETTINGS: provided for conveying configuration parameters;
- CONTINUATION: provided for continuing a sequence of header block fragments;
- RST_STREAM: provided for terminating or cancelling a stream.
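On the wire, each of these frames is identified by a one-byte type code; the codes below are those assigned by RFC 7540:

```python
# HTTP/2 frame type codes for the frames listed above, as assigned by
# RFC 7540 (only the frames mentioned in the text are included).
from enum import IntEnum

class FrameType(IntEnum):
    DATA = 0x0
    HEADERS = 0x1
    PRIORITY = 0x2
    RST_STREAM = 0x3
    SETTINGS = 0x4
    PUSH_PROMISE = 0x5
    WINDOW_UPDATE = 0x8
    CONTINUATION = 0x9
```

A frame parser dispatches on this code after reading the nine-byte frame header.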
Push by servers has been introduced in HTTP/2 for allowing servers to send unsolicited web resource representations to clients. Web resources such as web pages generally contain links to other resources, which themselves may contain links to other resources. To fully display a web page, all the linked and sub-linked resources generally need to be retrieved by a client. This incremental discovery may lead to a slow display of a web page, especially on high latency networks such as mobile networks.
When receiving a request for a given web page, the server may know which other resources are needed for the full processing of the requested resource. By sending the requested resource and the linked resources at the same time, the server reduces the load time of the web page. Thus, using the push feature, a server may send additional resource representations along with the response to a request for a given resource.
With reference to the flowchart of FIG. 1e, an exemplary mode of operation of a server implementing the push feature is described.
During step 100, the server receives an initial request. Next, the server identifies during step 101 the resources to push as part of the response and starts sending the content response during step 102. In parallel, the server sends push promise messages to the client during step 103. These messages identify the other resources that the server is planning to push, for instance based on the dependencies shown in FIG. 1a. These messages are sent in order to let the client know in advance which pushed resources will be sent. In particular, this reduces the risk that a client sends a request for a resource that is being pushed at the same time or about to be pushed. In order to further reduce this risk, a server should send a push promise message before sending any part of the response referring to the resource described in the push promise. This also allows clients to request cancellation of the push of the promised resources if clients do not want those resources. Next, the server sends the response and all promised resources during step 104. The process ends during a step 105.
The flowchart of FIG. 1d illustrates the process on the client side.
When the client has identified a resource to retrieve from the server, it first checks during a step 106 whether or not the corresponding data is already in its cache memory. In case the resource is already in the cache memory (Yes), it is retrieved from the cache during a step 107. Cached data may be either data retrieved from previous requests or data that were pushed by the server previously. In case it is not in the cache memory (No), the client sends a request during step 108 and waits for the server's response. Upon receipt of a frame from the server, the client checks during step 109 whether or not the frame corresponds to a push promise. If the frame corresponds to a push promise (Yes), the client processes the push promise during step 110. The client identifies the resource to be pushed. If the client does not wish to receive the resource, the client may send an error message to the server so that the server does not push that resource. Otherwise, the client stores the push promise until receiving the corresponding pushed content. The push promise is used so that the client does not request the promised resource while the server is pushing it. In case the frame does not correspond to a push promise (No), it is checked, during step 111, whether or not the frame is a data frame related to push data. In case it is related to push data (Yes), the client processes the pushed data during step 112. The pushed data are stored within the client cache. In case the frame is not a data frame related to push data (No), it is checked, during step 113, whether it corresponds to a response received from the server. In case the frame corresponds to a response from the server (Yes), the response is processed during step 114 (e.g. sent to the application). Otherwise (No), it is checked during step 115 whether or not the frame identifies the end of a response. If so (Yes), the process is terminated during step 116. Otherwise, the process goes back to step 109.
Thus, it appears that the client receives the response and the promised resources. The promised resources are therefore generally stored in the client cache while the response is used by the application such as a browser displaying a retrieved web page. When a client application requests one of the resources that were pushed, the resource is immediately retrieved from the client cache, without incurring any network delay.
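The client-side dispatch of steps 109 to 116 can be sketched as a small handler; the frame dictionaries, the cache and the promise set are illustrative stand-ins, not an HTTP/2 API:

```python
# Hedged sketch of the client-side frame dispatch of FIG. 1d.
# Frames are modelled as plain dictionaries for illustration.

def handle_frame(frame, cache, promises, responses):
    """Process one frame; returns False when the response has ended."""
    kind = frame["kind"]
    if kind == "push_promise":                  # step 110
        # Remember the promise so the resource is not requested twice.
        promises.add(frame["resource"])
    elif kind == "push_data":                   # step 112
        # Pushed data is stored in the client cache for later use.
        cache[frame["resource"]] = frame["data"]
        promises.discard(frame["resource"])
    elif kind == "response":                    # step 114
        responses.append(frame["data"])
    elif kind == "end":                         # step 116
        return False
    return True
```

A receive loop would call this handler for each incoming frame until it returns False, which corresponds to looping back to step 109 in the flowchart.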
The storage of pushed resources in the cache is controlled using the cache control directives. The cache control directives are also used for controlling the caching of responses. These directives are in particular applicable to proxies: any resource, pushed or not, may be stored by proxies or by the client only.
FIG. 1a is a graph of a set of resources owned by a server with their relationships. The set of resources is intertwined: R1, R2, R3, and R4 are resources that need to be downloaded together to be properly processed by a client. In addition, sub-resources A to I are defined. These sub-resources are related to 1, 2 or 3 resources. For instance, A is linked to R1 and C is linked to R1, R2 and R4.
FIG. 1b, already discussed hereinabove, shows an HTTP exchange without using the server PUSH feature: the client requests R1, next it discovers R2, A, B, C and D and requests them. After receiving them, the client requests R3, R4, F and G. Finally, the client requests sub-resources H and I. This requires four round-trips to retrieve the whole set of resources.
FIG. 1c, already discussed hereinabove, illustrates the HTTP exchange using the feature of pushing directly connected sub-resources by the server. After requesting R1, the server sends R1 and pushes A, B, C and D. The client identifies R2 and requests it. The server sends R2 and pushes F and G. Finally the client identifies R3, R4 and requests these resources. The server sends R3, R4 and pushes H and I. This process requires three round-trips to retrieve the whole set of resources.
In order to decrease the loading time of a set of resources, typically a web page and its sub-resources, HTTP/2 allows exchanging multiple requests and responses in parallel. As illustrated in FIG. 2, a web page may require the download of several resources, like JavaScript files, images etc. During an initial HTTP exchange 200, the client retrieves an HTML file. This HTML file contains links to two JavaScript files (JS1, JS2), two images (IMG1, IMG2), one CSS file and one HTML file. During an exchange 201, the client sends a request for each file. The order given in the exchange 201 of FIG. 2 is based on the web page order: the client sends a request as soon as a link is found. The server then receives requests for JS1, CSS, IMG1, HTML, IMG2 and JS2 and processes these requests in that order. The client then retrieves these resources in that order.
HTTP priorities make it possible for the client to state which requests are more important and should be treated sooner than other requests. A particular use of priorities is illustrated in exchange 202. JavaScript files are assigned the highest priority. CSS and HTML files are assigned medium priority and images are assigned low priority. This approach allows receiving blocking files or files that may contain references to other resources sooner than other files. In response, the server is expected to try sending sooner the JavaScript files, the CSS and HTML files afterwards and the images at the end, as described in exchange 202. Servers are not mandated to follow client priorities.
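The priority policy of exchange 202 can be sketched as a simple ordering; the numeric weights are assumptions (higher means more important) and are not HTTP/2 wire values:

```python
# Illustrative priority policy from exchange 202: JavaScript first,
# CSS/HTML next, images last. Weights are assumed, not HTTP/2 values.

PRIORITY_BY_TYPE = {"js": 3, "css": 2, "html": 2, "img": 1}

def order_requests(resources):
    """Sort (name, type) requests so higher-priority files come first.
    Python's sort is stable, so equal-priority files keep page order."""
    return sorted(resources, key=lambda r: PRIORITY_BY_TYPE[r[1]],
                  reverse=True)
```

As the text notes, this ordering only expresses the client's preference: the server is expected, but not mandated, to honour it.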
In addition to priorities, HTTP/2 provides that the amount of data being exchanged simultaneously can be controlled. Client and server can specify the amount of data they can buffer on a per-connection basis and on a per-stream basis. This is similar to TCP congestion control: a window size, which specifies an available buffer size, is initialized to a given value; each time the emitter sends data, the window size is decremented; the emitter must stop sending data so that the window size never goes below zero. The receiver receives the data and sends messages to acknowledge that the data was received and removed from the buffer; each message contains the amount of data that was removed from the buffer; the window size is then increased by that amount and the emitter can resume sending data.
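The window accounting just described can be sketched as follows; the class is a hypothetical sender-side model, not an HTTP/2 implementation:

```python
# Hedged sketch of window-based flow control on the sender side:
# decrement on send, increment when the receiver acknowledges data
# removed from its buffer.

class FlowWindow:
    def __init__(self, size):
        self.size = size  # remaining buffer space at the receiver

    def can_send(self, nbytes):
        return nbytes <= self.size

    def send(self, nbytes):
        if not self.can_send(nbytes):
            raise ValueError("window would go below zero; emitter must wait")
        self.size -= nbytes

    def ack(self, nbytes):
        # The receiver removed nbytes from its buffer: the window grows
        # by that amount and sending may resume.
        self.size += nbytes
```

In HTTP/2 one such window exists per connection and one per stream; a frame may be sent only when it fits in both.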
In view of the above, it appears that DASH is based on the assumption that the client leads the streaming since the client can generally select the best representation of the content for the purpose of the application it is performing. For instance, a client may know whether to request High-Definition or Small-Definition content based on its form-factor and screen resolution.
Server-based streaming is typically done using RTP. Contrary to DASH, RTP does not use HTTP and cannot directly benefit from the web infrastructures, in particular proxies and caches. Web socket based media streaming has the same drawbacks. With HTTP/1.1, server-based streaming cannot be easily implemented since the server can generally only respond to client requests. With HTTP/2, in particular with the introduction of the push feature, DASH-based servers can lead the streaming. Thus, servers can use their knowledge of the characteristics of the content they are streaming for optimizing the user experience. For instance, a server may push a film as SD (due to limited bandwidth) but advertisements as HD since advertisements take only an additional limited amount of bandwidth. Another example is the case of a server that performs a fast start with a low-resolution video and switches to the best possible representation once the bandwidth is well estimated.
In order to enable a server to lead the streaming, one approach is to let the server push data (in particular DASH data) as preferred. The client then uses whatever data is available to display the video. The server typically announces the push of several segments at once. The server then sends the segments in parallel or successively.
A problem that occurs is that client and server may not know if the promised data will be transmitted and received at the desired time: the client may not know when and in which order the video segments will be sent.
Also, the promised data pushed or announced by the server may not match the client's needs, thus leading to wasted resources, in particular at the server end.
Thus, there is a need for enhancing data streaming especially in the context of DASH-based communications.