Video compression algorithms, such as those standardized by the ITU, ISO, and SMPTE, exploit the spatial and temporal redundancies of the images in order to generate bitstreams that are smaller than the original video sequences. Such compression makes the transmission and/or the storage of the video sequences more efficient.
Most video compression schemes, such as the MPEG-1, MPEG-2, MPEG-4, H.261, H.263, H.264 or H.265 formats, take advantage of the so-called “temporal” redundancies between several successive images of the same sequence. Thus, most of the images are coded relative to one or more reference images by matching similar blocks and then coding the prediction error. This prediction is commonly referred to as temporal or “Inter” prediction.
In the case of the MPEG-2 format for example, images of I type (I for Intra) are encoded without reference to other images of the sequence. Thus, when all the compressed data of such an image are available, a decoder may decode and display that image immediately. An image of I type thus constitutes a conventional point of access to the video sequence. It is to be noted that, conventionally, these images of I type are presented periodically, with a period of the order of several tenths of a second to a few seconds. In the case of the H.264 format, these images are denoted “IDR” or “SI”.
The MPEG-2 format also implements images of P type (prediction on the basis of the preceding image of I or P type) or B type (bi-directional prediction on the basis of preceding and following images of P or I type), which are encoded by prediction relative to one or more reference images. The compressed data for such images (i.e. the data coding the prediction errors) are not sufficient on their own to obtain an image that can be displayed, because the data of the reference images used for the prediction must also be obtained. Images of P type and B type do not therefore constitute efficient points of access to the video sequence.
The temporal prediction mechanism consequently proves to be extremely efficient in terms of compression, but imposes constraints on video decoders that wish to reconstruct the images of a sequence properly, in particular by limiting temporal random access to the compressed video sequence to the images of I type only.
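The random-access constraint can be illustrated with a minimal Python sketch. It assumes a hypothetical list of image types given in decoding order and a closed group of pictures, so that every reference needed by an image lies between the last I image and that image; the function name and the sample sequence are illustrative only.

```python
from typing import Sequence


def nearest_access_point(frame_types: Sequence[str], target: int) -> int:
    """Return the index of the last I image at or before `target`."""
    for index in range(target, -1, -1):
        if frame_types[index] == "I":
            return index
    raise ValueError("no I image precedes the target image")


# Hypothetical sequence of image types, listed in decoding order.
sequence = ["I", "P", "B", "B", "P", "B", "B", "I", "P", "B"]

target = 5
start = nearest_access_point(sequence, target)
# Upper bound on the images to decode before `target` can be displayed.
print(start, target - start + 1)  # 0 6
```

Under these assumptions, displaying the image at index 5 requires decoding from the I image at index 0, whereas the image at index 8 only requires starting from index 7.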
In addition to exploiting temporal redundancies, video coders also take advantage of so-called “spatial” redundancies within the same image. To do so, each image is decomposed into spatial units (blocks or macroblocks), and a block may be predicted from one or more of its spatially neighboring blocks, which is commonly referred to as spatial or “Intra” prediction.
This mechanism, when applied in particular to the Intra images referred to previously, also notably improves the compression of a video sequence. However, it introduces dependencies between the blocks, which complicates the extraction of only a spatial part of the sequence.
To mitigate this drawback, certain coding schemes such as H.264 provide an organization of the blocks into mutually independent packets or “slices”, the blocks of a packet having no spatial dependencies with blocks outside that packet. The organization into packets relies on a technique known as FMO for “Flexible Macroblock Ordering”.
These packets are very often signaled by markers enabling a decoder to obtain access thereto without performing decoding and complete reconstruction of the image, and in particular of the blocks which precede them in the image. Each packet or slice thus constitutes a point of “access” to the video sequence or of spatial synchronization on the basis of which the decoder has no difficulty in performing decoding independently of the other packets.
Nevertheless, for these blocks, temporal dependencies may remain if the image is coded with reference to one or more other images. Thus, the accumulation of the temporal and spatial predictions means that generally the extraction of a spatio-temporal part of a video sequence, that is to say a spatial portion during a temporal section of several consecutive images of the sequence, is a complex operation.
The extraction of a spatio-temporal part from a video sequence is therefore currently the subject of extensive development work.
The W3C (“World Wide Web Consortium”, an organization producing standards for the Web) is working on the development of a mechanism for addressing temporal segments or spatial regions in resources that are available on the Web such as video sequences, by using in particular URIs (“Uniform Resource Identifiers”) making it possible to identify, via a string, a physical or abstract resource.
This mechanism, which is independent of the format in which the resource is represented, is termed “Media Fragments”.
The RFC (“Request For Comments”) memorandum number 3986 defines a syntax for the URIs, and integrates in particular the concepts of “fragment” and of “queries” or requests. In this context, a fragment is in particular a part, a subset, a view or a representation of a primary resource.
The “Media Fragments” addressing gives access to sub-parts of an audio or video stream, or to regions within images, by adding parameters to the request after the URI address. It makes it possible, for example, to address:
temporal segments (or “temporal fragments”) defined by initial and terminal times: t=00:01:20,00:02:00 identifying the segment from 1 min 20 s to 2 min 00 s; and/or
spatial regions (or “spatial fragments”) defined by a generally rectangular viewing region: xywh=10,10,256,256 specifying the upper left corner (10, 10), the width (256) and the height (256) of the rectangle; and/or
substreams (or “track fragments”), for example a particular audio track associated with a video: track=‘audio_fr’; and/or
passages (or “named fragments”) pre-defined via an identifier, a scene of a film for example: id=‘the_scene_of_the_kiss’.
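By way of illustration, the following Python sketch shows how a client or server might split such a fragment into its dimensions. It is a simplified reading of the syntax and not a complete implementation: the function names and the example URI are hypothetical, the dimensions are assumed to be joined by “&”, times are assumed to be plain clock values, and the optional unit prefix of xywh is assumed to be written with a colon (pixel: or percent:).

```python
from urllib.parse import urlsplit


def to_seconds(clock: str) -> float:
    """Convert 'hh:mm:ss', 'mm:ss' or plain seconds into seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_media_fragment(uri: str) -> dict:
    """Extract the t, xywh, track and id dimensions from a URI fragment."""
    result = {}
    for pair in urlsplit(uri).fragment.split("&"):
        name, _, value = pair.partition("=")
        if name == "t":
            start, _, end = value.partition(",")
            result["t"] = (to_seconds(start) if start else 0.0,
                           to_seconds(end) if end else None)
        elif name == "xywh":
            # An absent unit prefix is treated as pixels (assumption).
            unit, _, numbers = value.rpartition(":")
            x, y, w, h = (int(n) for n in numbers.split(","))
            result["xywh"] = (unit or "pixel", x, y, w, h)
        elif name in ("track", "id"):
            result[name] = value
    return result


# Hypothetical URI combining three of the dimensions listed above.
parsed = parse_media_fragment(
    "http://www.example.org/my_video.mp4"
    "#t=00:01:20,00:02:00&xywh=10,10,256,256&track=audio_fr"
)
print(parsed)
```

For this URI the temporal dimension is returned as the pair (80.0, 120.0) in seconds, and the spatial dimension as a pixel rectangle.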
In addition to the syntax of the fragments/requests used for addressing, the same working group is in the course of producing a client-server communication protocol based on HTTP (“Hyper Text Transfer Protocol”), the protocol used on the Web.
In particular, the protocol defines the HTTP requests sent by a client wishing to obtain fragments as well as the responses sent by the server responsible for extracting and sending those fragments. Each HTTP request or associated HTTP response is composed of header information and data information. The header information may be considered as description/signaling information (in particular the type of the data exchanged and the identity of the data sent back, that is to say the region finally sent back), whereas the data information corresponds to the spatial and/or temporal fragment of the resource requested by the client.
When the requested fragment can be converted into “byte ranges” either because the client has already received a description of the resource before sending his request, or because the server performs an analysis of the resource before sending it, the exchange of fragments is similar to a conventional exchange of data via HTTP, which makes it possible to exploit cache mechanisms and thereby be fully integrated into a Web architecture.
If on the other hand the fragment cannot be converted into one or more byte ranges belonging to the original resource, transcoding is then necessary at the server, and the new resource so created is sent to the client like any other resource.
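For the case in which the fragment maps onto byte ranges, the exchange can be sketched with a standard HTTP Range header. The Python example below only builds the request and does not send it; the URL and the byte offsets are hypothetical, and how the client learns the offsets (for example from a prior description of the resource) is outside the sketch.

```python
from urllib.request import Request


def build_range_request(url, byte_ranges):
    """Build an HTTP request asking only for the given byte ranges."""
    spec = ",".join(f"{first}-{last}" for first, last in byte_ranges)
    return Request(url, headers={"Range": f"bytes={spec}"})


# Hypothetical offsets assumed to cover the requested fragment.
request = build_range_request(
    "http://www.example.org/my_video.mp4",
    [(0, 1023), (4096, 8191)],
)
print(request.get_header("Range"))  # bytes=0-1023,4096-8191
```

A server that honors such a request answers with status 206 (Partial Content), and the ranges can be cached like any other HTTP response.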
This addressing mechanism and the associated communication protocol are advantageously flexible in that they are independent from the video format used, from the encoding options of the video sequences and from the capacities available on the servers processing such requests.
Furthermore, the setting up of this addressing mechanism and of the associated communication protocol will eventually make it possible to significantly reduce the quantity of information exchanged between a client requesting parts of a video and a server storing that video and supplying those requested parts.
This is because a client that only wishes to view a spatial part of a video sequence no longer needs to download the entirety of the video stream, but only the spatial region concerned, possibly within a desired temporal interval.
For example, the spatial filtering syntax implemented is extremely simple, consisting in indicating in the request the target spatial region desired, generally in the form of a rectangle defined using four parameters (in pixels or in percentage of the entire image):
http://www.example.org/my_video.mp4#xywh=percent;25,25,50,50 defines the target spatial region centered on the image and whose dimensions are half those of the entire image.
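The arithmetic behind this example can be checked with a short Python sketch that converts a percentage region into pixel coordinates. The function name and the 1920x1080 image size are assumptions made for illustration, and values are rounded down to whole pixels. It may be noted that the W3C Recommendation writes the unit prefix with a colon (percent:25,25,50,50).

```python
def percent_region_to_pixels(region, frame_width, frame_height):
    """Convert an (x, y, w, h) region given in percent into pixels."""
    x, y, w, h = region
    return (
        frame_width * x // 100,
        frame_height * y // 100,
        frame_width * w // 100,
        frame_height * h // 100,
    )


# Region of the example above, applied to an assumed 1920x1080 image.
left, top, width, height = percent_region_to_pixels((25, 25, 50, 50), 1920, 1080)
print(left, top, width, height)  # 480 270 960 540
```

The resulting rectangle is half the width and half the height of the image, and its center (480 + 960/2, 270 + 540/2) coincides with the center of the image.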
The portion or “fragment” of the video sequence identified here is said to be “spatial” in that it results from spatial filtering of the original video sequence by the indication of a target spatial region. Of course, other filtering criteria (for example temporal) may be added to this request.
This request is received and processed by a server storing the target video sequence. Theoretically, only the desired portion of the video sequence is exchanged between the server and the client. This makes it possible in particular to reduce the transmission time as well as the bandwidth used over the communication network from the server to the client, on account of the reduced amount of data transmitted.
However, in practice, the servers storing the video sequences have some difficulty in extracting and sending the desired portion filtered from the video sequence, in particular on account of the access difficulties resulting from the temporal and/or spatial dependencies between blocks.
To be precise, the desired portion can only be extracted on its own after heavy processing at the server, requiring considerable resources. This is for example the case when transcoding mechanisms are implemented. It is also the case when the server has to resolve all the prediction links in order to select the exact set of data blocks belonging to the desired portion together with the blocks serving as reference blocks for the prediction.
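The cost of resolving the prediction links can be pictured as a transitive closure over a dependency graph. The Python sketch below assumes a hypothetical representation in which each block is identified by an (image index, block index) pair and a dictionary lists, for each block, the blocks it is predicted from; a real server would first have to parse the bitstream to build such a table.

```python
from collections import deque


def blocks_to_send(requested, references):
    """Return the requested blocks plus every block they depend on."""
    needed = set(requested)
    queue = deque(requested)
    while queue:
        block = queue.popleft()
        for ref in references.get(block, ()):
            if ref not in needed:
                needed.add(ref)
                queue.append(ref)
    return needed


# Hypothetical prediction links: block -> blocks used as its references.
references = {
    (2, 5): [(1, 5), (1, 6)],
    (1, 5): [(0, 5)],
    (1, 6): [(0, 6), (0, 7)],
}

needed = blocks_to_send([(2, 5)], references)
print(len(needed))  # 6
```

In this toy example a single requested block drags in five further blocks spread over two earlier images, which gives an idea of why the selection is expensive on real sequences.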
Such approaches prove to be ill-adapted to direct communications between equipment that has limited resources available, such as camcorders, TV decoders, television sets, mobile telephones, personal digital assistants and the like.
These same difficulties arise for local accesses to a video sequence.
In contrast to obtaining only the desired portion, a more economical approach for the server consists of sending the entirety of the video sequence to the requesting client. However, in this case, the processing operations carried out by the client are very heavy, and generally incompatible with its own resources, in particular for clients having scarce processing resources such as portable terminals. Furthermore, this approach requires a high network bandwidth to ensure the transmission of the data from the server to the client.
In addition to these transmission mechanisms, there are methods for compensating for data losses that may occur during transmission.
In particular, the publication US 2006/050695 describes a method of streaming video data compressed using prediction mechanisms, which provides an improvement in error resilience.
One of the principles set out relies on the transmission, by the streaming server, of a redundant representation of a reference image, which may possibly be partial, when the latter is subject to transmission errors (loss, corruption).
The method consists in identifying the redundant representations which enable the errors suffered to be made up for. This identification depends on feedback from the user identifying the packets not received. Lastly, the redundant representation of smallest size is the one chosen to be transmitted.
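The selection step summarized above can be read as follows. The Python sketch is only one interpretation of that summary and is not taken from the cited publication: the dictionary layout, the packet numbers and the sizes are all hypothetical.

```python
def pick_redundant_representation(candidates, lost_packets):
    """Pick the smallest candidate that covers every lost packet."""
    lost = set(lost_packets)
    usable = [c for c in candidates if lost <= c["covers"]]
    # Returns None when no candidate covers all the lost packets.
    return min(usable, key=lambda c: c["size"], default=None)


# Hypothetical redundant representations of a reference image.
candidates = [
    {"name": "full", "covers": {1, 2, 3, 4}, "size": 9000},
    {"name": "partial_a", "covers": {1, 2}, "size": 3000},
    {"name": "partial_b", "covers": {2, 3}, "size": 2500},
]

choice = pick_redundant_representation(candidates, lost_packets=[1, 2])
print(choice["name"])  # partial_a
```

The choice is driven entirely by the feedback listing the lost packets, which is why the approach is reactive.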
This method is not however suitable where a user wishes to access and obtain a spatial fragment corresponding to a specific spatial region of the video. This is because the method relies on the transmission of the entirety of the video to the user.
Furthermore, the approach regarding the transmission of a redundant representation is purely reactive in that it is directed to correcting erroneous past data (the reference images) which should already have been received. This therefore amounts to processing these data a second time, which leads to a cost increase, both for the server and for the user.