Embodiments of the current invention are related to media streaming and particularly to a system and method to optimize media streaming over one or more IP networks.
In the specification and claims which follow, the expression “media streaming” or “streaming” is intended to mean the transfer of video information (and any associated audio information, if applicable), as known in the art, typically from one or more servers to a plurality of devices (typically called “clients”) located at a distance from the respective servers. As such, terms such as “video content”, “content”, and “media stream” (or abbreviated “stream”) are used interchangeably in the specification and claims which follow hereinbelow to mean video content which is streamed. Typically, a stream comprises a plurality of “packets”, as known in the art and described further hereinbelow.
Other terms used in the specification hereinbelow, which are known in the art, include:
        “Moving Picture Experts Group (MPEG)” is intended to mean a working group of experts, formed by ISO and IEC to set standards for audio and video compression and transmission;
        “MPEG transport stream (TS)” is intended to mean a standard format for transmission and storage of audio, video, and program and system information protocol (PSIP) data. Transport Stream is specified in MPEG-2 Part 1, Systems (formally known as ISO/IEC standard 13818-1 or ITU-T Rec. H.222.0);
        “TS Packet” is intended to mean the basic unit of data in a transport stream;
        “Program Clock Reference (PCR)” is intended to mean a value transmitted, at least once each 100 ms, in the adaptation field of an MPEG-2 transport stream packet. PCR, when properly used, is used to generate a system_timing_clock in a decoder so that the decoder can present synchronized content, such as audio tracks matching the associated video;
        “Presentation timestamp (PTS)” is intended to mean a timestamp metadata field in an MPEG transport stream or MPEG program stream that is used to achieve synchronization of a program's separate elementary streams (e.g., video, audio, subtitles). Reference: https://en.wikipedia.org/wiki/Presentation_timestamp#cite_note-teknotes-1
        “Group of Pictures (GOP)” is intended to mean a GOP structure in video coding (ref https://en.wikipedia.org/wiki/Data_compression#Video), which specifies the order in which intra- and inter-frames are arranged. A GOP is a group of successive pictures within a coded video stream. Each coded video stream consists of successive GOPs. Visible frames are generated from the pictures contained in a GOP;
        “Packetized Elementary Stream (PES)” is intended to mean a specification in MPEG-2 Part 1 (Systems) (ISO/IEC 13818-1) and ITU-T H.222.0 that defines carrying elementary streams (usually the output of an audio or video encoder) in packets within an MPEG program stream and an MPEG TS. The elementary stream is packetized by encapsulating sequential data bytes from the elementary stream inside PES packet headers;
        “Real-time Transport Protocol (RTP)” is intended to mean a standardized packet format for delivering audio and video over IP networks. RTP is used extensively in communication and entertainment systems that involve streaming media, such as telephony, video teleconference applications, television services and web-based push-to-talk features. RTP is used in conjunction with the RTP Control Protocol (RTCP). While RTP carries media streams, RTCP is used to monitor transmission statistics and quality of service (QoS) and aids synchronization of multiple streams. RTP is originated and received on even port numbers and the associated RTCP communication uses the next higher odd port number. RTP was developed by the Audio-Video Transport Working Group of the Internet Engineering Task Force (IETF) and first published in 1996 as RFC 1889, superseded by RFC 3550 in 2003;
        “User Datagram Protocol (UDP)” is intended to mean one of the core members of the Internet Protocol Suite, the set of network protocols used for the Internet. With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an IP network without requiring prior communications to set up special transmission channels or data paths. UDP uses a simple transmission model without implicit handshaking dialogues for providing reliability, ordering, or data integrity. Thus, UDP provides an unreliable service and datagrams may arrive out of order, appear duplicated, or go missing without notice. UDP assumes that error checking and correction is either not necessary or performed in the application, avoiding the overhead of such processing at the network interface level;
        “Forward Error Correction (FEC)” is intended to mean a technique to recover partial or full packet information based on a calculation made on the information. FEC may be effected by means of XOR between packets or another mathematical computation;
        “Pro-MPEG” is intended to mean the Professional-MPEG Forum, an association of broadcasters, program makers, equipment manufacturers, and component suppliers with interests in realizing the interoperability of professional television equipment, according to the implementation requirements of broadcasters and other end-users;
        “SMPTE 2022” is intended to mean an FEC standard for video transport, initially developed by the Pro-MPEG Forum and added to by the Video Services Forum, which describes both an FEC scheme and a way to transport constant bit rate video over IP networks.
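By way of illustration only, FEC effected by means of XOR between packets may be sketched as follows. The sketch assumes packets of equal length and the loss of at most one packet per protected group; the function names (xor_bytes, make_parity, recover_missing) and the sample payloads are hypothetical and are not taken from SMPTE 2022 or any other standard.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings (assumed here to be of equal length)."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: list[bytes]) -> bytes:
    """Build one parity packet as the XOR of every packet in the group."""
    return reduce(xor_bytes, packets)


def recover_missing(survivors: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing packet from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical group of four equal-length packets.
group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = make_parity(group)

# Simulate the loss of the third packet in transit.
survivors = group[:2] + group[3:]
recovered = recover_missing(survivors, parity)
```

In this sketch one parity packet protects four media packets, i.e., 25% extra bandwidth, and a second loss within the same group could not be recovered.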
Media streaming over switching IP networks such as fiber, leased line, CDN, public IP, wireless data networks, VSAT, and cellular networks is a challenging technical problem. A media stream may be impacted by a number of network aberrations (e.g., packet loss, jitter, disorder, and capacity changes, inter alia) that make it difficult to sustain a constant stream from sender to receiver. In parallel to data connectivity growing worldwide, clients want to be able to receive media content on their devices (mobile phones, tablets, TV, PC and similar playing devices) with the best quality and the shortest delay.
Reference is currently made to FIG. 1, which is a prior art block diagram of a media server 15 (also referred to as a “media sending device” or a “sender” hereinbelow and in the claims which follow) connected with a plurality of receiving devices 20 (e.g., mobile devices, smart TVs, inter alia) over a plurality of networks 25 (e.g., mobile and wireless networks, inter alia). Each network and/or device may experience different network impairments and network capacities. For example, a cellular network may be more prone to capacity problems while a wireless network is more prone to packet loss.
There are three main approaches known in the art which address the problem of media streaming over switching IP networks, as described hereinbelow.
    1. Well-managed networks use UDP/RTP with redundant protection information in the form of forward error correction (FEC), which is sent with the media stream and consumes 30-50% extra bandwidth in one direction. This solution has a low time delay; however, it may tolerate neither high packet loss nor a network capacity drop-off.
    2. For small-scale operation, streaming with retransmission protection, also called Automatic Repeat-reQuest (ARQ), may be used. However, ARQ is not useful for large-scale operations. ARQ has modest time delays and may tolerate high packet loss, but it cannot tolerate a network capacity drop.
    3. For large distribution over multiple networks, HTTP adaptive bit rate (HTTP ABR) streaming has become a de facto standard for most over-the-top systems. ABR has a large time delay. Packet loss and network capacity drops are managed by reducing the bit rate to a lower value, which implies large time delays.
Each of the three main approaches listed above is addressed hereinbelow:
UDP/RTP
Media streaming with UDP/RTP is not suited for mobile or mass distribution applications, as these larger-scale networks are not considered “managed”.
ARQ
Another solution, ARQ, is currently offered by several vendors to address 100% recovery of lost packets. ARQ has been found to offer superior performance at lower overhead compared with existing packet loss recovery solutions.
Reference is currently made to FIG. 2, which is a prior art block diagram, similar to that of FIG. 1, showing an ARQ configuration of media streaming based on a method called protection streaming, having a sender 35, connected with a plurality of ARQ receiving devices 40 (also referred to hereinbelow individually as “ARQ receiver”) over a plurality of networks 45 (indicated as Network 1, . . . Network N−1, Network N in the figure), with a respective receiver being fed from a respective network.
Protection streaming performed with the configuration shown in FIG. 2 allows each receiver to be addressed differently, thus allowing for precise packet recovery to be applied to each receiver.
Prior art ARQ systems work with a sender transmitting UDP/RTP packets in a stream over an unmanaged IP-based packet network to several receivers. Packet loss detected by a receiver is reported to the sender using special RTCP messages. Each message may contain one or more different requests. ARQ packet processing is effective when network capacity is larger than the initial media stream bandwidth. As noted previously, the ARQ process allows for packet recovery with retransmission of lost packets. However, if the network capacity (i.e., the maximum bandwidth available for the network) drops below the media stream bandwidth, the ARQ method (i.e., protection streaming) cannot effectively recover lost packets.
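The capacity condition described above may be illustrated with a rough calculation. The sketch below assumes, for simplicity, that packet losses are independent and that every lost packet is retransmitted until it arrives, so that each packet is sent on average 1/(1 − p) times for a loss rate p; the function names and the numeric values are hypothetical.

```python
def arq_required_kbps(stream_kbps: float, loss_rate: float) -> float:
    """Average bandwidth needed to deliver the stream with retransmissions.

    Assumes independent losses: each packet is sent 1 / (1 - loss_rate)
    times on average before it is received.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in the range [0, 1)")
    return stream_kbps / (1.0 - loss_rate)


def arq_can_recover(stream_kbps: float, loss_rate: float,
                    capacity_kbps: float) -> bool:
    """True when the link can carry the stream plus its retransmissions."""
    return arq_required_kbps(stream_kbps, loss_rate) <= capacity_kbps


# Hypothetical 4000 kbps stream with 20% packet loss (about 5000 kbps needed).
ample_link = arq_can_recover(4000, 0.20, capacity_kbps=6000)
# Capacity below the stream bandwidth: recovery is not possible.
constrained_link = arq_can_recover(4000, 0.20, capacity_kbps=3500)
```

Under these assumptions the first link leaves room for retransmissions, whereas the second cannot carry even the original stream, so retransmission only adds to the congestion.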
Reference is currently made to FIG. 3, which is a prior art flow and block diagram showing an exemplary video stream 50 from a sender 52 to an ARQ receiver 65 and a loss of several packets 55 (indicated as D2, D6, D8, and D9) and subsequent respective request packets 60 (indicated as R2, R6, R8-9). In general, a receiver requests resending packets several times during a time window in which a packet is in a receiver buffer (not shown in the figure). In the figure, sender 52 processes the receiver's request packets (R2, R6, R8-9) and sends respective recovery packets 62 (D3, D5, D10) back to the receiver on the main content stream (indicated by the arrows connecting the sender with the receiver).
A major shortcoming of such an ARQ system is that sometimes the IP link (i.e., the bandwidth between the sender and the receiver) may reach its capacity limit due to either a physical connection (e.g., ADSL/VDSL) or a capacity limit imposed by the service provider (e.g., a mobile network provider). As shown in FIG. 3, ARQ systems can send a burst of recovery packets in response to a burst of packet loss requests. The burst of recovery packets may block or interfere with the stream's packet flow, causing additional lost packets.
Some ARQ systems limit the link by employing traffic shaping, as known in the art. However, traffic shaping limits the bandwidth of both the stream and the recovery packets, and therefore does not address situations where recovery packets may block the media stream.
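The receiver-side detection of lost packets and the grouping of losses into requests may be sketched as follows, by way of example only. The sketch assumes monotonically increasing sequence numbers without wraparound; the function names are hypothetical, and the request labels merely mirror the labels of the figure rather than any actual RTCP message format.

```python
def missing_sequence_numbers(received: list[int], first: int,
                             last: int) -> list[int]:
    """Return the sequence numbers in [first, last] that never arrived."""
    seen = set(received)
    return [seq for seq in range(first, last + 1) if seq not in seen]


def _label(start: int, end: int) -> str:
    """Format one request: a single packet or a consecutive range."""
    return f"R{start}" if start == end else f"R{start}-{end}"


def coalesce_requests(missing: list[int]) -> list[str]:
    """Group consecutive missing packets so one request covers a whole run."""
    requests: list[str] = []
    start = prev = None
    for seq in sorted(missing):
        if start is None:
            start = prev = seq
        elif seq == prev + 1:
            prev = seq  # extend the current run of consecutive losses
        else:
            requests.append(_label(start, prev))
            start = prev = seq
    if start is not None:
        requests.append(_label(start, prev))
    return requests


# Packets 1..10 were sent; 2, 6, 8 and 9 were lost in transit.
lost = missing_sequence_numbers([1, 3, 4, 5, 7, 10], first=1, last=10)
requests = coalesce_requests(lost)  # ['R2', 'R6', 'R8-9']
```

Grouping consecutive losses into a single request keeps the number of request messages low when losses occur in bursts.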
HTTP ABR
Prior art HTTP ABR (HTTP adaptive bit rate) systems use a method which employs several encoding profiles of the same video content, which is split into segments using dedicated logic. In the specification and claims which follow hereinbelow, the term “profile” (as in “encoding profile”, for example) is intended to mean encoder settings for streaming, as known in the art. Included in a profile are: encoding method, resolution, bitrate, and additional information.
An original media stream is encoded in several profiles; each media stream is then split into segments. A segment is a portion of video pictures or frames and in most cases begins and ends on a GOP (group of pictures) boundary. Data relevant to respective segment bitrate and location are published to a client. The client then decides which segment to download, each segment having a predefined bitrate. The client may download several consecutive segments and use a specific algorithm to determine the order of successive segments. (One example of a specific algorithm could be one based on the time to download a segment.)
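One possible client-side algorithm based on the time to download a segment may be sketched as follows. This is a minimal illustration, not the algorithm of any particular ABR standard; the function names, the safety factor of 0.8, and the profile bitrates are assumptions introduced for the example.

```python
def estimate_throughput_kbps(segment_bits: int,
                             download_seconds: float) -> float:
    """Estimate link throughput from the last completed segment download."""
    return segment_bits / download_seconds / 1000.0


def choose_profile_kbps(profiles_kbps: list[int], throughput_kbps: float,
                        safety: float = 0.8) -> int:
    """Pick the highest profile bitrate that fits within a safety margin.

    Falls back to the lowest profile when none fits the budget.
    """
    budget = throughput_kbps * safety
    affordable = [p for p in sorted(profiles_kbps) if p <= budget]
    return affordable[-1] if affordable else min(profiles_kbps)


# Hypothetical profile ladder, in kbps.
profiles = [500, 1500, 3000, 6000]

# The last segment (12,000,000 bits) took 4 seconds to download.
throughput = estimate_throughput_kbps(12_000_000, 4.0)
next_profile = choose_profile_kbps(profiles, throughput)
```

Here the measured throughput is 3000 kbps, and the safety margin leads the client to select the 1500 kbps profile for the next segment rather than the 3000 kbps profile.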
The use of several segments requires a larger buffer (having a size of at least 2-3 segments), which yields a larger time delay for media streaming. The bitrate for respective profiles is fixed at the origin point, as known in the art, and the client must decide which segment to take from respective profiles. The entire media streaming process is time consuming (due to the need to buffer the segments).
Most ABR solutions employ the HTTP protocol as the signaling and data transfer protocol. The HTTP protocol is used to initiate the connection between client and server and to pass through the public internet and firewalls by means of the HTTP tunneling capability. HTTP adaptive streaming is widely adopted to tackle network aberrations and capacity changes. Several standards have been developed which differ from one another by signaling method or by underlying encoding, while having similar functionality. The underlying TCP protocol is very susceptible to network impairments such as packet loss and jitter.
Prior art solutions do not optimize available bitrate nor do they effectively overcome network-related inconsistencies, such as packet loss and bandwidth fluctuations.
ABR employs a scheme of several profiles to represent the same media source. The client downloads a section of the media and, if the operation is halted or stalled due to network aberrations or due to a capacity drop, a smaller size profile of that same media section is downloaded to try to overcome the problem. Profiles are prepared in advance to ‘guesstimate’ which bitrates are more likely to pass through the network, to allow the client to switch between one profile and another. The client must ensure sufficient buffer time to allow switching between profiles, storing several profiles, and then playing them out.
As noted hereinabove, high buffering requirements and time delays are experienced between the original media source and the output to the client. Most client solutions do not include an attempt to recover packet loss or to overcome large delays, as these issues are related to limitations of the TCP protocol. Generally, lowering bitrate profiles implies a lower likelihood of suffering from packet loss or capacity problems.
Another limitation of such methods is that link capacity cannot be fully utilized. This is because, while a profile segment is downloaded, the client has little information on the real link capacity/potential, so the only way to switch to a higher capacity profile is by trial-and-error. Studies have shown that, on average, effective ABR tends to use only 50-60% of available network capacity, as higher utilization of network capacity yields packet loss. Standards in use include HTTP Live Streaming (HLS), Adobe HTTP Dynamic Streaming (HDS), Microsoft Smooth Streaming and MPEG-DASH, all as known in the art, ref http://en.wikipedia.org/wiki/Adaptive_bitrate_streaming.
Workflow—Prior Art
An input video feed is encoded by a multi-profile encoder, which creates several profiles of the video stream. Each profile may have a different encoding bit rate and video resolution. Most techniques employ either an open group of pictures (GOP; having no definite number of frames) or a closed GOP (fixed number of frames per GOP), all as known in the art.
Profiles are packaged and segmented into blocks. Blocks are documented in a manifest/playlist. Respective block information is published with its properties (bitrate and resolution) and location, as known in the art. Profiles are stored in HTTP servers and may be downloaded by a remote client based on the information in a manifest/playlist.
Every client is responsible for pulling (downloading) segments to its local memory/buffer and for assuring a full segment download. For this purpose, the client monitors each new segment download. As a download is performed over HTTP, the download is susceptible to packet loss and to network jitter, which may increase segment download time. This is because the TCP protocol, which underlies HTTP, contributes a time delay until all the data has been downloaded.
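The information carried by a manifest/playlist may be modeled, purely for illustration, by the data structure below. The class names, field names, URL layout, and the placeholder host example.invalid are hypothetical and do not reproduce the syntax of HLS, HDS, Smooth Streaming, or MPEG-DASH manifests.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    index: int      # position of the block within the stream
    location: str   # where a client can download the block


@dataclass
class Profile:
    bitrate_kbps: int
    resolution: tuple[int, int]
    blocks: list[Block] = field(default_factory=list)


def build_manifest(profile_specs: list[tuple[int, tuple[int, int]]],
                   block_count: int, base_url: str) -> list[Profile]:
    """Publish, for every profile, its properties and block locations."""
    manifest = []
    for bitrate_kbps, resolution in profile_specs:
        blocks = [Block(i, f"{base_url}/{bitrate_kbps}k/block_{i}.ts")
                  for i in range(block_count)]
        manifest.append(Profile(bitrate_kbps, resolution, blocks))
    return manifest


def locate_block(manifest: list[Profile], bitrate_kbps: int,
                 index: int) -> str:
    """Look up where a given block of a given profile can be downloaded."""
    for profile in manifest:
        if profile.bitrate_kbps == bitrate_kbps:
            return profile.blocks[index].location
    raise KeyError(f"no profile at {bitrate_kbps} kbps")


manifest = build_manifest(
    [(1500, (1280, 720)), (3000, (1920, 1080))],
    block_count=3,
    base_url="http://example.invalid/stream",
)
location = locate_block(manifest, 3000, 2)
```

A client holding such a manifest can select a profile by bitrate and resolve the download location of the next block without further signaling from the server.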
During a download, the client can sense download speed and/or a network artifact adversely impacting the download. The client can decide to stop the current download and/or to select a different profile for the current or the next segment download. This approach assures a continuous download of segments, with each adaptive client using at least a triple-buffer approach to buffer several consecutive segments.
For segment N, the buffer will be constructed as:
        Segment (N−2): waiting to be played;
        Segment (N−1): last downloaded; and
        partial Segment (N): being downloaded.
For each standard (HLS, HDS, Smooth Streaming, etc.), a client should determine the amount of buffering and the number of segments stored prior to play out. A common practice in the art is to allocate sufficient buffering to store at least 3 segments. However, most solutions tend to have more buffering capacity than 3 segments to allow switching the last downloaded profile with a lower resolution in case of streaming problems.
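The triple-buffer arrangement and the time delay it implies may be sketched as follows. The sketch assumes segments of equal duration; the function names and the 6-second segment duration are hypothetical values chosen for the example.

```python
def buffer_layout(n: int) -> dict[str, int]:
    """Return the segment numbers held in a triple buffer for segment N."""
    if n < 2:
        raise ValueError("a triple buffer needs at least segments 0, 1 and 2")
    return {
        "waiting_to_play": n - 2,   # fully downloaded, next to be played
        "last_downloaded": n - 1,   # fully downloaded most recently
        "downloading": n,           # partially downloaded
    }


def buffering_delay_seconds(segments_buffered: int,
                            segment_seconds: float) -> float:
    """Delay added by buffering, assuming segments of equal duration."""
    return segments_buffered * segment_seconds


layout = buffer_layout(10)
# Three buffered segments of a hypothetical 6 seconds each.
delay = buffering_delay_seconds(3, 6.0)  # 18.0 seconds
```

With these assumed values, buffering alone places the client 18 seconds behind the original media source, before any encoding or network delay is counted.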
As several full segments are downloaded, the client starts to read the stream in an orderly fashion, as the segments are GOP-aligned (based upon a start and stop of the GOP boundary). This approach allows seamless stitching of the stream and smooth decoding, as known in the art.
Most multi-profile solutions known in the art do not maintain the same resolution across different profiles, nor is it mandatory to keep the same GOP structure. Most solutions tend to use MPEG-2 transport streams for profiles and carry PCR/PTS/DTS information. Most solutions have adopted the H.264 encoding standard. Some multi-profile solutions have adopted newer encoding options such as H.265, with the condition that encoding remains the same between profiles. Profiles may share a common PCR counter to sync DTS and PTS.
Because an adaptive streaming viewing experience may differ from one vendor to another, or from one location to another, adaptive streaming does not provide the same viewing experience characteristic of linear/conventional TV, which gives all viewers the same delay and viewing experience.
There is therefore a need to have a media streaming system that can operate over challenging network impairments, and which can provide the highest media bandwidth and shortest time delay to the receiver.