This invention relates to video data server technology and more specifically to video on demand systems based on parallel server architectures and related methods for implementation. Most specifically, the invention relates to load balancing and admission scheduling in pull-based parallel video servers.
Pull-based parallel video server configurations have been studied and described, as for example, Jack Y. B. Lee, “Parallel Video Servers—A Tutorial,” IEEE Multimedia, vol. 5(2), June 1998, pp. 20-28, and Jack Y. B. Lee, and P. C. Wong, “Performance Analysis of a Pull-Based Parallel Video Server,” IEEE Trans. on Parallel and Distributed Systems, vol. 11(12), December 2000, pp. 217-231. These configurations are not to be confused with the server-push service model, as for example described in the literature by W. J. Bolosky, J. S. Barrera, III, R. P. Draves, R. P. Fitzgerald, G. A. Gibson, M. B. Jones, S. P. Levi, N. P. Myhrvold, R. F. Rashid, “The Tiger Video Fileserver,” Proc. of the Sixth International Workshop on Network and Operating System Support for Digital Audio and Video. IEEE Computer Society, Zushi, Japan, April 1996; M. M. Buddhikot, and G. M. Parulkar, “Efficient Data Layout, Scheduling and Playout Control in MARS,” Proc. NOSSDAV '95, 1995; and M. Wu, and W. Shu, “Scheduling for Large-Scale Parallel Video Servers,” Proc. Sixth Symposium on the Frontiers of Massively Parallel Computation, October 1996, pp. 126-133.
The following is Table 1, a table with notations and typical numerical values used for evaluation hereinafter:
| Symbol | Description | Value |
| --- | --- | --- |
| NS | Number of servers | 8 |
| NC | Number of clients | 80 |
| Q | Video stripe size | 65536 bytes |
| LC | Number of client buffers | n/a |
| Tavg | Average inter-request generation time | 0.437 s |
| TDV | Maximum deviation for request generation time interval | 0.29 s |
| Tround | Round time for the admission scheduler | 3.495 s |
| Nslot | Number of slots in the admission scheduler | 80 |
| Tslot | Length of an admission scheduler slot | 0.0437 s |
| dA | Variable for client-scheduler delay | n/a |
| DA | Average client-scheduler delay | 0.05 s |
| DA+, DA− | Jitter bounds for client-scheduler delay | 0.005 s |
| dS | Variable for client-server delay | n/a |
| DS | Average client-server delay | 0.05 s |
| DS+, DS− | Jitter bounds for client-server delay | 0.005 s |
| ToutA, ToutS | Retransmission timeout threshold for the client-scheduler and client-server control paths | 0.11 s |
| NretxA, NretxS | Maximum number of retransmissions for the client-scheduler and client-server control paths | 3 |
| ρA, ρS | Packet loss probability for the client-scheduler and client-server control paths | 10^-2 |
| β | Maximum tolerable packet loss probability for control paths | 10^-6 |
| DPA+, DPS+ | Delay jitter bounds due to retransmission in the client-scheduler and client-server control paths | 0.22 s |
| NA | Number of replicated admission schedulers | n/a |
| DF | Maximum delay in detecting a scheduler failure | n/a |
| Thb | Time interval for periodic heartbeat packets | n/a |
| Nhb | Maximum number of consecutive lost packets to declare scheduler failure | 5 |
| Dmax | Maximum service delay at the video servers | n/a |
A parallel video server has multiple independent servers connected to client hosts by an interconnection network. The interconnection network can be implemented using packet switches such as FastEthernet or ATM switches. Each server has separate CPU, memory, disk storage, and network interface. The so-called share-nothing approach ensures that the scalability of the system will not be limited by resource contention. Through the interconnection network (e.g. a packet switch) a client retrieves video data from each server block by block and re-sequences the video data for playback. The number of servers in a system may be denoted by NS and the number of clients by NC.
The principle behind parallel video server architecture is the striping of a video title across all servers in a system. A server's storage space may be divided into fixed-size stripe units of Q bytes each. Each video title is then striped into blocks of Q bytes and stored into the servers in a round-robin manner as shown in FIG. 2. The fixed-size block striping algorithm is called “space striping” in Lee, “Parallel Video Servers—A Tutorial,” cited above, as opposed to striping in units of video frames, called “time striping.” Since a stripe unit in space striping is significantly smaller than a video title (kilobytes versus megabytes), this enables fine-grain load sharing among servers. Hereafter, the invention will be described in terms of space striping.
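The round-robin placement described above can be illustrated with a minimal Python sketch. The function names `stripe_title` and `reassemble` are hypothetical and chosen only for illustration; the sketch assumes stripe unit i is stored on server mod(i, NS) and that the client re-sequences blocks in the same order.

```python
def stripe_title(data: bytes, num_servers: int, stripe_size: int) -> list[list[bytes]]:
    """Split a title into stripe units of stripe_size bytes (Q) and
    place stripe unit i on server i mod num_servers (N_S)."""
    servers: list[list[bytes]] = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), stripe_size):
        block_index = offset // stripe_size
        servers[block_index % num_servers].append(data[offset:offset + stripe_size])
    return servers


def reassemble(servers: list[list[bytes]]) -> bytes:
    """Client-side re-sequencing: block i is the (i // N_S)-th unit held by server i mod N_S."""
    num_servers = len(servers)
    total_blocks = sum(len(units) for units in servers)
    return b"".join(
        servers[i % num_servers][i // num_servers] for i in range(total_blocks)
    )
```

Because consecutive blocks land on different servers, a client reading the title sequentially visits every server once per NS blocks, which is what spreads the average load evenly.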
The use of parallelism at the server level not only breaks through the capacity limit of a single server but also enables the use of redundancy to achieve server-level fault tolerance. Unlike server replication and data partitioning, in a parallel scheme a video title to be made available is divided into small units and then distributed over servers in a parallel video server in a technique called server striping. Video data units of a video title are then retrieved from the servers according to a striping policy (space and/or time) for delivery to clients over a communication network.
As a video title is distributed across all servers in the system, one must first retrieve video blocks from the corresponding servers and then merge them back into a single video stream before submitting it to the client for playback. In general, the video data merging process (called a proxy) can be implemented in the server (proxy-at-server), in a separate computer (independent proxy), or at the client computer (proxy-at-client). Hereinafter, the system described employs a proxy-at-client architecture. The reasons for this choice are twofold: (a) lower cost, since no additional inter-server data transfer (proxy-at-server) or additional hardware (independent proxy) is needed; and (b) better fault tolerance, since failure of the proxy affects only the client running on the same computer.
The term “service model” refers to the way in which video data are scheduled and delivered to a client. There are two common service models: client pull and server push. In the client-pull model, a client periodically sends requests to a server to retrieve video data. In this model, the data flow is driven by the client. In the server-push model, the server schedules the periodic retrieval and transmission of video data once a video session has started.
In the client-pull service model, each request sent from a client is served at the server independently of all other requests. Hence, the servers need not be clock-synchronized, since synchronization is implicit in the client requests. Hereafter, it is assumed that the client-pull service model is used. Without loss of generality, it will be assumed a client sends request i (i≧0) to server mod(i,NS). Each request will trigger the server to retrieve and transmit Q bytes of video data.
An issue in parallel video server Video on Demand systems not found in conventional single-server Video on Demand systems is known as load balancing. While the server striping of video titles over the servers using small stripe size ensures that the average load is balanced, the instantaneous load at the servers may vary due to randomness in the system. This instantaneous load imbalance can temporarily degrade the server's performance and cause video playback interruptions at the client.
In order to better understand the invention, it is helpful to consider an analytical model of the request generation process in a pull-type service-based system. A portion of this model was previously developed by the inventor and reported in “Performance Analysis of a Pull-Based Parallel Video Server,” cited above. Assuming the system uses a credit-based flow control algorithm to manage the data flow from the servers to the client, the client maintains LC buffers (each Q bytes) of video data to absorb system delay variations. Before playback starts, the client will first pre-fetch the first (LC−1) buffers, and then request one more video block whenever the head-of-line video block is submitted to the video decoder for playback.
Assume that the video client generates requests with an average inter-request time interval of Tavg seconds. To account for variations in the request-generation process, let TDV be the maximum deviation for the process, such that the time span t between any k consecutive requests is bounded by

$$\max\{(k-1)T_{avg} - T_{DV},\ 0\} \le t \le (k-1)T_{avg} + T_{DV} \qquad (1)$$

Since a client generates requests to the NS servers in a round-robin manner, the corresponding time span between any k consecutive requests sent to the same server is bounded by

$$\max\{(k-1)N_S T_{avg} - T_{DV},\ 0\} \le t \le (k-1)N_S T_{avg} + T_{DV} \qquad (2)$$
With this request-generation model, it can be shown that:
Theorem 1 Assume that n clients generate requests independently and that each client sends requests to the NS servers in the system in a round-robin manner. Then the minimum time for a server to receive k video data requests is given by

$$T^{min}_{Request}(k,n) = \max\left\{\left(\left\lceil \frac{k}{n} \right\rceil - 1\right) N_S T_{avg} - T_{DV},\ 0\right\} \qquad (3)$$
Regardless of the number of servers in the system, Theorem 1 shows that a server can receive up to n requests simultaneously ($T^{min}_{Request}(k,n) = 0$ for k ≤ n) if multiple clients happen to be synchronized. This client-synchrony problem has been previously shown to severely limit the scalability of the system.
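As an illustration only, the bound of Theorem 1 can be evaluated numerically with the following Python sketch. The function name `t_request_min_unscheduled` is hypothetical; the numerical values in the usage note are the typical values listed in Table 1.

```python
def t_request_min_unscheduled(k: int, n: int, n_s: int, t_avg: float, t_dv: float) -> float:
    """Equation (3): minimum time for one server to receive k requests
    from n independent clients when no admission scheduler is used."""
    rounds = -(-k // n)  # integer ceiling of k / n
    return max((rounds - 1) * n_s * t_avg - t_dv, 0.0)
```

With n = 80, NS = 8, Tavg = 0.437 s and TDV = 0.29 s, the bound is zero for every k up to 80 and becomes positive only at k = 81, which is the client-synchrony case discussed above.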
To prevent instantaneous load imbalance, an admission scheduler is used to explicitly schedule the start times of new video sessions to avoid synchrony. Previously, the inventor with others proposed a staggering scheme, depicted in the first line (a) of FIG. 3 (Prior Art), for use in the admission scheduler. The scheduler maintains an admission map of length Tround seconds, which is divided into Nslot slots, each of length (in seconds)

$$T_{slot} = T_{round} / N_{slot} \qquad (4)$$

Each time slot has two states: free or occupied. When a client wants to start a new video session, it will first send a request to the scheduler. Ignoring processing delays and assuming the request arrives at the scheduler at time t, the scheduler will admit the new session if and only if the time slot n is free, where n is given by

$$n = \left\lceil \operatorname{mod}(t, T_{round}) / T_{slot} \right\rceil \qquad (5)$$

This is illustrated in the second line (b) of FIG. 3b (Prior Art).
To admit a new session, the scheduler will send a response back to the client when slot n begins and mark the corresponding time slot as occupied until the session terminates. Conversely, if the requested time slot is already occupied, the scheduler will wait (effectively increasing t) until a free time slot is available, as illustrated in the third line (c) of FIG. 3 (Prior Art). With the setting of Tround=NSTavg, one derives the worst-case load in Theorem 2 below.
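The staggering scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scheduler of the cited prior work: the class and method names are hypothetical, an arrival exactly on a round boundary (where the slot formula yields 0) is assumed to map to the last slot, and waiting for a free slot is modeled simply as advancing to the next slot index.

```python
import math


class AdmissionScheduler:
    """Admission map of n_slot slots covering t_round seconds."""

    def __init__(self, t_round: float, n_slot: int) -> None:
        self.t_round = t_round
        self.n_slot = n_slot
        self.t_slot = t_round / n_slot        # Equation (4)
        self.occupied = [False] * n_slot      # index 0 holds slot 1

    def slot_for(self, t: float) -> int:
        """Equation (5): n = ceil(mod(t, T_round) / T_slot)."""
        n = math.ceil(math.fmod(t, self.t_round) / self.t_slot)
        if n < 1:
            # Assumption: an arrival exactly on a round boundary uses the last slot.
            return self.n_slot
        return min(n, self.n_slot)            # guard against floating-point overshoot

    def admit(self, t: float):
        """Return (slot, slots_waited), or None if every slot is occupied."""
        first = self.slot_for(t)
        for waited in range(self.n_slot):
            slot = (first - 1 + waited) % self.n_slot + 1
            if not self.occupied[slot - 1]:
                self.occupied[slot - 1] = True
                return slot, waited
        return None

    def release(self, slot: int) -> None:
        """Mark a slot free again when its session terminates."""
        self.occupied[slot - 1] = False
```

For example, with `AdmissionScheduler(8.0, 8)` two requests arriving at t = 0.5 and t = 0.7 both map to slot 1; the first is admitted there and the second is deferred to slot 2, so the two sessions start one slot length apart.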
Theorem 2 If the admission scheduler is used with parameters Tround=NSTavg and there are n clients, then the minimum time for a server to receive k video data requests is given by

$$T^{min}_{Request}(k,n) = \begin{cases} \max\{u N_S T_{avg} - T_{DV},\ 0\}, & \operatorname{mod}(k,n) = 1 \\ \max\{u N_S T_{avg} - T_{DV} + v T_{slot},\ 0\}, & \text{otherwise} \end{cases} \qquad (6)$$

where $u = \lfloor (k-1)/n \rfloor$ and $v = \operatorname{mod}(k-1, n)$.
Compared with Theorem 1, the requests are spread out by the admission scheduler so that the worst-case load is substantially reduced.
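To make the comparison concrete, the two bounds can be evaluated side by side. The following Python sketch is illustrative only; the function names are hypothetical, and the numerical values in the usage note are the typical values from Table 1.

```python
def t_request_min_unscheduled(k: int, n: int, n_s: int, t_avg: float, t_dv: float) -> float:
    """Theorem 1, Equation (3): no admission scheduler."""
    rounds = -(-k // n)  # integer ceiling of k / n
    return max((rounds - 1) * n_s * t_avg - t_dv, 0.0)


def t_request_min_scheduled(k: int, n: int, n_s: int, t_avg: float,
                            t_dv: float, t_slot: float) -> float:
    """Theorem 2, Equation (6): admission scheduler with T_round = N_S * T_avg."""
    u = (k - 1) // n
    v = (k - 1) % n
    if k % n == 1:
        return max(u * n_s * t_avg - t_dv, 0.0)
    return max(u * n_s * t_avg - t_dv + v * t_slot, 0.0)
```

With n = 80, NS = 8, Tavg = 0.437 s, TDV = 0.29 s and Tslot = 0.0437 s, the scheduled bound is already positive at k = 8, whereas the unscheduled bound remains zero until k = 81. Under these values, at most 7 requests can coincide at one server in the worst case instead of 80.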
A key performance measure of a pull-based VoD system is service delay at the video server, denoted by Dmax. Service delay is defined as the time from the server receiving a client request to the time the requested video block is completely transmitted. This service delay determines the amount of buffer needed at the client to ensure video playback continuity. As the service delay generally increases with the number of concurrent video sessions, it effectively imposes a limit on the maximum number of concurrent video sessions supportable by the system. Given the disk model, network model, and the bounds in Theorem 2, an upper bound for the service delay can be derived. This maximum service delay is used to evaluate the performance of the system under different parameters.
It has been shown previously that an admission scheduler can effectively prevent instantaneous load imbalance and allow the system to scale up to a large number of servers. However, there were two assumptions: (a) there is no network delay; and (b) there is no packet loss in delivering control messages. The model heretofore described and taken from the inventor's prior work in “Performance Analysis of a Pull-Based Parallel Video Server,” cited above, does not incorporate the effect of network delay and delay jitter, nor does it consider packet loss.
A problem not considered in the prior model developed by the inventor is packet loss in the client-scheduler link, as well as in the client-server link. While packet loss is relatively infrequent in today's high-speed networks, it still cannot be ignored. First, losing control packets between a client and the scheduler will render the system's state inconsistent. For example, if the admission-accept response sent from the scheduler to a client is lost, the client may have to wait a complete schedule period of NSTavg before discovering the packet loss, since in the worst case the admission scheduler may indeed need to delay the admission of a new session due to the staggering requirement. Meanwhile, the assigned time slot will be occupied for the same duration even though the client never starts the video session. Consequently, new admission requests may be rejected even if the system is running below capacity. Second, losing control packets in the client-server link will result in missing video blocks, since the server only sends video data upon receiving a client request. Therefore, the control paths for both the client-scheduler link and the client-server link must be reliable.
To tackle the packet-loss problem, one may use a reliable transport protocol to carry control packets. However, unlike conventional data applications, the choice of the transport protocol could have a significant impact on the system's performance. To see why, consider using TCP as the transport protocol for the client-scheduler link. If packet loss occurs, the TCP protocol will time out and retransmit the packet until either it is correctly delivered, or the link is considered to have failed. Since most transport protocols (including TCP) make use of adaptive algorithms to dynamically adjust the timeout threshold, the timeout will be increased substantially if multiple retransmissions are needed.
In practice, the worst-case delay introduced by such transport protocols could go up to tens of seconds. Compared with the average network delay (on the order of milliseconds), such a delay would significantly increase the worst-case load at a server if such a transport protocol were used for carrying control traffic.
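The growth of the worst-case delay under an adaptive timeout can be illustrated with a short Python sketch. It assumes a simple policy in which the timeout is multiplied by a constant factor after every retransmission, optionally up to a cap; the function name, the 1-second initial timeout in the usage note, and the doubling factor are illustrative assumptions rather than the behavior of any particular protocol stack.

```python
def cumulative_backoff_delay(initial_timeout: float, retransmissions: int,
                             factor: float = 2.0, cap=None) -> float:
    """Total time spent waiting on timeouts when `retransmissions`
    consecutive timeouts expire, with the timeout multiplied by
    `factor` after each one (factor=1.0 models a fixed timeout)."""
    total = 0.0
    timeout = initial_timeout
    for _ in range(retransmissions):
        total += timeout
        timeout *= factor
        if cap is not None:
            timeout = min(timeout, cap)
    return total
```

Under these assumptions, an initial timeout of 1 s that doubles on each of five consecutive timeouts accumulates 1 + 2 + 4 + 8 + 16 = 31 s of delay, whereas a fixed threshold (factor 1.0) grows only linearly with the number of retransmissions.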
It has been determined that instantaneous load imbalance can occur and significantly hamper the performance of a pull-type parallel video system. While an admission scheduler is critical for maintaining instantaneous load balance across servers in the system, it can also become a single point of failure for the entire system. An architecture and supporting processes are therefore needed to avoid points of failure and performance degradation in pull-based architectures.