A continuous media file server system is designed to serve continuous data streams, such as audio and video data files, to multiple clients. As an example, a file server system might simultaneously supply multiple digital data streams, each in the 1–10 megabits-per-second (Mb/s) range, to thousands of clients.
General Architecture
FIG. 1 shows a continuous media file server system 20 developed by Microsoft Corporation. The file server system is a distributed, scalable, and fault-tolerant server that can serve many continuous data streams simultaneously to a large number of clients. The file server system 20 has a central controller 22 connected to multiple data servers 24(1), 24(2), 24(3), . . . , 24(K) via a low bandwidth control network 26. The controller 22 receives requests from clients, such as requests for starting and stopping a particular data file. The controller 22 is responsible for initiating delivery of streaming content to the requesting clients, including such tasks as locating the data server that holds the first block of data in the requested data file. The controller and data servers can be implemented, for example, as general purpose computers.
Each data server 24 supports at least one storage device, such as a disk, as represented by storage disks 28(1), 28(2), . . . , 28(M) connected to data server 24(1). The disks 28 are attached to their respective data server 24 via one or more buses 30 (e.g., SCSI, Fiber Channel, EIDE, etc.). The number and configuration of storage disks are flexible, but within a given file server 20, all data servers 24 support the same number of storage disks 28. The storage disks can store large amounts of digital data, with example disk capacities of many Gigabytes. The storage capacity of the entire media file server 20 consists of the usable storage space on the storage disks. An operator can change the storage capacity of the file server by adding or removing one or more storage disks to or from each data server, or adding or removing one or more of the data servers to which the disks are connected.
The data servers 24 are connected to a high-speed network switch 32 via network interfaces 34 (e.g., network card). The network switch 32 takes the data segments read from the storage disks, orders them into a continuous stream, and distributes the streams over a network to the clients. The network switch 32 also provides high bandwidth, parallel communication between the data servers 24. Additionally, the controller 22 may be connected to the data servers 24 through the network switch 32, as opposed to a separate control network 26. As an example, the network switch 32 can be implemented using fiber optics and ATM (Asynchronous Transfer Mode) switches.
Each data server 24 contains a memory buffer, as represented by buffer 36 in data server 24(1). The buffer 36 temporarily stores data that is read from the disks 28(1)–28(M) and is to be output to the network switch 32.
The continuous media file server system 20 can be implemented in different contexts. For instance, the file server system 20 might function as a head end server in an interactive television (ITV) system, which serves audio and video files over a distribution network (e.g., cable, satellite, fiber optic, etc.) to subscriber homes. The file server system 20 might alternatively operate as a content provider that distributes data files over a network (e.g., Internet, LAN, etc.) to multiple client computers.
Data Striping
It is likely that some pieces of content will be more popular than others. For example, the top ten percent of movies ordered by popularity might garner 70% of the load, while the remaining 90% of the content attracts only 30% of the viewers. To avoid disproportionate use of storage disks 28 and data servers 24 (i.e., by overburdening the disks and data servers holding popular content while leaving other disks and data servers underutilized), the continuous media file server system 20 stripes all of the data files across all of the storage disks 28 and all of the data servers 24. When a client requests a data stream, all data servers 24 share in the distribution of that stream, each supplying a portion of the data stream in turn. In this way, the load is spread over all of the storage disks 28 and data servers 24 regardless of the data file's popularity.
Prior to this invention, the data streams were served at a constant data transmission bit rate. With this assumption, each data file could be broken into “blocks” of fixed temporal width. A block represented the amount of physical space allocated on a disk to hold one time unit of data, and could be expressed in terms of bytes. The temporal duration required to play the data in the block is known as a “block play time”. For a data rate of 1 Mb/s, for example, the block size might be 1 Megabit and the block play time might be one second. In the conventional file server, a single block play time is established for all data files, resulting in a fixed-size data block.
FIG. 2 shows an example file server disk array 40 consisting of six data servers 0–5, each supporting two storage disks. Each disk stores data blocks, as represented by the labeled rectangles such as “A0”, “A6”, etc. Data files are striped across every storage disk of every server. For each data file, a starting disk is chosen to hold the first data block. For instance, the first block of data file A, designated as block “A0”, is stored on disk 0 of data server 0. A server index is incremented, and the next block in the file (i.e., block “A1”) is placed on disk 0 of server 1. The striping continues across the first disks of each server.
When the last server 5 is reached, the striping pattern wraps and continues with the next disks of each server. More specifically, when the server index reaches the number of servers in the system, a disk index is incremented (modulo the number of disks per server) and the server index is reset to 0. In FIG. 2, after data block A5 is placed on disk 0 of server 5, the next block in the file (i.e., block “A6”) is placed on disk 1 of server 0. Block A7 is then placed on disk 1 of server 1, and so on. This process continues until all the data blocks of the video file have been assigned to disks.
The process is then repeated for each subsequent data file. Typically, the striping pattern starts the various data files on different starting disks. In FIG. 2, two data files A and B are shown. Data file A begins on disk 0 of server 0, and data file B begins on disk 0 of server 1.
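The striping pattern described above can be sketched as a small placement function. This is a minimal illustration, not the file server's actual code; the function name and parameters are hypothetical, and it assumes every file follows the same round-robin order and differs only in its starting server and disk.

```python
def block_location(block_index, num_servers, disks_per_server,
                   start_server=0, start_disk=0):
    """Return (server, disk) for one block under round-robin striping.

    Disks are visited in the order: disk 0 of every server, then disk 1 of
    every server, and so on, wrapping back to disk 0 after the last disk.
    All names here are illustrative assumptions.
    """
    total_disks = num_servers * disks_per_server
    # Position of the file's first block in the global disk ordering.
    start_position = start_disk * num_servers + start_server
    position = (start_position + block_index) % total_disks
    # The server index cycles fastest; the disk index advances on each wrap.
    disk, server = divmod(position, num_servers)
    return server, disk


# FIG. 2 example: six servers with two disks each.
print(block_location(0, 6, 2))                   # block A0
print(block_location(6, 6, 2))                   # block A6
print(block_location(3, 6, 2, start_server=1))   # block B3
```

With these assumptions, file A (starting on disk 0 of server 0) and file B (starting on disk 0 of server 1) land on the servers and disks named in the text.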
The striping pattern generally prescribes that the data blocks are sequentially ordered across ordered disks, but the sequential blocks need not reside at the same physical block address on adjacent disks. For instance, the striping pattern of files A and B results in the storage of sequential blocks B3 (disk 0, server 4) and B4 (disk 0, server 5) at different physical locations on the two disks (location 3 for block B3 and location 2 for block B4). Accordingly, sequential data blocks can reside at entirely different physical block locations within the contiguous disks. The block locations in the disk array are described by file metadata that is stored either in memory or on disk. It is noted that other patterns are possible.
To play a data file, the file server system 20 serves the data blocks sequentially from the storage disks, one block at a time. The data blocks are read from each disk, stored temporarily in buffer memory 36 at the server 24, and transmitted to the network switch 32 in order. When file A is requested by a client, for example, block A0 is read from disk 0 (server 0) and transmitted via server 0 to the network switch for the duration of a block play time. Next, block A1 is read from disk 0 (server 1) and transmitted via server 1 to the network switch for the duration of a block play time. The striping arrangement enables continuous and ordered cycling of the servers (i.e., server 0, server 1, . . . , server 5, server 0, etc.), and the disks attached to the server (i.e., disk 0, disk 1, disk 0, etc.). The network switch sequences among the servers to output a continuous data stream A to the requesting client.
Declustered Mirroring
Over time, components are expected to fail. To anticipate this possibility, the file server system 20 employs a data mirroring technique in which the primary data is duplicated and the redundant copy is also maintained on the disks. The data mirroring is illustrated conceptually in FIG. 2, wherein the disks are divided in half with the upper half of the disks storing the primary data and the lower half of the disks storing redundant data.
The two copies of each file are stored on separate servers, in case an entire server or disk fails. One way of accomplishing this is to store all of the data from server 0's disks redundantly on server 1's disks, all of the data from server 1's disks redundantly on server 2's disks, and so on. However, if server 0 were to fail in this arrangement, the workload of server 1 would double because it would have to support its original distribution of video data plus the distribution of video data for server 0. If each server is configured to support twice its workload, the servers are using only half of their resources during normal operation when there are no failures in the system.
To avoid this inefficiency, each block of the redundant data is split into multiple pieces, and the pieces are distributed among the disks of multiple servers. This process is known as “declustering”, and the number of pieces into which each block is split is known as the “decluster factor”.
FIG. 2 shows a disk configuration with a decluster factor of two, meaning there are two redundant pieces for every primary data block. The data for server 0's disks are stored redundantly on the disks of servers 1 and 2; the data for server 1's disks are stored redundantly on the disks of servers 2 and 3; and so on. With a decluster factor of two, the mirror half of the storage disks can be further conceptualized as having two regions: a first region to store the first redundant piece (i.e., X.1) and a second region to store the second redundant piece (i.e., X.2). As an example, primary data block A0 (disk 0, server 0) is split into two redundant pieces “A0.1” and “A0.2” in which the first redundant piece A0.1 is stored in region 1 of disk 0 of server 1 and the second redundant piece A0.2 is stored in region 2 of disk 0 of server 2.
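The placement of redundant pieces can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and it generalizes the A0 example by assuming that piece k of any block is stored in region k, on the same disk index, of the k-th server after the primary server (wrapping around at the last server).

```python
def mirror_locations(server, disk, num_servers, decluster_factor):
    """Return a list of (server, disk, region) tuples, one per redundant piece.

    Assumption (generalized from the A0 example): piece k goes to region k
    of the same disk index on server (server + k) modulo num_servers.
    """
    return [((server + piece) % num_servers, disk, piece)
            for piece in range(1, decluster_factor + 1)]


# Block A0 lives on disk 0 of server 0; decluster factor of two.
for mirror_server, mirror_disk, region in mirror_locations(0, 0, 6, 2):
    print(f"piece {region}: server {mirror_server}, disk {mirror_disk}, region {region}")
```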
If the server carrying the primary data fails, the mirrored data on the other servers is used. Suppose, for example, that server 0 fails. When it comes time to serve data block A6 (originally on disk 1, server 0), server 1 reads and outputs the first redundant piece A6.1 and server 2 reads and outputs the second redundant piece A6.2.
The declustered mirroring technique results in a more even distribution of increased workload among the operable servers in the event that one server (or disk) fails. This is because when a component fails, several other servers share the work of making up for the failed component. In our example of a small decluster factor of two, the increased burden to a data server is only fifty percent (i.e., its own workload and half of the failed server's workload), rather than a doubling of workload that would be needed in the absence of declustering. As the decluster factor increases, the additional burden shared by the non-failed servers is reduced.
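The arithmetic behind this claim can be made explicit. The sketch below assumes, as the fifty-percent example does, that the failed server's workload is divided evenly among as many surviving servers as the decluster factor; the function name is hypothetical.

```python
from fractions import Fraction


def failover_load_factor(decluster_factor):
    """Workload of a server covering for a failed neighbor, relative to normal.

    Assumes the failed server's load is split evenly across
    `decluster_factor` surviving servers.
    """
    return 1 + Fraction(1, decluster_factor)


for factor in (1, 2, 4, 8):
    print(factor, failover_load_factor(factor))
```

A decluster factor of one corresponds to the simple mirroring arrangement in which a single neighbor absorbs the whole load and its workload doubles.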
Centralized Disk Scheduling
Due to the striping arrangement and disk configuration shown in FIG. 2, all servers share in the distribution of a data stream, each supplying the ordered blocks of data in turn. This shared operation requires a mechanism to determine when each server should provide data for each stream. Such a mechanism is provided by a time-ordered schedule that specifies, for each server 24, when to read each block of data from disk and when to transmit this data over the network 32.
In one prior implementation, the file server system 20 relies on a centralized scheduler that is maintained by the central controller 22 (FIG. 1). With a centralized scheduler, the controller 22 periodically sends messages to the servers 24, telling them what operations to perform in the near future. The schedule is defined to guarantee that, once streams are admitted, they can be serviced in a deterministic fashion to ensure availability of system resources when needed to distribute the streams. Thus, the schedule serves both as a description of when data is to be read and transmitted and also as an indication of resource allocation. There are three main resources that are allotted to the data streams: disk bandwidth, network bandwidth, and buffer memory.
The schedule for a single-rate file server is one of disk operations, and hence is referred to as a “disk schedule”. The temporal length of the disk schedule is the block play time multiplied by the number of disks in the system. In the FIG. 2 example with 12 disks and a block play time of one second, the disk schedule has a temporal length of 12 seconds.
FIG. 3 shows a disk schedule 42 for a six-server, two-disk file system. The disk schedule 42 is divided into time slots 44, the width of which is determined by the amount of time necessary to service a single data block, a duration known as the “block service time”. This time is equal to the block play time divided by the number of streams that can be supported per disk. This number is not necessarily integral; a fractional number of streams per disk may be supported. If the stream distribution capacity of a particular instance of the file server 20 is limited by disk performance, the block service time is equal to the time to read one block of data from the disk, including both seek time and data transfer time. Alternatively, if the stream distribution capacity of a particular instance of the file server 20 is limited by some other factor, such as network performance or I/O bus bandwidth, the block service time is calculated as the block play time divided by the number of supported streams per server, with the result multiplied by the number of disks per server.
In FIG. 3, the block service time of the schedule 42 is one-half of the block play time (i.e., ½ second), indicating that each disk can support two data streams. Accordingly, each slot 44 is one-half second in duration, yielding twenty-four slots 44 in the twelve-second disk schedule 42. The slots 44 are individually labeled as S0–S23 for identification purposes. In this example, the block service time is atypically high for ease of illustration. More typically, a disk can support between 5 and 20 data streams, depending upon the data transmission rate, resulting in a much lower block service time.
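The schedule dimensions follow directly from the definitions above. The sketch below reproduces the FIG. 3 numbers; the function and parameter names are hypothetical, and the second function encodes one reading of the non-disk-limited case (block play time divided by streams per server, then multiplied by disks per server).

```python
def schedule_geometry(num_servers, disks_per_server,
                      block_play_time, streams_per_disk):
    """Return (schedule_length, block_service_time, num_slots).

    streams_per_disk may be fractional, in which case num_slots is too.
    """
    total_disks = num_servers * disks_per_server
    schedule_length = block_play_time * total_disks
    block_service_time = block_play_time / streams_per_disk
    num_slots = total_disks * streams_per_disk
    return schedule_length, block_service_time, num_slots


def server_limited_service_time(block_play_time, streams_per_server,
                                disks_per_server):
    """Block service time when a per-server resource is the bottleneck."""
    return block_play_time / streams_per_server * disks_per_server


# FIG. 3: six servers, two disks each, one-second blocks, two streams per disk.
length, service_time, slots = schedule_geometry(6, 2, 1.0, 2)
print(length, service_time, slots)
```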
Each server's workload is kept low enough that there is sufficient remaining capacity for reading and transmitting declustered redundant blocks, in the event that a neighboring server fails. This is accomplished by increasing the block service time to allow for this additional workload. The exact factor by which this is increased depends upon the limiting resource in the system, but it is typically somewhat greater than 1/(decluster factor).
Requests for data files are assigned a slot in the schedule 42. Here, nine data streams 0–8 are presently scheduled. In theory, the disk schedule 42 determines when the disk read operations on each server are performed for each stream 0–8. In practice, disk reads are generally performed earlier than the scheduled times, although the lead time is bounded by a system configuration parameter. Network operations are not explicitly scheduled; rather, the beginning of each data transmission immediately follows the scheduled completion of the disk read.
As shown in FIG. 3, there is a pointer into the schedule 42 for each disk of each server, spaced at intervals of one block play time. The pointers are labeled in FIG. 3 in a format “Server #, Disk #” to reference the appropriate server and disk. The pointers move to the right in this illustration, while the schedule 42 remains stationary. Every twelve seconds, each pointer winds up back where it started. At the instant shown in FIG. 3, disk 1 of server 3 is scheduled to be in progress of reading a data block for stream 5; disk 1 of server 1 is scheduled to read a block for stream 1; disk 0 of server 3 is scheduled to read a block for stream 3; and disk 0 of server 1 is scheduled to read a block for stream 4.
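The pointer spacing can be sketched as a function of time. This is an illustrative model, not the layout of FIG. 3: it assumes the pointer for disk 0 of server 0 sits at schedule position 0 at time 0, and that each subsequent disk in striping order trails the previous one by one block play time, so that a stream's slot is reached by consecutive disks one block play time apart. All names are hypothetical.

```python
def pointer_position(server, disk, now, num_servers,
                     disks_per_server, block_play_time):
    """Schedule position (in seconds) of one disk's pointer at time `now`.

    Assumes an arbitrary origin: the pointer for server 0, disk 0 is at
    position 0 when now == 0.
    """
    total_disks = num_servers * disks_per_server
    schedule_length = total_disks * block_play_time
    # Rank of this disk in the striping order (disk 0 of all servers first).
    stripe_position = disk * num_servers + server
    return (now - stripe_position * block_play_time) % schedule_length


# A slot at position 0.0 is reached by server 0 at t=0 and by server 1 at t=1.
print(pointer_position(0, 0, 0.0, 6, 2, 1.0))
print(pointer_position(1, 0, 1.0, 6, 2, 1.0))
```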
Even though data blocks are only being read for a fraction of the streams at any given time, data is being transmitted for all streams at all times. At the instant shown in FIG. 3, data is being transmitted for each stream from the server as indicated in the following table:
Stream  Server  Disk
0       4       0
1       0       1
2       5       1
3       2       0
4       0       0
5       2       1
6       3       0
7       0       1
8       2       0
In the above table, server 0 is currently transmitting stream 1, while server 5 is concurrently transmitting stream 2, and so on. Notice also that while preceding servers are transmitting the data block, the next servers in order are reading the next data block from the disks. In this example, while server 0 is transmitting a block for stream 1, the next server 1 is currently reading the next block for stream 1. Server 1 will then transmit this next block following the transmission of the current block by server 0.
As time progresses, the controller 22 advances the pointers through the schedule 42, leading the actual value of time by some amount that is determined by the system configuration parameter. This lead allows sufficient time for processing and communication, as well as for reading the data from the disk. When the pointer for a server reaches a slot that contains an entry for a stream, the controller 22 determines which block should be read for that stream, and it sends a message to the appropriate server. The message contains the information for the server to process the read and transmission, including the block to be read, the time to begin the transmission, and the destination of the stream.
U.S. Pat. No. 5,473,362, entitled “Video on Demand System Comprising Stripped (sic) Data Across Plural Storable Devices With Time Multiplex Scheduling,” which was filed on Nov. 30, 1993 and issued on Dec. 5, 1995, in the names of Fitzgerald, Barrera, Bolosky, Draves, Jones, Levi, Myhrvold, Rashid and Gibson, describes the striping and scheduling aspects of the continuous media file server 20 in more detail. This patent is assigned to Microsoft Corporation and incorporated by reference. In this document, the file server described in U.S. Pat. No. 5,473,362 is generally referred to as a “centralized file server system”.
Scheduling New Streams: Greedy Policy
When a viewer requests that a new stream be started, the controller 22 first determines the server and disk on which the starting block resides. The controller 22 then searches for a free slot in the disk schedule 42, beginning shortly after the pointer for the indicated server and disk, and progressing sequentially until it finds a free slot.
For example, suppose that a new stream request to play stream 9 arrives at the instant shown in FIG. 3, and that the controller 22 determines that the starting block for new stream 9 resides on disk 1 of server 2 (i.e., Server 2, Disk 1). Furthermore, suppose that the minimum insertion lead time is equal to one block service time, i.e., one slot width.
The controller begins searching for a free slot, starting at one slot width to the right of the pointer for disk 1 of server 2. This point is mid-way through a slot S4, so there is not sufficient width remaining in the slot for the stream to be inserted. The controller proceeds to the next slot S5 to the right, which is occupied by stream 1, and thus not available for the new stream 9. Similarly, the next slot S6 is occupied by stream 7. The next slot S7 is unoccupied, however, so the new stream 9 is inserted to this slot S7.
To reach slot S7, the new stream insertion request slips by over two slots. If the block service time is 100 ms, the schedule slip induces a startup delay of over 200 ms, since it will take this additional amount of time before disk 1 of server 2 reaches slot S7.
The interval between the time a new stream request is received and the time that the content is actually served is known as “latency”. It is desirable to minimize stream startup latency experienced by a user. The insertion method just described employs a “greedy policy”. For each new stream, the selected schedule slot is the slot that minimizes startup latency experienced by the requesting viewer. That is, the greedy policy grabs the first available slot and inserts the new stream request into that slot.
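The greedy search can be sketched as a linear scan over a circular schedule. This is a minimal illustration, assuming the schedule is held as a list with one entry per slot (None meaning free) and the pointer is expressed in slot widths; the function names and the pointer value of 3.5 (chosen so that one slot width to the right falls mid-way through S4) are assumptions.

```python
import math


def greedy_insert(schedule, pointer, min_lead_slots=1):
    """Return the index of the first free slot at or after the lead point.

    schedule: list with one entry per slot; None marks a free slot.
    pointer:  disk pointer position in slot widths (may be fractional).
    Returns None if the schedule is full.
    """
    num_slots = len(schedule)
    # A partially elapsed slot cannot be used, so round up to a slot boundary.
    first = math.ceil(pointer + min_lead_slots)
    for step in range(num_slots):
        slot = (first + step) % num_slots
        if schedule[slot] is None:
            return slot
    return None


def slip_in_slots(slot, pointer, num_slots, min_lead_slots=1):
    """Extra delay, in slot widths, beyond the minimum insertion lead."""
    return (slot - (pointer + min_lead_slots)) % num_slots


# Example from the text: S5 holds stream 1, S6 holds stream 7, S7 is free.
schedule = [None] * 24
schedule[5], schedule[6] = 1, 7
slot = greedy_insert(schedule, pointer=3.5)
print(slot, slip_in_slots(slot, 3.5, len(schedule)))
```

Multiplying the slip by the block service time gives the added startup delay; with a 100 ms block service time, a slip of more than two slots corresponds to more than 200 ms.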
The greedy policy has the desirable property of minimizing the mean startup latency over all stream insertions and all schedule loads. Early users in the schedule experience very short latencies. Unfortunately, latecomers to the schedule (i.e., the last few requests in an almost fully loaded schedule) experience excessive latencies while the controller is seeking to find an open slot.
Large startup latencies at high loads are caused by the presence in the schedule of large clusters of contiguously allocated slots. For instance, suppose in FIG. 3 that slots S0–S18 are filled and a new request is received for a server and disk whose pointer is currently referencing slot S0. The 18-slot slippage causes excessive latency in comparison to the above example of a 2-slot slippage.
Some amount of schedule clustering is virtually unavoidable; however, the greedy algorithm has a strong tendency to grow clusters for two reasons. First, the likelihood of a schedule insertion in the slot immediately following a cluster is proportional to the length of that cluster, so long clusters tend to grow longer. Second, two clusters near each other will be joined into a single cluster when the intervening slots are filled. Because of this second phenomenon, startup latency grows much faster than linearly as the schedule load approaches unity.
Mean latency may not be an appropriate metric for evaluating user satisfaction. Mean behavior measures the aggregate effect of many schedule insertions, but each viewer experiences a startup latency corresponding to a single insertion. A user who experiences the annoyance of an extraordinarily long delay is unlikely to be appeased by the knowledge that a large number of other users were serviced in a far more timely fashion. In addition, user satisfaction does not vary linearly with response time. For instance, the benefit from reducing one viewer's startup latency from ten seconds to one second exceeds the total benefit from reducing ten viewers' startup latencies from two seconds to one second.
Scheduling New Streams: Thrifty Policy
Thrifty scheduling attempts to improve perceived system responsiveness by reducing startup latencies that are relatively high at the expense of increasing startup latencies that are relatively low, even if doing so increases the mean startup latency. The thrifty policy accepts any startup latency not exceeding a given value. The thrifty policy is greedy in reducing startup latency in excess of this acceptable value, but it may sacrifice latency within the acceptable range for the sake of reducing the latency of later schedule insertions.
The thrifty policy is fairly straightforward. When a new stream is requested, it examines all available slots within the acceptable range and chooses the slot that minimizes the clustering in the schedule, as determined by a metric that quantifies the degree of clustering. In the event of a tie, or if no slots are available within the acceptable range, the thrifty policy selects the slot that results in the lowest startup latency.
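The selection rule can be sketched as follows. The clustering metric is not specified in this section, so the sketch substitutes an assumed stand-in: the sum of the squared lengths of contiguous runs of occupied slots. The acceptable range is likewise modeled as a count of candidate slots after the minimum lead point. All names are hypothetical, and this is not presented as the metric or method of the referenced patent.

```python
import math


def cluster_lengths(schedule):
    """Lengths of maximal runs of occupied slots in a circular schedule."""
    n = len(schedule)
    if all(entry is not None for entry in schedule):
        return [n]
    # Start just after a free slot so no run is split by the wrap-around.
    start = next(i for i, entry in enumerate(schedule) if entry is None)
    runs, run = [], 0
    for step in range(1, n + 1):
        if schedule[(start + step) % n] is not None:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    return runs


def clustering_metric(schedule):
    """Assumed stand-in metric: sum of squared cluster lengths."""
    return sum(length * length for length in cluster_lengths(schedule))


def thrifty_insert(schedule, pointer, acceptable_slots, min_lead_slots=1):
    """Pick a slot: least clustering within range, else lowest latency."""
    n = len(schedule)
    first = math.ceil(pointer + min_lead_slots)
    free = [(step, (first + step) % n) for step in range(n)
            if schedule[(first + step) % n] is None]
    if not free:
        return None
    in_range = [item for item in free if item[0] < acceptable_slots]
    if not in_range:
        return free[0][1]          # nothing acceptable: fall back to greedy

    def cost(item):
        step, slot = item
        trial = list(schedule)
        trial[slot] = "new"
        return clustering_metric(trial), step   # ties go to lower latency

    return min(in_range, key=cost)[1]


schedule = [None] * 24
schedule[5], schedule[6] = 1, 7
print(thrifty_insert(schedule, pointer=3.5, acceptable_slots=6))
```

Under these assumptions the policy skips the slot adjacent to the existing two-slot cluster and takes the next one, trading one slot of latency for a less clustered schedule.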
The thrifty policy for the centralized file server system is described in U.S. Pat. No. 5,642,152, entitled “Method and System for Scheduling the Transfer of Data Sequences Utilizing an Anti-Clustering Scheduling Algorithm,” which was filed on Dec. 6, 1994 and issued on Jun. 24, 1997, in the names of Douceur and Bolosky. This patent is assigned to Microsoft Corporation and incorporated by reference.
The thrifty policy described in the '152 patent makes several demands on the system. For instance, calculation of the clustering metric requires access to the entire schedule. This is not a problem for the centralized file server system because the complete schedule is kept at the central controller 22. Another constraint in the centralized case is that the new stream requests are not queued. When a new stream is requested, it is assigned to the appropriate slot upon request, rather than being queued for later insertion. While these constraints are acceptable in the centralized case, they cannot be supported in the distributed case.
Distributed Disk Scheduling
In the centralized file server system described above, the controller 22 maintains the entire schedule for all data servers 24. In a second design, there is no one complete schedule. Instead, the schedule is distributed among all of the data servers 24 in the system, such that each server holds a portion of the schedule but, in general, no server holds the entire schedule.
The disk schedule in the distributed system is conceptually identical to the disk schedule in the centralized system. However, the disk schedule is implemented in a very different fashion because it exists only in pieces that are distributed among the servers. Each server holds a portion of the schedule for each of its disks, wherein the schedule portions are temporally near to the schedule pointers for the server's associated disks. The length of each schedule portion dynamically varies according to several system configuration parameters, but typically is about three to four block play times long. In addition, each item of schedule information is stored on more than one server for fault tolerance purposes.
Periodically, each server sends a message to the next server in sequence, passing on some of its portions of the schedule to the next server that will need that information. This schedule propagation takes the form of messages called “viewer state records”. Each viewer state record contains sufficient information for the receiving server to understand what actions the receiving server must perform for the schedule entry being passed. This information includes the destination of the stream, a file identifier, the viewer's position in the file, the temporal location in the schedule, and some bookkeeping information.
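A viewer state record might be modeled as a small immutable structure. The field names and types below are assumptions chosen to mirror the items listed above; the contents of the bookkeeping information are not described in this section, so it is left as an opaque dictionary. The sketch also assumes that a record's schedule position is unchanged when it is passed on, and that only the viewer's position in the file advances.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ViewerStateRecord:
    """Hypothetical layout of one schedule entry passed between servers."""
    destination: str        # where the stream is delivered
    file_id: str            # which data file is being played
    block_index: int        # viewer's position in the file
    schedule_time: float    # temporal location in the schedule
    bookkeeping: dict = field(default_factory=dict)  # unspecified extras


def forward(record):
    """Build the record handed to the next server in sequence.

    Assumes the next server serves the following block of the same file
    at the same schedule position.
    """
    return replace(record, block_index=record.block_index + 1)


record = ViewerStateRecord("client-9", "file-A", 0, 3.5)
print(forward(record))
```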
U.S. Pat. No. 5,867,657, entitled “Distributed Scheduling in a Multiple Data Server System,” which was filed Jun. 6, 1996, and issued Feb. 2, 1999 in the names of Bolosky and Fitzgerald, describes a method for distributing the schedule management among the data servers 24. This patent is assigned to Microsoft Corporation and incorporated by reference. In this document, the file server described in this U.S. Patent is generally referred to as a “distributed file server system”.
The distributed file server system employs the greedy policy to handle new stream requests. When a request to insert a new data stream is received at the controller, it notifies the data server 24 that holds the starting block of the new stream request. The data server adds the request to a queue of pending service requests.
The data server then evaluates its own portion of the schedule to decide whether an insertion is possible. Associated with each schedule slot in the distributed schedule is a period of time, known as an “ownership period”, that leads the slot by some amount. The server whose disk points to the ownership period in the schedule is said to own the associated slot. The ownership period leads the associated slot by somewhat more than a block service time. This lead ensures that the data server that schedules a new stream for a slot has sufficient time for processing and communication, as well as for reading the data from the disk.
When a server obtains ownership of a slot, the server examines the slot to determine whether the slot is available to receive the new data stream. If it is, the server removes the request from the queue and assigns the stream to the slot. This assignment is performed by generating a viewer state record according to the information in the stream request. This viewer state record is treated in the same manner as a viewer state record received from a neighboring server.
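The queue-and-ownership behavior can be sketched as a small class. This is an illustration under assumptions, not the distributed file server's implementation: the class and method names are hypothetical, the locally visible schedule portion is modeled as a dictionary from slot to stream, and the creation of the viewer state record is reduced to recording the stream in the slot.

```python
from collections import deque


class DataServerScheduler:
    """Greedy insertion on one data server using only local information."""

    def __init__(self):
        self.pending = deque()      # queued new-stream requests, oldest first
        self.local_schedule = {}    # slot -> stream id, local portion only

    def request(self, stream_id):
        """Queue a new stream whose starting block resides on this server."""
        self.pending.append(stream_id)

    def on_ownership(self, slot):
        """Called when this server gains ownership of `slot`.

        Greedy rule: if the slot is free and a request is waiting, insert
        the oldest request. Returns the inserted stream id, or None.
        """
        if slot in self.local_schedule or not self.pending:
            return None
        stream_id = self.pending.popleft()
        self.local_schedule[slot] = stream_id
        return stream_id


server = DataServerScheduler()
server.local_schedule = {5: 1, 6: 7}    # slots already taken by streams 1 and 7
server.request(9)
for slot in (5, 6, 7, 8):
    print(slot, server.on_ownership(slot))
```

Because the server sees one owned slot at a time, the only decision available at each step is whether to insert into that slot.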
While the greedy policy is effective for the distributed file server system, it possesses the same drawbacks as described above in the context of the centralized file server system. Namely, the greedy policy minimizes the mean startup latency over all stream insertions and all schedule loads at the undesirable expense of having later users experience excessive latencies.
It would be beneficial to adopt a thrifty policy for use on the distributed file server system. However, the distributed schedule complicates the thrifty policy in several ways. First, since only a portion of the schedule is visible to each data server at any time, the scheduling technique must make decisions based upon purely local data. Second, since a data server owns only one slot at a time, the scheduling technique cannot decide exactly where in the schedule to insert a new stream; it can decide only whether or not to insert the new stream into the currently owned slot. Furthermore, since a data server may not schedule a stream as soon as it receives the start play request, multiple requests can accumulate in its pending service queue, and the scheduling algorithm will need to account for these queued stream requests in addition to the streams already in the schedule.
Accordingly, there is a need to develop a thrifty scheduling policy that can be implemented in a distributed file server system.