A continuous media file server system is designed to serve continuous data streams, such as audio and video data files, to multiple clients. As an example, a file server system might simultaneously supply multiple digital data streams, each in the 1-10 megabits-per-second (Mb/s) range, to thousands of clients.
General Architecture
FIG. 1 shows a continuous media file server system 20 developed by Microsoft Corporation. The file server system is a distributed, scalable, and fault-tolerant server that can serve many continuous data streams simultaneously to a large number of clients. The file server system 20 has a central controller 22 connected to multiple data servers 24(1), 24(2), 24(3), . . . , 24(K) via a low bandwidth control network 26. The controller 22 receives requests from clients, such as requests for starting and stopping a particular data file. The controller 22 is responsible for initiating delivery of streaming content to the requesting clients, including such tasks as locating the data server that holds the first block of data in the requested data file. The controller and data servers can be implemented, for example, as general purpose computers.
Each data server 24 supports at least one storage disk, as represented by storage disks 28(1), 28(2), . . . , 28(M) connected to data server 24(1). The disks 28 are attached to their respective data server 24 via one or more buses 30 (e.g., SCSI, Fiber Channel, EIDE, etc.). The number and configuration of storage disks is flexible, but within a given file server 20, all data servers 24 support the same number of storage disks 28. The storage disks can store large amounts of digital data, with example disk capacities of many Gigabytes. The storage capacity of the entire media file server 20 consists of the usable storage space on the storage disks. An operator can change the storage capacity of the file server by adding or removing one or more storage disks to or from each data server, or adding or removing one or more of the data servers to which the disks are connected.
The data servers 24 are connected to a high-speed network switch 32 via network interfaces 34 (e.g., network card). The network switch 32 takes the data segments read from the storage disks, orders them into a continuous stream, and distributes the streams over a network to the clients. The network switch 32 also provides high bandwidth, parallel communication between the data servers 24. Additionally, the controller 22 may be connected to the data servers 24 through the network switch 32, as opposed to a separate control network 26. As an example, the network switch 32 can be implemented using fiber optics and ATM (Asynchronous Transfer Mode) switches.
Each data server 24 contains a memory buffer, as represented by buffer 36 in data server 24(1). The buffer 36 temporarily stores data that is read from the disks 28(1)-28(M) and is to be output to the network switch 32.
The continuous media file server system 20 can be implemented in different contexts. For instance, the file server system 20 might function as a head end server in an interactive television (ITV) system which serves audio and video files over a distribution network (e.g., cable, satellite, fiber optic, etc.) to subscriber homes. The file server system 20 might alternatively operate as a content provider that distributes data files over a network (e.g., Internet, LAN, etc.) to multiple client computers.
Data Striping
It is likely that some pieces of content will be more popular than others. For example, the top 10% of movies ordered by popularity might garner 70% of the load, while the remaining 90% of the content attracts only 30% of the viewers. To avoid disproportionate use of storage disks 28 and data servers 24 (i.e., by overburdening the disks and data servers holding popular content while leaving other disks and data servers underutilized), the continuous media file server system 20 stripes all of the data files across all of the storage disks 28 and all of the data servers 24. When a client requests a data stream, all data servers 24 share in the distribution of that stream, each supplying a portion of the data stream in turn. In this way, the load is spread over all of the storage disks 28 and data servers 24 regardless of the data file's popularity.
Prior to this invention, the data streams were served at a constant data transmission bit rate. With this assumption, each data file could be broken into "blocks" of fixed temporal width. A block represented the amount of physical space allocated on a disk to hold one time unit of data, and could be expressed in terms of bytes. The temporal duration required to play the data in the block is known as a "block play time". For a data rate of 1 Mb/s, for example, the block size might be 1 Megabit and the block play time might be one second. In the conventional file server, a single block play time is established for all data files, resulting in a fixed-size data block.
FIG. 2 shows an example file server disk array 40 consisting of six data servers 0-5, each supporting two storage disks. Each disk stores data blocks, as represented by the labeled rectangles such as "A0", "A6", etc. Data files are striped across every storage disk of every server. For each data file, a starting disk is chosen to hold the first data block. For instance, the first block of data file A, designated as block "A0", is stored on disk 0 of data server 0. A server index is incremented, and the next block in the file (i.e., block "A1") is placed on disk 0 of server 1. The striping continues across the first disks of each server.
When the last server 5 is reached, the striping pattern wraps and continues with the next disks of each server. More specifically, when the server index reaches the number of servers in the system, a disk index is incremented (modulo the number of disks per server) and the server index is reset to 0. In FIG. 2, after data block A5 is placed on disk 0 of server 5, the next block in the file (i.e., block "A6") is placed on disk 1 of server 0. Block A7 is then placed on disk 1 of server 1, and so on. This process continues until all the data blocks of the video file have been assigned to disks.
The process is then repeated for each subsequent data file. Typically, the striping pattern starts the various data files on different starting disks. In FIG. 2, two data files A and B are shown. Data file A begins on disk 0 of server 0, and data file B begins on disk 0 of server 1.
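The placement rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `block_location` and its parameters are hypothetical, and the sketch assumes the pattern wraps back to the first disk of the first server once every disk has been visited.

```python
def block_location(block_index, start_server, start_disk,
                   num_servers, disks_per_server):
    """Return (server, disk) holding the given block of a striped file.

    Hypothetical helper: blocks advance across the same-numbered disk of
    each server in turn, then move on to the next disk of each server.
    """
    # Linear position of the file's first block in striping order:
    # disk 0 of every server first, then disk 1 of every server, and so on.
    start = start_disk * num_servers + start_server
    position = (start + block_index) % (num_servers * disks_per_server)
    server = position % num_servers
    disk = position // num_servers
    return server, disk


# FIG. 2 layout: six servers with two disks each.
print(block_location(0, 0, 0, 6, 2))  # block A0 -> (0, 0)
print(block_location(6, 0, 0, 6, 2))  # block A6 -> (0, 1)
print(block_location(5, 1, 0, 6, 2))  # block B5 -> (0, 1)
```

Each result is a (server, disk) pair, so block A6 lands on disk 1 of server 0, matching the wrap described for FIG. 2.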
The striping pattern generally prescribes that the data blocks are sequentially ordered across ordered disks, but the sequential blocks need not reside at the same physical block address on adjacent disks. For instance, the striping pattern of files A and B results in the storage of sequential blocks B3 (disk 0, server 4) and B4 (disk 0, server 5) at different physical locations on the two disks (location 3 for block B3 and location 2 for block B4). Accordingly, sequential data blocks can reside at entirely different physical block locations within the contiguous disks. The block locations in the disk array are described by file metadata that is stored either in memory or on disk. It is noted that other patterns are possible.
To play a data file, the file server system 20 serves the data blocks sequentially from the storage disks, one block at a time. The data blocks are read from each disk, stored temporarily in buffer memory 36 at the server 24, and transmitted to the network switch 32 in order. When file A is requested by a client, for example, block A0 is read from disk 0 (server 0) and transmitted via server 0 to the network switch for the duration of a block play time. Next, block A1 is read from disk 0 (server 1) and transmitted via server 1 to the network switch for the duration of a block play time. The striping arrangement enables continuous and ordered cycling of the servers (i.e., server 0, server 1, . . . , server 5, server 0, etc.), and the disks attached to the server (i.e., disk 0, disk 1, disk 0, etc.). The network switch sequences among the servers to output a continuous data stream A to the requesting client.
Declustered Mirroring
Over time, components are expected to fail. To anticipate this possibility, the file server system 20 employs a data mirroring technique in which the primary data is duplicated and the redundant copy is also maintained on the disks. The data mirroring is illustrated conceptually in FIG. 2, wherein the disks are divided in half with the upper half of the disks storing the primary data and the lower half of the disks storing redundant data.
The two copies of each file are stored on separate servers, in case an entire server or disk fails. One way of accomplishing this is to store all of the data from server 0's disks redundantly on server 1's disks, all of the data from server 1's disks redundantly on server 2's disks, and so on. However, if server 0 were to fail in this arrangement, the workload of server 1 would double because it would have to support its original distribution of video data plus the distribution of video data for server 0. If each server is configured to support twice its workload, the servers are using only half of their resources during normal operation when there are no failures in the system.
To avoid this inefficiency, each block of the redundant data is split into multiple pieces, and the pieces are distributed among the disks of multiple servers. This process is known as "declustering", and the number of pieces into which each block is split is known as the "decluster factor".
FIG. 2 shows a disk configuration with a decluster factor of two, meaning there are two redundant pieces for every primary data block. The data for server 0's disks are stored redundantly on the disks of servers 1 and 2; the data for server 1's disks are stored redundantly on the disks of servers 2 and 3; and so on. With a decluster factor of two, the mirror half of the storage disks can be further conceptualized as having two regions: a first region to store the first redundant piece (i.e., X.1) and a second region to store the second redundant piece (i.e., X.2). As an example, primary data block A0 (disk 0, server 0) is split into two redundant pieces "A0.1" and "A0.2" in which the first redundant piece A0.1 is stored in region 1 of disk 0 of server 1 and the second redundant piece A0.2 is stored in region 2 of disk 0 of server 2.
If the server carrying the primary data fails, the mirrored data on the other servers is used. Suppose, for example, that server 0 fails. When it comes time to serve data block A6 (originally on disk 1, server 0), server 1 reads and outputs the first redundant piece A6.1 and server 2 reads and outputs the second redundant piece A6.2.
The declustered mirroring technique results in a more even distribution of increased workload among the operable servers in the event that one server (or disk) fails. This is because when a component fails, several other servers share the work of making up for the failed component. In our example of a small decluster factor of two, the increased burden to a data server is only fifty percent (i.e., its own workload and half of the failed server's workload), rather than a doubling of workload that would be needed in the absence of declustering. As the decluster factor increases, the additional burden shared by the non-failed servers is reduced.
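As a rough sketch of the declustering arrangement, the following code locates the redundant pieces of a block and computes the extra workload each covering server picks up after a failure. The function names are hypothetical. The sketch assumes that piece k is stored on the k-th following server (wrapping past the last server), on the same-numbered disk as the primary block, and that all servers carry equal workloads. The same-disk and wraparound rules are inferred from the A0 example rather than stated as a general rule.

```python
def mirror_piece_locations(primary_server, primary_disk,
                           num_servers, decluster_factor):
    """Return [(piece_number, server, disk), ...] for one primary block.

    Assumption: piece k goes to the k-th following server, same disk number.
    """
    return [
        (piece, (primary_server + piece) % num_servers, primary_disk)
        for piece in range(1, decluster_factor + 1)
    ]


def failure_load_increase(decluster_factor):
    """Fraction of extra work taken on by each server covering a failure.

    Assumption: the failed server's workload is split evenly among
    decluster_factor servers that each carry the same normal workload.
    """
    return 1 / decluster_factor


# Block A0 on disk 0 of server 0, six servers, decluster factor of two.
print(mirror_piece_locations(0, 0, 6, 2))  # [(1, 1, 0), (2, 2, 0)]
print(failure_load_increase(2))            # 0.5
```

With a decluster factor of two the result is 0.5, the fifty percent increase described above. A larger factor yields a smaller fraction per server.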
Centralized Disk Scheduling
Due to the striping arrangement and disk configuration shown in FIG. 2, all servers share in the distribution of a data stream, each supplying the ordered blocks of data in turn. This shared operation requires a mechanism to determine when each server should provide data for each stream. Such a mechanism is provided by a time-ordered schedule that specifies, for each server 24, when to read each block of data from disk and when to transmit this data over the network 32.
In one prior implementation, the file server system 20 relies on a centralized scheduler that is maintained by the central controller 22 (FIG. 1). With a centralized scheduler, the controller 22 periodically sends messages to the servers 24, telling them what operations to perform in the near future. The schedule is defined to guarantee that, once streams are admitted, they can be serviced in a deterministic fashion to ensure availability of system resources when needed to distribute the streams. Thus, the schedule serves both as a description of when data is to be read and transmitted and also as an indication of resource allocation. There are three main resources that are allotted to the data streams: disk bandwidth, network bandwidth, and buffer memory.
The schedule for a single-rate file server is one of disk operations, and hence is referred to as a "disk schedule". The temporal length of the disk schedule is the block play time multiplied by the number of disks in the system. In the FIG. 2 example with 12 disks and a block play time of one second, the disk schedule has a temporal length of 12 seconds.
FIG. 3 shows a disk schedule 42 for a six-server, two-disk file system. The disk schedule 42 is divided into time slots 44, the width of which is determined by the amount of time necessary to service a single data block, a duration known as the "block service time". This time is equal to the block play time divided by the number of streams that can be supported per disk. If the stream distribution capacity of a particular instance of the file server 20 is limited by disk performance, the block service time is equal to the time to read one block of data from the disk, including both seek time and data transfer time. Alternatively, if the stream distribution capacity of a particular instance of the file server 20 is limited by some other factor, such as network performance or I/O bus bandwidth, the block service time is calculated as the block play time divided by the number of supported streams per server, with the result multiplied by the number of disks per server (i.e., the block play time divided by the number of supported streams per disk).
In FIG. 3, the block service time of the schedule 42 is one-half of the block play time (i.e., 1/2 second), indicating that each disk can support two data streams. Accordingly, each slot 44 is one-half second in duration, yielding twenty-four slots 44 in the twelve second disk schedule 42. In this example, the block service time is atypically high for ease of illustration. More typically, a disk can support between 5 and 20 data streams, depending upon the data transmission rate, resulting in a much lower block service time.
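The arithmetic relating these quantities can be checked with a short sketch. The function name and parameters are hypothetical. Exact fractions are used so that the half-second slot width does not introduce rounding.

```python
from fractions import Fraction


def disk_schedule_parameters(num_disks, block_play_time, streams_per_disk):
    """Return (schedule_length, block_service_time, num_slots).

    Times are in seconds. Hypothetical helper following the definitions
    in the text: schedule length = block play time x number of disks,
    block service time = block play time / streams per disk.
    """
    play = Fraction(block_play_time)
    schedule_length = play * num_disks
    block_service_time = play / streams_per_disk
    num_slots = schedule_length / block_service_time
    return schedule_length, block_service_time, num_slots


# FIG. 3 example: 12 disks, one-second blocks, two streams per disk.
length, service, slots = disk_schedule_parameters(12, 1, 2)
print(length, service, slots)  # 12 1/2 24
```

Raising the number of streams per disk to a more typical value, such as 10, shortens the slot to one tenth of a second and yields 120 slots in the same twelve-second schedule.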
Each server's workload is kept low enough that there is sufficient remaining capacity for reading and transmitting declustered redundant blocks, in the event that a neighboring server fails. This is accomplished by increasing the block service time to allow for this additional workload. The exact factor by which this is increased depends upon the limiting resource in the system, but it is typically somewhat greater than 1/(decluster factor).
Requests for data files are assigned a slot in the schedule 42. Here, nine data streams 0-8 are presently scheduled. In theory, the disk schedule 42 determines when the disk read operations on each server are performed for each stream 0-8. In practice, disk reads are generally performed earlier than the scheduled times, although the lead time is bounded by a system configuration parameter. Network operations are not explicitly scheduled; rather, the beginning of each data transmission immediately follows the scheduled completion of the disk read.
As shown in FIG. 3, there is a pointer into the schedule 42 for each disk of each server, spaced at intervals of one block play time. The pointers are labeled in FIG. 3 as, for example, "Server 3, Disk 1" to reference the appropriate server and disk. The pointers move to the right in this illustration, while the schedule 42 remains stationary. Every twelve seconds, each pointer winds up back where it started. At the instant shown in FIG. 3, disk 1 of server 3 is scheduled to be in progress of reading a data block for stream 5, disk 1 of server 1 is scheduled to read a block for stream 1, disk 0 of server 3 is scheduled to read a block for stream 3, and disk 0 of server 1 is scheduled to read a block for stream 4.
Even though data blocks are only being read for a fraction of the streams at any given time, data is being transmitted for all streams at all times. At the instant shown in FIG. 3, data is being transmitted for each stream from the server as indicated below:
______________________________________
Stream      Server      Disk
______________________________________
0           4           0
1           0           1
2           5           1
3           2           0
4           0           0
5           2           1
6           3           0
7           0           1
8           2           0
______________________________________
In the above table, server 0 is currently transmitting stream 1, while server 5 is concurrently transmitting stream 2, and so on. Notice also that while preceding servers are transmitting the data block, the next servers in order are reading the next data block from the disks. In this example, while server 0 is transmitting a block for stream 1, the next server 1 is currently reading the next block for stream 1. Server 1 will then transmit this next block following the transmission of the current block by server 0.
As time progresses, the controller 22 advances the pointers through the schedule 42, leading the actual value of time by some amount that is determined by the system configuration parameter. This lead allows sufficient time for processing and communication, as well as for reading the data from the disk. When the pointer for a server reaches a slot that contains an entry for a stream, the controller 22 determines which block should be read for that stream, and it sends a message to the appropriate server. The message contains the information for the server to process the read and transmission, including the block to be read, the time to begin the transmission, and the destination of the stream.
When a viewer requests that a new stream be started, say stream 9, the controller 22 first determines the server and disk on which the starting block resides. The controller 22 then searches for a free slot in the disk schedule 42, beginning shortly after the pointer for the indicated server and disk, and progressing sequentially until it finds a free slot.
For example, suppose that a new stream request arrives at the instant shown in FIG. 3, and that the controller 22 determines that the starting block for new stream 9 resides on disk 1 of server 2. Furthermore, suppose that the minimum insertion lead time is equal to one block service time, i.e., one slot width. The controller begins searching for a free slot, starting at one slot width to the right of the pointer for disk 1 of server 2. This point is mid-way through a slot S1, so there is not sufficient remaining width in the slot for the stream to be inserted. The controller proceeds to the next slot S2 to the right, which is occupied by stream 1. Thus, slot S2 is not available for the new stream 9. Similarly, the next slot S3 is occupied by stream 7, so the new stream 9 is inserted to the right of this slot, at slot S4. The viewer experiences a stream startup delay that is proportional to the temporal distance passed in the search for a free slot, which is kept to a minimum.
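The search just described can be sketched as follows. This is a simplified illustration, not the controller's actual procedure. The function name, the use of integer milliseconds, and the slot indices in the example are all hypothetical. The sketch assumes that a slot is usable only if the search point falls at or before the start of that slot, and that the search wraps around the end of the schedule.

```python
def find_free_slot(pointer_ms, min_lead_ms, slot_width_ms,
                   occupied, num_slots):
    """Return the index of the first free slot for a new stream, or None.

    pointer_ms  -- current position of the starting disk's pointer
    min_lead_ms -- minimum insertion lead time
    occupied    -- set of slot indices that already hold a stream
    """
    earliest_ms = pointer_ms + min_lead_ms
    # A partially elapsed slot has insufficient remaining width, so round
    # up to the next slot boundary (ceiling division on integers).
    first_slot = -(-earliest_ms // slot_width_ms)
    for offset in range(num_slots):
        slot = (first_slot + offset) % num_slots
        if slot not in occupied:
            return slot
    return None  # every slot is taken; the stream cannot be admitted yet


# Hypothetical numbers mirroring the example: half-second slots, a lead of
# one slot width, a search point mid-way through slot 7 (S1), and slots
# 8 (S2) and 9 (S3) occupied. The new stream lands in slot 10 (S4).
print(find_free_slot(3250, 500, 500, {8, 9}, 24))  # 10
```

The startup delay experienced by the viewer grows with the number of slots skipped before a free one is found.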
Buffer Usage
When the disk read is performed, the data is transferred from the disk 28 into buffer memory 36 using direct memory access (DMA). Subsequently, the server performs a network transmission in which the data is transferred from buffer memory 36 to the network interface 34. As a result, buffer memory is required for each block from the beginning of the block read to the completion of the block transmission.
FIG. 4 shows the buffer utilization. Suppose the disk read is scheduled to read a block at time T1, as shown in the time line labeled "Disk Schedule". As mentioned above, the read may begin sooner within some Max Lead Time before the scheduled read, which is set as a system parameter. Accordingly, the earliest that a disk might be read is at time T0, as indicated in the time line labeled "Earliest Disk Usage."
Prior to the beginning of the disk read, no buffer memory for the stream is required. The curve in the chart labeled "Buffer Usage" is thus at zero prior to the earliest possible read time at T0. Buffer memory is allocated just before the disk read occurs (i.e., on or just before T0), as indicated by the steep upward step in the buffer usage curve to some X Mbytes.
Upon conclusion of the scheduled read time (i.e., time T2), the data is transmitted from the buffer memory 36 to network interface 34. The data is output during a block transmission time, as indicated by the time line labeled "Network Usage". The buffer memory is deallocated after the network transmission completes, as indicated by the steep downward step at time T3.
Since there is a bounded lead between the actual disk read and the scheduled disk read, and there is a fixed lag between the scheduled disk read and the network transmission, the usage of buffer memory is completely determined by the disk schedule. Thus, a single schedule serves to allocate disk, network, and buffer usage.
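To illustrate how the buffer interval follows from the schedule, the sketch below derives the worst-case allocation and release times for one block. The function name is hypothetical. The sketch assumes that the scheduled read occupies one block service time and that the transmission lasts one block play time. Neither duration is stated explicitly for FIG. 4.

```python
def buffer_hold_interval(scheduled_read_start, block_service_time,
                         block_play_time, max_lead_time):
    """Return (allocate_time, release_time) for one block's buffer.

    Worst case: the read starts a full max_lead_time early. All arguments
    are in seconds. The durations used here are assumptions (see lead-in).
    """
    earliest_read = scheduled_read_start - max_lead_time      # time T0
    read_done = scheduled_read_start + block_service_time     # time T2
    transmit_done = read_done + block_play_time               # time T3
    return earliest_read, transmit_done


# Hypothetical numbers: read scheduled at t = 10 s, half-second service
# time, one-second block play time, two-second maximum lead.
start, end = buffer_hold_interval(10.0, 0.5, 1.0, 2.0)
print(start, end, end - start)  # 8.0 11.5 3.5
```

Under these assumptions, the longest time a buffer is held equals the maximum lead time plus one block service time plus one block play time.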
U.S. Pat. No. 5,473,362, entitled "Video on Demand System Comprising Stripped (sic) Data Across Plural Storable Devices With Time Multiplex Scheduling," which was filed on Nov. 30, 1993 and issued on Dec. 5, 1995, in the names of Fitzgerald, Barrera, Bolosky, Draves, Jones, Levi, Myhrvold, Rashid and Gibson, describes the striping and scheduling aspects of the continuous media file server 20 in more detail. This patent, which is assigned to Microsoft Corporation, is incorporated by reference. In this document, the file server described in U.S. Pat. No. 5,473,362 is generally referred to as a "centralized single-rate file server system".
Distributed Disk Scheduling
The server system described above has a centralized schedule maintained at the controller 22. In a second design, the schedule is distributed among all of the data servers 24 in the system, such that each server holds a portion of the schedule but, in general, no server holds the entire schedule.
The disk schedule in the distributed system is conceptually identical to the disk schedule in the centralized system. However, the disk schedule is implemented in a very different fashion because it exists only in pieces that are distributed among the servers. Each server holds a portion of the schedule for each of its disks, wherein the schedule portions are temporally near to the schedule pointers for the server's associated disks. The length of each schedule portion dynamically varies according to several system configuration parameters, but typically is about three to four block play times long. In addition, each item of schedule information is stored on more than one server for fault tolerance purposes.
Periodically, each server sends a message to the next server in sequence, passing on some of its portions of the schedule to the next server that will need that information. This schedule propagation takes the form of messages called "viewer state records". Each viewer state record contains sufficient information for the receiving server to understand what actions the receiving server must perform for the schedule entry being passed. This information includes the destination of the stream, a file identifier, the viewer's position in the file, the temporal location in the schedule, and some bookkeeping information. For reasons of fault tolerance, viewer state records are forwarded not only to the next server in sequence but also to the server following that one, so that, in case the next server has failed, the viewer state record will not be lost. This strategy implies that duplicate viewer state records are often received, which are dealt with simply by ignoring them.
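A minimal sketch of this propagation scheme follows. The class names, field names, and types are hypothetical. The actual record layout, including its bookkeeping information, is not given here. Duplicate detection is simplified to ignoring a record that is identical to one already held.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewerStateRecord:
    """Hypothetical record carrying one schedule entry between servers."""
    destination: str      # where the stream is being sent
    file_id: int          # which data file is playing
    file_position: int    # viewer's position in the file (block number)
    schedule_slot: int    # temporal location in the schedule


def forward_targets(server, num_servers):
    """Next server in sequence plus the one after it, for fault tolerance."""
    return [(server + 1) % num_servers, (server + 2) % num_servers]


class DataServer:
    """Holds the records received so far and ignores duplicates."""

    def __init__(self):
        self.records = set()

    def receive(self, record):
        """Store the record; return False if it was a duplicate."""
        if record in self.records:
            return False
        self.records.add(record)
        return True


record = ViewerStateRecord("client-a", file_id=1, file_position=7,
                           schedule_slot=3)
server = DataServer()
print(forward_targets(4, 6))   # [5, 0]
print(server.receive(record))  # True
print(server.receive(record))  # False (duplicate ignored)
```

Because every record is sent to two successors, a server normally sees each entry twice, and the second copy is dropped without effect.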
When a request to insert a new data stream is received at the controller, the controller notifies the data server that holds the starting block of the new stream request. The data server then evaluates its own portion of the schedule to decide whether an insertion is possible. Associated with each schedule slot in the distributed schedule is a period of time, known as an "ownership period", that leads the slot by some amount. The server whose disk pointer points to the ownership period in the schedule is said to own the associated slot. The ownership period leads the associated slot by somewhat more than a block service time. This lead ensures that the server that schedules a new stream for a slot has sufficient time for processing and communication, as well as for reading the data from the disk.
When a server obtains ownership of a slot, the server examines the slot to determine whether the slot is available to receive the new data stream. If it is, the server assigns the stream to the slot. This assignment is performed by generating a viewer state record according to the information in the stream request. This viewer state record is treated in the same manner as a viewer state record received from a neighboring server.
U.S. patent application Ser. No. 08/684,840, entitled "Distributed Scheduling in a Multiple Data Server System," which was filed Jun. 6, 1996, in the names of Bolosky and Fitzgerald, describes a method for distributing the schedule management among the data servers 24. This application is assigned to Microsoft Corporation and is incorporated by reference. In this document, the file server described in this U.S. Patent Application is generally referred to as a "distributed single-rate file server system".
Multi-Rate Media Distribution
An assumption underlying the prior art architecture of the media file server system 20 is that all data streams have the same data rate. However, in practice, various data streams have different data rates. For example, in video data, the amount of visual information varies greatly according to the content. High-action video, such as a sporting event, requires a greater amount of information per second in comparison to low-action video, such as a talking head. In some environments, users may wish to trade off picture quality versus cost, or perhaps some clients have access to higher-definition video-display devices than others. In addition, different content or transmission standards may also dictate different data rates.
For these reasons, it is desirable to provide a continuous media file server that can play multiple data streams at different data rates. One possible implementation is to configure the file server for the highest of several data rates, thereby accepting inefficient use of disk and network bandwidth for streams of lower data rates. For systems with few low-data-rate streams relative to the number of high-data-rate streams, this approach may be acceptable. In general, however, it results in an excessive waste of expensive resources.
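The cost of sizing every stream for the highest data rate can be estimated with a short calculation. The function name and the stream mixes are hypothetical. The sketch assumes that each stream is allotted bandwidth equal to the highest rate in the mix, regardless of its own rate.

```python
def bandwidth_utilization(stream_rates_mbps):
    """Fraction of allotted bandwidth actually used by the streams.

    Assumption: each stream is allotted the highest rate in the mix.
    """
    peak = max(stream_rates_mbps)
    allotted = peak * len(stream_rates_mbps)
    return sum(stream_rates_mbps) / allotted


# Mostly high-rate streams: little is wasted.
print(bandwidth_utilization([4, 4, 4, 1]))  # 0.8125
# Mostly low-rate streams: more than half of the allotment is wasted.
print(bandwidth_utilization([1, 1, 1, 4]))  # 0.4375
```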
Thus, there exists a need for a scheduling mechanism that allows the file server to simultaneously supply multiple data streams of differing data transmission rates while making efficient use of disk and network resources.