The principal requirements of a video server are the abilities to store multiple video files, as well as to continually stream any one of these files to any one of a server's multiple clients. A typical large-scale server will hold several hundred video files and be capable of streaming this content to several hundred simultaneous clients. In order for the clients to view the video without interruption, and without a large buffer required at each client site, the server must output each client's stream without interruption. Further, each client must have access to any video file on the server, so that, for example, every client could view the same file simultaneously, or each client could view a different file. Video servers may be capable of “VCR-like” functionality to display a video file in normal, fast-forward, or rewind mode. This functionality generates an additional requirement on the server that a user's viewing mode changes do not incur a long delay, such as, for example, during changes from “normal” mode to “fast-forward” mode.
DIVA Systems, Inc. of Redwood City, Calif. meets these video server requirements using a server design that stripes the multiple video files across an array of hard disk drives (hereinafter “disk drives”). In one type of server configuration, the server streams the video files at multiple constant bitrates (MCBR). Every video file on an MCBR server is streamed out at a constant bitrate, and that bitrate may differ among the files on the server. A given video file on an MCBR server is divided into constant sized segments called “extents,” with all the data in a given extent written contiguously on one hard disk in the server's disk drive array. The amount of time it takes to output one extent of a video file is called the “service period.” Since each extent in a given video file is the same size, and since the file is output at a constant bitrate, that service period is the same for each extent of a given video file. Accordingly, one design for an MCBR server makes the service period the same for each of the server's files. As such, if file A is illustratively output at twice the bitrate of file B, then file A's extent size is twice the size of file B's.
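The relationship between bitrate, service period, and extent size described above can be illustrated with a minimal sketch. The function name, the 2-second service period, and the two bitrates are hypothetical values chosen only for illustration; they do not come from any particular server design.

```python
def extent_size_bytes(bitrate_bps: int, service_period_s: float) -> int:
    """Bytes in one extent: bitrate (bits/s) times service period (s), divided by 8."""
    return int(bitrate_bps * service_period_s / 8)

# Assumed, illustrative values (not taken from the source text).
SERVICE_PERIOD_S = 2.0
BITRATE_B_BPS = 3_000_000               # file B
BITRATE_A_BPS = 2 * BITRATE_B_BPS       # file A is output at twice the bitrate of file B

size_b = extent_size_bytes(BITRATE_B_BPS, SERVICE_PERIOD_S)
size_a = extent_size_bytes(BITRATE_A_BPS, SERVICE_PERIOD_S)

# With a common service period, doubling the bitrate doubles the extent size.
assert size_a == 2 * size_b
```

Holding the service period fixed and letting the extent size vary with the bitrate is what keeps every file's extents on the same output schedule.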
In order to allow any (or all) of the server's clients to view a given video file at the same time, the extents are striped across the server's disk drive array. For example, suppose the disk drive array has D disk drives numbered 0 through D-1, and a given video file on the server has N extents numbered 0 through N-1. Then, if extent 0 is stored on disk drive J, extent 1 will be stored on disk drive J+1 (modulo D), extent 2 will be stored on disk drive J+2 (modulo D), and so on. In this manner a client viewing the file is logically “walked” around the disk drive array, reading one or a few extents at a time, outputting those extents, and then reading more. All supported users on a given server may view the same file because the file is not isolated on any one disk drive. Further, the server can support multiple clients viewing different files of different bitrates because all the clients “walk” around the disk drive array at the same rate and in sync, since the service period is the same for all of the server's files. Because disk drives occasionally fail, the data striping generally uses some form of RAID parity protection, so that the data of a failed disk drive can be regenerated if a single disk drive fails.
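The modulo-D placement rule above can be sketched as follows. This is a minimal illustration that ignores parity extents; the function name and the example values of D and J are assumptions made for the sketch.

```python
def drive_for_extent(first_drive: int, extent_index: int, num_drives: int) -> int:
    """Drive holding a given extent when extent 0 is on first_drive (J) and
    successive extents advance one drive at a time, modulo the array size (D)."""
    return (first_drive + extent_index) % num_drives

# Assumed example: D = 15 drives, file starts on drive J = 13.
D = 15
J = 13
layout = [drive_for_extent(J, i, D) for i in range(5)]

# The file wraps from the last drive (14) back to drive 0.
assert layout == [13, 14, 0, 1, 2]
```

Because every file advances one drive per service period, clients reading different files move around the array in step with one another.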
FIG. 1 depicts a disk drive array 100 having data striped in a RAID-3 format. Specifically, the top row of boxes represents each disk drive 102 in the array of disks (e.g., 15 disk drives D0 through D14). Furthermore, each box below each disk drive in the array of disks represents an extent of data 110₁ through 110ₚ (collectively extents 110). FIG. 1 illustratively shows two files, file A and file B, each 16 extents long, striped across a disk drive array consisting of fifteen disk drives total. The disk drive array 100 is illustratively broken into three parity groups 104₁ through 104₃ (collectively parity groups 104) of five disk drives each, with each parity group 104 respectively consisting of four data disk drives 106₁ through 106₃ (collectively data disk drives 106) and one parity disk drive 108₁ through 108₃ (collectively parity disk drive 108). For example, the first parity group 104 comprises the first four extents of file A (i.e., extents A0–A3) illustratively written onto disk drives D5–D8, plus the parity extent (i.e., the byte-by-byte XOR of these 4 data extents) written onto disk drive D9. In RAID 3, all files on the server use the same sized parity groups, so that certain disk drives in the array contain only parity data. In FIG. 1, the disk drives containing only parity data are disk drives 4, 9, and 14.
Reads from the RAID 3 formatted disk drive array 100 can proceed according to two different modes of operation. In a first mode of operation, a server must provide real-time correction of a failed extent read attempt without any delay. In this instance, all 5 extents 110 in a parity group 104 must be read simultaneously, so that a parity correction (i.e., using an XOR Boolean logic operation) can be performed immediately if any one of the 4 data extents is unsuccessfully read.
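The XOR parity correction referred to above can be illustrated with a short sketch. It assumes four equally sized data extents and one parity extent, as in FIG. 1; the helper name and the byte values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from functools import reduce

def xor_extents(extents):
    """Byte-by-byte XOR of a list of equally sized extents (bytes objects)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*extents))

# Four illustrative 8-byte data extents (real extents would be far larger).
data = [bytes([value] * 8) for value in (0x11, 0x22, 0x44, 0x88)]

# The parity extent is the XOR of the four data extents.
parity = xor_extents(data)

# Simulate a failed read of data extent 2: XOR the three surviving data
# extents with the parity extent to regenerate the missing one.
survivors = data[:2] + data[3:] + [parity]
recovered = xor_extents(survivors)

assert recovered == data[2]
```

The same operation serves both modes of operation; the modes differ only in whether the parity extent is read on every pass or only after a failure.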
In a second mode of operation, the server uses the parity data only for regeneration of data in the event of a failed disk drive. As such, the extents 110 can be read sequentially one-by-one, with extent “1” read one service period after extent “0”, and extent “2” read one service period after extent “1” and so forth, and with the parity extents 108 not read at all in normal operation.
One problem with the real-time correction mode of operation is that the parity data 108 is read from the disk drive array 100 even when the disk drive array has no failed disk drive and is experiencing no disk drive errors in general. As such, the disk drive-to-server bandwidth spent reading the parity disk drive is wasted, which adds to the cost of the ability to perform immediate correction of any failed read attempt.
Another problem associated with real-time parity correction is related to the amount of buffer memory on the server. In particular, the server must hold about twice as much memory as is required for reading extents sequentially, as in a RAID 5 format. This large memory capacity in the server is required in order to hold a full parity group's worth of user data at a time, rather than just 1–2 extents. Alternatively, the extent size could be decreased by about 50%, which would keep the total amount of server memory the same. However, reducing the extent size drastically impairs the efficiency of extracting data from the hard disk drives. Therefore, the price in terms of buffer memory and unused disk drive bandwidth in real-time parity correction in a RAID 3 format is substantial. Moreover, in either real-time or non-real-time modes during normal operation under the RAID 3 format, the dedicated parity disk drive does not provide useful bandwidth with respect to streaming data.
FIG. 2 illustratively depicts a disk drive array 200 having data striped in a RAID 5 format. Specifically, FIG. 2 shows the same two files “A” and “B” of FIG. 1 striped across a disk drive array 200 consisting of 12 disk drives (recall that in the RAID 3 example, 15 disk drives were illustratively used). The parity group 104 is the same size as in FIG. 1 (1 parity extent 108 for four data extents 106). For example, the first parity group 104 for the file “A” comprises data extents A0–A3 106 plus the single parity extent 108.
One distinction between the RAID 5 format of FIG. 2 and the RAID 3 format of FIG. 1 is that the RAID 5 format does not use dedicated parity disk drives. Referring to FIG. 1, every file in the RAID 3 system had to start on either disk drive 0, 5 or 10 in order to keep the parity groups aligned. In a RAID 5 system, however, a file can have its first extent on any disk drive in the array so that the parity groups 104 do not align between different files. In fact, the parity data 108 must be evenly distributed across all disk drives in the array for the RAID 5 system to perform properly. For example, in FIG. 2, the data extents A0–A3 106 are stored on disk drives D3–D6, and the corresponding parity extent 108 is stored on disk drive D7. The next parity group 104 (i.e., extents A4–A7) begins on the disk drive (i.e., D7) following the last disk drive of the previous data extents (i.e., disk drive D6). As such, each successive disk drive stores at least one extent of data 106, and may store a parity extent 108 as well.
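The placement described above for file “A” can be sketched as follows. The sketch assumes that the pattern stated for the first parity group (data extents on consecutive drives, with the parity extent on the drive where the next group begins) continues for later groups; that continuation, and the function names, are assumptions and are not confirmed by the text.

```python
def raid5_data_drive(first_drive: int, extent_index: int, num_drives: int) -> int:
    """Drive holding a data extent: one drive per extent, modulo the array size."""
    return (first_drive + extent_index) % num_drives

def raid5_parity_drive(first_drive: int, group_index: int,
                       group_size: int, num_drives: int) -> int:
    """Drive holding a group's parity extent: the drive following the group's
    last data extent, which is also where the next group's data begins
    (assumed to hold for every group, as it does for the first)."""
    return (first_drive + (group_index + 1) * group_size) % num_drives

D = 12            # disk drives in the FIG. 2 array
P = 4             # data extents per parity group
FIRST_DRIVE = 3   # file A's extent A0 is on D3

# Extents A0-A3 land on D3-D6, their parity extent on D7,
# and extent A4 (start of the next group) also on D7.
assert [raid5_data_drive(FIRST_DRIVE, i, D) for i in range(P)] == [3, 4, 5, 6]
assert raid5_parity_drive(FIRST_DRIVE, 0, P, D) == 7
assert raid5_data_drive(FIRST_DRIVE, 4, D) == 7
```

Since the parity drive of each group depends on where the file starts, files starting on different drives place their parity on different drives, which is how the parity load comes to be spread across the array.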
The advantages of RAID 5 over RAID 3 are twofold: First, for a given amount of server buffer memory, larger extents can be used compared to the real-time RAID 3 mode, thereby allowing data to be extracted more efficiently off each hard disk drive. This follows since the data reads can proceed by a single extent at a time, rather than a full parity group at a time. The second advantage is with regard to disk drive-to-server bandwidth efficiency. In particular, the D disk drives in a RAID 5 array provide D disk drives worth of true (non-parity) data bandwidth. By contrast, the D disk drives in a RAID 3 array provide only D*P/(P+1) disk drives worth of true data bandwidth, where P is the number of data extents in a parity group (e.g., P=4 in FIGS. 1 and 2). Thus, in a RAID 3 format, one disk drive out of each P+1 does not deliver true data.
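The bandwidth comparison above can be checked with a few lines of arithmetic. The sketch simply evaluates the D*P/(P+1) expression from the text for the illustrative array sizes of FIGS. 1 and 2; the function name is hypothetical.

```python
from fractions import Fraction

def raid3_data_bandwidth(num_drives: int, data_per_group: int) -> Fraction:
    """Drives' worth of true (non-parity) data bandwidth in a RAID 3 array:
    D * P / (P + 1), where one drive in every P + 1 holds only parity."""
    return Fraction(num_drives * data_per_group, data_per_group + 1)

# FIG. 1: D = 15 drives with P = 4 data extents per parity group.
# The RAID 3 array delivers 12 drives' worth of true data bandwidth,
# which equals the 12-drive RAID 5 array of FIG. 2 under the text's accounting.
assert raid3_data_bandwidth(15, 4) == 12
```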
One problem in a RAID 5 implementation is that there is no dedicated parity drive for real-time recovery when a disk fails. Thus the stream capacity will be reduced in the failure case. Also, in a RAID 5 implementation, the buffer is typically optimized for the normal case for practical reasons including cost. Thus there is a need in the art for handling disk failures in a video server implementing RAID 5 striping so as to maximize the number of streams supported during a disk failure given practical resource limitations.