1. Field of the Invention
The present invention relates to a video storage server, and more particularly, to a video-on-demand storage server which supports a large number of concurrent streams of data.
2. Description of the Related Art
Video-on-demand (VOD) storage servers include multiple magnetic disk drives and are required to support a large number of concurrent streams of data. With compressed video data, the rate of each stream is several times lower than the sustained transfer rate of a single magnetic disk drive. Since the cost of storage devices is a large portion of the cost of the VOD storage server, and the large number of concurrent streams often makes the storage subsystem bandwidth-limited rather than capacity-limited, it is desirable to make the most efficient use of the disk drives so as to minimize the cost per unit storage. Consequently, the throughput of the server is an important design goal.
Generally speaking, VOD storage servers typically operate by reading "chunks" of data from magnetic disk drives into a buffer memory, and then sending the content of the buffer memory in small "cells" to a destination over a communication network. The sending of the small "cells" is referred to as "streaming." The rate of cells per video stream is dictated by the rate for that stream. Once a user begins viewing a stream of data, the server must not overrun the amount of the available buffering when reading additional data, nor should the buffer memory be allowed to become empty. In effect, the buffer memory "smooths" the data transfer from the magnetic disks to the communication network.
With video data, the semiconductor storage memory required for buffering is another large portion of the overall cost of the VOD storage server. Hence, it is an important design goal to keep down the amount of required buffer memory. Data placement on a magnetic disk and the scheduling of retrieval of that data into buffer memory are therefore also important considerations in the design of the VOD storage servers. Specifically, placement and scheduling determine the maximum number of concurrent streams, the response time to user requests, as well as the amount of buffering required to mask the variability in the rate at which chunks for any given stream are actually retrieved from the magnetic disks and the difference between the disk and stream rates.
The use of storage systems for video data differs significantly from the use of storage systems in other applications. For example, in scientific computing or medical imaging systems, disk arrays are used to meet the single-stream rate requirements. As another example, when disks are used in on-line transaction processing, the number of accesses to small, unrelated blocks of data per unit time is most important, with "smoothness" not having any meaning and data throughput being of secondary importance.
In VOD storage systems, cost, throughput and smoothness are important design considerations. In order to be able to utilize the transfer bandwidth of all disk drives regardless of the viewing choices made by the users, as well as for other reasons, it is common practice to "stripe" each movie across many, often all, of the disk drives. This entails recording a first chunk of a movie on a first disk drive, the next chunk on the next one, etc., eventually returning to the first one and beginning another round. Striping is well known and has been used both for "load balancing" and to maximize the transfer rate for a single large request. The size of a chunk is chosen so as to keep high the fraction of time during which a disk is actually transferring data (as opposed to moving the reading head). Increasing the chunk size, however, increases the required size of the buffer memory and may also result in a longer response time to new user requests.
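By way of illustration only, the round-robin striping described above may be sketched as follows. The function names `disk_for_chunk` and `stripe_layout` are hypothetical and do not appear in any particular server; the sketch assumes that chunks and disks are numbered from zero and that every movie starts on disk 0.

```python
def disk_for_chunk(chunk_index, num_disks):
    """Return (disk, round) for a chunk under round-robin striping.

    Assumption: chunk 0 is placed on disk 0, chunk 1 on disk 1, and so on,
    wrapping back to disk 0 after the last disk.
    """
    return chunk_index % num_disks, chunk_index // num_disks


def stripe_layout(num_chunks, num_disks):
    """List, for each disk, the chunk indices recorded on it."""
    layout = [[] for _ in range(num_disks)]
    for chunk in range(num_chunks):
        disk, _ = disk_for_chunk(chunk, num_disks)
        layout[disk].append(chunk)
    return layout


if __name__ == "__main__":
    # Ten chunks of one movie striped across four disks.
    for disk, chunks in enumerate(stripe_layout(10, 4)):
        print(f"disk {disk}: chunks {chunks}")
```

Because consecutive chunks of a stream fall on consecutive disks, a stream played at a steady rate visits every disk in turn, which is what spreads the load regardless of which movies are being viewed.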
The use of a large number of disk drives gives rise to the problem of system unavailability due to disk failure. This problem is aggravated by the striping of the data across the disk drives, since the failure of any single disk renders all data useless. A solution to this problem is to add one additional disk drive, and record on it the "parity" of the data in the other disks. This solution is known in the art as RAID (redundant array of inexpensive disks). For example, consider the first bit of every disk drive. If the number of "1"s is odd, a "1" would be recorded in the first bit position of the parity disk; otherwise, a "0" would be recorded. The process is similar for the other bits on the drives. In the event of a disk failure, and assuming that the identity of the failing disk drive is known, each of the bits the failed disk contained can be reconstructed from the corresponding bits of the remaining disk drives, by using the same process that was originally used to construct the parity bits, with the roles of the parity drive and the failing drive reversed.
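The parity computation and reconstruction described above amount to a bitwise exclusive-OR across corresponding positions of the drives. The following is a minimal sketch under the assumption that each drive's content is modeled as an equal-length byte string; the names `xor_blocks`, `make_parity` and `reconstruct` are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks.

    A result bit is 1 exactly when the number of 1s in that bit position
    across all blocks is odd, as in the parity rule described above.
    """
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the content of the parity drive from the data drives."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the failed drive: same process, with the parity drive
    taking the place of the failed drive."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"\x0f\xa0", b"\x33\x01", b"\xf0\xff"]  # three data drives
    parity = make_parity(data)
    # Suppose the second drive fails and its identity is known.
    rebuilt = reconstruct([data[0], data[2]], parity)
    print(rebuilt == data[1])
```

Note that reconstruction needs the corresponding bits of every remaining drive, which is why the identity of the failed drive must be known and why an entire stripe has to be available.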
To permit operation with a bad disk, an entire stripe must be read into memory to permit quick reconstruction of the bad disk's data from that of operational ones. Doing so is natural in many applications, since the data of the entire stripe is needed by the computer for processing. A typical size of a data chunk corresponds to several tenths of a second of video playing time. Consequently, reading a large number of chunks into a buffer memory merely because they belong to the same stripe would tie up large amounts of memory per stream for a long time. Specifically, the amount of memory per stream would be proportional to the number of disk drives forming a parity group. Since the number of streams that a server can produce concurrently is proportional to the number of disk drives, the total amount of memory required to buffer streams could increase quadratically with the number of disks. Since a server may contain tens or even hundreds of disks, this would be disastrous. Having to read an entire stripe into memory is thus a major problem for a video server.
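The quadratic growth noted above can be checked with a short calculation. This is a sketch under simplifying assumptions that are not taken from any particular server: the parity group spans all the disks, each disk sustains a fixed number of streams, and each stream buffers exactly one full stripe. The function name and the figures used are hypothetical.

```python
def buffer_memory_bytes(num_disks, streams_per_disk, chunk_bytes):
    """Total buffer memory when every stream holds a full stripe.

    Assumptions (illustrative only):
      - the number of concurrent streams is proportional to the number of disks;
      - the parity group spans all disks, so a stripe is num_disks chunks.
    """
    streams = num_disks * streams_per_disk
    per_stream = num_disks * chunk_bytes
    return streams * per_stream


if __name__ == "__main__":
    # Hypothetical figures: 5 streams per disk, 256 KiB chunks.
    for disks in (10, 20, 40):
        total = buffer_memory_bytes(disks, 5, 256 * 1024)
        print(f"{disks} disks: {total / 2**20:.0f} MiB")
```

Under these assumptions, doubling the number of disks quadruples the total buffer memory, since both the number of streams and the memory per stream double.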
The RAID approach has been modified in recent years to obtain greater performance. Even so, RAID and its modifications still suffer from serious disadvantages.
One modification to RAID is called staggered access. Here, the system places the data in the same manner as in conventional RAID, but the access schedules to the different disks are staggered in time. As a result, data for each stream is supplied incrementally and the buffer size per stream is a constant. One disadvantage of this approach is that it cannot effectively tolerate a disk failure. In the event of a disk failure, the approach would either require that each chunk of data be read twice (once to help reconstruct the data of the failed disk and once when its turn comes for transmission), or else the same large amount of buffer memory would be required as in the conventional RAID. Another disadvantage is the tight coupling among the access schedules to the different disks, and the persistent nature of congestion caused by coincidental user requests or small differences in the rates of different video streams. Yet another disadvantage is that rebuilding the content of the failed disk onto a new one can consume as much as the entire bandwidth of all the disks.
Another modification to RAID is known as partitioned RAIDs. Here, the M disk drives are partitioned into sets of size k+1, where k+1 divides M. The k+1 disks of any single RAID are all accessed simultaneously, but the access schedules to the different RAIDs are staggered in time. This scheme mitigates the large buffer memory requirement if k is sufficiently small, but streaming capacity drops to k/(k+1) of its normal value with a failed disk, and rebuilding can again effectively consume the entire bandwidth of the server. Also, the persistence problem mentioned for the staggered access applies here as well.
Further, all schemes with a regular data layout and no true "slack" in the choice of disks at reading time suffer from a direct translation of user-generated scenarios (the correlation between viewer actions) to storage-system scenarios (the correlation between the load on different disks). As a result, any congestion problems that occur tend to persist, and it is moreover generally impossible to prevent such problems from occurring.
Thus, there is a need for a data storage and retrieval technique that not only provides load balancing and fault tolerance, but also minimizes persistence of congestion and requires only a reasonable amount of buffering.