The invention relates to a system for retrieving blocks of data, such as audio and/or video data, from a storage medium and supplying the blocks in the form of a plurality of at maximum Nmax data streams to users, wherein each data stream has an identical maximum consumption rate of Rmax data elements per second and the storage medium has a bandwidth of at least Nmax*Rmax data elements per second; the storage medium comprising a plurality of storage units.
A system of this kind is used in a multimedia server and, more specifically, in a video on demand or near video on demand server. A general requirement in such systems is to supply a continuous, uninterrupted stream of data to each active user. Typically, data is read from a conventional storage medium, such as hard disks, which are arranged in a disk array, such as a RAID system. In general, a distinction can be made between a fixed consumption rate and a variable consumption rate system. In a fixed consumption rate system data is, typically, supplied to a user as a fixed rate data stream. Usually, the rate is identical for each stream in the system. An example of such a system is a near-video-on-demand system, wherein a number of films can be played in parallel and the same film may be played several times in parallel, where regularly, for instance, every five or fifteen minutes, a new copy of the same film is started. In a variable consumption rate system the rate at which a user consumes data varies over time. Typically, a maximum consumption rate can be defined for each data stream. In practice, usually an identical maximum consumption rate is used for all streams, although it may be possible to efficiently support streams with different maximum consumption rates (e.g. one maximum for an audio stream and another maximum for a combined video and audio stream). Variable consumption rate systems are, for instance, used for systems which support VCR-like functions such as pause or slow motion, or systems which use a data compression scheme with a variable bit rate, such as MPEG-2.
To supply data to a user as a continuous data stream, special scheduling schemes for reading data from the disks are required with an appropriate scheme for temporarily buffering the read data before the data is supplied to the user. For a fixed consumption rate system, typically, at fixed regular intervals for each stream a fixed amount of data (sufficient to last one period) is read and stored in the buffer. Within a variable consumption rate system, different streams may empty a predetermined block of the buffer at different moments. Typically, during an interval only data is read for streams whose buffers have room for a block of data. Other streams are skipped. As a consequence, the duration of the interval for reading data is variable, bounded by the situation in which all active streams require a new data block.
To guarantee a large disk bandwidth, which is required when all data streams consume at the maximum data rate, the data is usually striped across all disks in the array. Particularly for variable data rate systems this is achieved by partitioning each data block over all disks, such that a request for reading this block implies a disk access to each of the disks. As a result, the load on the disks is optimally balanced.
In general the number of disks in the disk array is determined by the required bandwidth in view of the number of data streams which are supported in parallel. As such, the number of disks grows linearly with the maximum number of data streams. In order to ensure that the effectiveness of accessing a disk remains more or less constant, the size of a block read from an individual disk for one data stream should remain the same (reading blocks that are too small from a disk increases the overhead, particularly if the size drops below the size of a track). As a consequence, the size of the accumulated block, which is striped over all disks, grows linearly with the number of disks and, as such, with the maximum number of data streams. This results in the size of the buffer for each data stream also growing linearly with the maximum number of data streams. Since a buffer is required for each data stream, the combined effect is that the total size of the buffer depends substantially quadratically on the maximum number of data streams, making the costs of the buffer a dominant factor, particularly for systems supporting a large number of data streams. A further negative effect is that the response time of the system for allowing a new data stream to become active increases substantially linearly, since the time interval required to re-fill the buffers increases linearly and during such an interval no new users can enter the system.
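The quadratic growth described above can be illustrated with a minimal sketch. The model is a simplification, not taken from the text: each stream is assumed to buffer exactly one accumulated block, and the names `streams_per_disk` and `per_disk_block_bytes` are hypothetical parameters introduced for illustration only.

```python
import math

def striped_buffer_bytes(n_streams, streams_per_disk, per_disk_block_bytes):
    """Total buffer size when every block is striped over all disks.

    Assumption: one accumulated block is buffered per stream.
    """
    # The number of disks follows from the bandwidth needed for n_streams.
    disks = math.ceil(n_streams / streams_per_disk)
    # The accumulated block spans all disks, so it grows with the disk count.
    per_stream_buffer = disks * per_disk_block_bytes
    # One buffer per stream: streams * disks, i.e. quadratic in n_streams.
    return n_streams * per_stream_buffer

def unstriped_buffer_bytes(n_streams, per_disk_block_bytes):
    """Total buffer size when a block is read in its entirety from one disk."""
    return n_streams * per_disk_block_bytes
```

Under these assumptions, doubling the number of streams from 100 to 200 quadruples the striped total, whereas a buffer whose block size does not depend on the number of disks merely doubles.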
It is an object of the invention to provide a system of the kind set forth in which the buffer costs grow substantially linearly with the maximum number of supported data streams. A further object is to provide such a system wherein the response time remains substantially the same regardless of the maximum number of supported data streams.
To achieve this object, the system according to the invention is characterised in that a predetermined selection of blocks is stored multiple times in the storage medium by individual blocks of the selection being stored in its entirety in at least two different and randomly selected storage units; and the system comprises:
a scheduler for controlling reading of blocks for the data streams from the storage medium by, for each block, selecting from a corresponding set of all storage units in which the block is stored one storage unit from which the block is to be read and assigning a corresponding read request to the selected storage unit; the scheduler being operative to select the storage unit from the set such that the load on the plurality of storage units is balanced; and
a reader for, in response to a block read request, reading the corresponding block from the assigned storage unit for supply in the corresponding data stream to a user.
By ensuring that a block is stored in at least two different storage units, the scheduler has a choice in selecting from which of the storage units the block is read during play-back. This makes it possible for the scheduler to choose the storage units in such a way that the load on the storage units is balanced. By storing the blocks in randomly selected storage units a relatively even load can be achieved during playback even for variable rate systems for which it is otherwise difficult or even impossible to predict how the load on individual storage units will develop in time for the active streams. In a simple system, all blocks of all titles are stored twice in the system, i.e. in two different storage units of the system. The selection of blocks may alternatively be a part of all blocks, where the remaining blocks are stored only once.
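The random storage of each block in at least two different storage units may be sketched as follows. This is only an illustration under stated assumptions: storage units are identified by integers, and the function name `place_blocks` and its parameters are hypothetical.

```python
import random

def place_blocks(num_blocks, num_units, copies=2, seed=None):
    """Return, for each block, the distinct storage units holding a copy.

    random.Random.sample draws without replacement, so the copies of one
    block always end up in different storage units.
    """
    rng = random.Random(seed)
    return [rng.sample(range(num_units), copies) for _ in range(num_blocks)]

# Example: 1000 blocks, each stored twice, over 20 storage units.
placement = place_blocks(1000, 20, copies=2, seed=1)
```

Each entry of `placement` is the set from which a scheduler could later choose one storage unit when the block has to be read.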
The solution offered by the system according to the invention benefits from the insight that, unlike the bandwidth, the capacity of the storage medium is not a major determining factor and will increasingly become less of a determining factor, since the capacity of hard disks grows on average at a higher rate (typically 60% a year) than the bandwidth (typically 40% a year). Assuming that in two years' time a typical disk will have a storage capacity of 25 Gbyte and a guaranteed data rate of 85 Mbits/sec, the following two scenarios can be envisioned. In a first scenario, a video on demand version of the system according to the invention is used to offer a range of up to 20 films, like in a cinema, to a maximum of 1000 users. The films are encoded using MPEG-2 at an average rate of 4 Mbits/sec. Assuming that on average a film has a length of 90 min., a total storage capacity of 90*60*4*10^6 = 21.6*10^9 bits, or approximately 2.5 Gbyte, per film is required, resulting in a total required storage capacity of approximately 50 Gbyte. If all blocks are stored twice, the capacity will have to be approximately 100 Gbyte, corresponding to 4 or 5 disks. To be able to support 1000 users simultaneously, a total guaranteed bandwidth of 4000 Mbits/sec is required, which corresponds to 47 disks. Only if such a system is required to store over 235 films does the storage capacity of the disks become relevant. In a more moderate second scenario, in which the system according to the invention is used to offer up to a hundred users a choice of films, as is typical for a hotel, five disks are required with respect to the bandwidth. With full duplication of all blocks, the five disks can store 23 films, which is more than is usually offered in such a system.
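The arithmetic of the first scenario can be retraced with a short sketch. The constants are the figures quoted above; reading "Gbyte" as 2^30 bytes is an assumption made here because it reproduces the quoted 2.5 Gbyte per film, and the function names are hypothetical.

```python
import math

# Figures quoted in the text.
DISK_CAPACITY_GBYTE = 25   # assumed capacity of a typical disk
DISK_RATE_MBIT_S = 85      # assumed guaranteed data rate per disk
FILM_RATE_MBIT_S = 4       # average MPEG-2 rate
FILM_SECONDS = 90 * 60     # average film length of 90 minutes

def film_size_gbyte():
    """Size of one film; assumes 1 Gbyte = 2**30 bytes."""
    bits = FILM_SECONDS * FILM_RATE_MBIT_S * 10**6   # 21.6 * 10**9 bits
    return bits / 8 / 2**30

def disks_for_bandwidth(users):
    """Ratio of required bandwidth to the guaranteed rate of one disk."""
    return users * FILM_RATE_MBIT_S / DISK_RATE_MBIT_S

def films_that_fit(disks, film_gbyte=2.5, copies=2):
    """Number of films that fit when every block is stored `copies` times."""
    return math.floor(disks * DISK_CAPACITY_GBYTE / (copies * film_gbyte))
```

With these definitions, `disks_for_bandwidth(1000)` is slightly above 47 and `films_that_fit(47)` gives 235, in line with the figures of the first scenario.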
The invention is further based on the insight that, although data blocks are stored multiple times, in a typical system fewer storage units are in fact required in the system according to the invention than in a conventional system. As already shown for the above scenarios, bandwidth is the limiting factor for most practical systems and will increasingly be even more so. By no longer striping data blocks across the individual storage units, but instead reading the entire block from one disk, the efficiency in reading data from the storage unit is improved. In the first scenario an efficiency improvement of just over 2% already results in saving a disk.
A further advantage of the system according to the invention lies in the improved robustness of the system. As an example, if one out of twenty disks fails and all blocks are duplicated, the scheduler will on average for one out of twenty blocks not be able to choose from which unit the block should be read. It will be appreciated that as a result the average load on a storage unit will increase by 5.3% and the maximum load on one of the remaining storage units will also increase by substantially the same percentage. In most circumstances where the system is not used by the maximum number of users this will give ample opportunity to replace the failed disk. No other measures such as parity disks are required.
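The 5.3% figure follows from spreading the unchanged total load over nineteen instead of twenty storage units. A minimal sketch, assuming the load redistributes evenly over the remaining units (the function name is hypothetical):

```python
def load_increase_after_failure(num_units, failed=1):
    """Relative increase of the average load per remaining storage unit.

    Assumes the total load is unchanged and spreads evenly over the
    units that are still operational.
    """
    return num_units / (num_units - failed) - 1

increase = load_increase_after_failure(20)   # 20/19 - 1, about 0.053
```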
Moreover, the extendibility of the system increases. If additional bandwidth (or storage capacity) is required an additional disk can be added to the system. Next, blocks need to be moved from the already present disks to the new disk. The blocks to be moved may be selected randomly. Typically, the number of blocks to be moved will substantially be the average number of blocks stored on a disk. In the system according to the invention, the impact on the operation of the system is very low. In a 50-disk system only 2% of the blocks need to be moved. All other blocks are entirely unaffected. Also, the moving operation can be performed whenever capacity is available to move a block (i.e. the operation can be performed in the background while the system is operational). As blocks are being moved to the new disk, the capacity of the system slowly increases until finally the desired number of blocks have been moved to the new disk. In contradistinction, for a typical system wherein blocks are striped over all disks, all disks need to be accessed and blocks may even need to be rebuilt in order to store data on the new disk. Such a major operation must usually be performed while the system is off-line.
A disk based storage server for 3D interactive applications, such as an architectural walk through a town like Los Angeles, is described in "Randomized Data Allocation for Real-time Disk I/O", Compcon '96 (41st IEEE Computer Society International Conference, Santa Clara, Feb. 25-28, 1996). In this system, which considers very large 3D models, at least on the order of a terabyte, the nature of the I/O stream with respect to upper bounds on delay and resource utilisation is very different from video on demand systems where these aspects are (largely) predictable. For this system an unpredictable action of the user determines whether an entirely new portion of the model becomes visible and should be presented to the user. To be able to provide the high peak bandwidth required for presenting a new portion of the model, a data block is typically striped over seven disks forming a group of seven blocks, each occupying one full track on a disk. The group of seven disks is randomly selected from the entire set of available disks. One of the seven blocks is a parity block. The random distribution of the groups results in a statistical variation in the queue lengths. The variability can be decreased by reading only g-1 blocks of a group of g blocks. If the unread block is a data block, the parity block is used to reconstruct the unread block.
The measure as defined in the dependent claim 2 describes a simple scheduling scheme wherein requests for fetching data blocks from the storage medium are handled individually by the scheduler and, typically, in the sequence as they arrive at the scheduler. For each request the scheduler determines, based on the then actual load resulting from already assigned block read requests, to which storage unit the request should be assigned, being the storage unit having the lowest load of all storage units which store the block to be read.
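The scheme of handling requests one at a time may be sketched as follows. This is a minimal illustration, not the claimed implementation: the load of a storage unit is taken to be the number of read requests already assigned to it, ties are resolved in favour of the first listed unit, and the name `assign_least_loaded` is hypothetical.

```python
def assign_least_loaded(requests, num_units):
    """Assign each request, in order of arrival, to its least-loaded unit.

    requests: one tuple per block, listing the storage units that hold a
    copy of that block.
    Returns the chosen unit per request and the resulting load per unit.
    """
    load = [0] * num_units
    chosen = []
    for candidates in requests:
        # Pick the candidate unit with the lowest load so far.
        unit = min(candidates, key=lambda u: load[u])
        load[unit] += 1
        chosen.append(unit)
    return chosen, load

chosen, load = assign_least_loaded([(0, 1), (0, 2), (1, 2)], num_units=3)
```

In this small example the three requests end up on three different storage units, so each unit has to serve one read request.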
The measure as defined in the dependent claim 3 balances the load by distributing the requests for a group of blocks to be read. This allows for a more even load compared to assigning the requests on an individual basis, where already assigned requests can no longer be influenced. As an example, assigning requests individually may result in the following situation. A request may be assigned to storage unit A which is chosen from two possible storage units A and B, having equal load. The load on a storage unit C is one higher than the initial load on A and B. For a next request a choice has to be made between storage units A and C. Since both A and C now have equal load, whatever choice is made will always result in increasing the load on one of the storage units. By dealing with both requests together in one group, the first request can be assigned to B and the second request to A, resulting in a lower maximum load on the three units. In most systems, the storage units are read in a synchronised manner (i.e. a reading operation is started for all disks at the same time, where the reading itself is performed in one sweeping movement of the head of the disk), implying that an increase of the load of the storage unit with the highest load will result in a longer interval for reading from the storage units. This results in a longer response time for accepting new users.
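The example with storage units A, B and C can be retraced in a short sketch. The initial loads of 0, 0 and 1 are assumed values chosen to match the description, and the exhaustive search over the group is used only because the group is tiny; it is not put forward as the scheduling method of the invention.

```python
from itertools import product

def sequential_max_load(initial, requests):
    """Maximum load when requests are assigned one by one (ties: first unit)."""
    load = dict(initial)
    for candidates in requests:
        unit = min(candidates, key=lambda u: load[u])
        load[unit] += 1
    return max(load.values())

def grouped_max_load(initial, requests):
    """Lowest achievable maximum load when the group is considered as a whole."""
    best = None
    for choice in product(*requests):      # every combination of choices
        load = dict(initial)
        for unit in choice:
            load[unit] += 1
        worst = max(load.values())
        if best is None or worst < best:
            best = worst
    return best

initial = {"A": 0, "B": 0, "C": 1}         # C is one higher than A and B
requests = [("A", "B"), ("A", "C")]        # first request, next request
```

Here `sequential_max_load(initial, requests)` yields 2, whereas `grouped_max_load(initial, requests)` yields 1, corresponding to assigning the first request to B and the second to A.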
The measure as defined in the dependent claim 4 describes a simple scheduling scheme for balancing the load for a group of requests wherein requests for fetching data blocks from the storage medium are handled individually by the scheduler and, typically, in the sequence as they arrive at the scheduler. For each request the scheduler determines, based on the then actual load resulting from already assigned block read requests, to which storage unit the request should be assigned, being the storage unit having the lowest load of all storage units which store the block to be read. Furthermore, requests currently being dealt with or already definitely assigned to storage units (e.g. already issued to a hardware read request buffer storing requests for blocks to be read in a next sweeping operation of the disk) do not need to be considered. By ensuring that the load is balanced evenly within each successive group, a relatively even load on the storage units will be maintained as a result of the successive groups.
The measure as defined in the dependent claim 5 allows obtaining a very even load on the storage units. Simulations have shown that by using the initial sequential assignment as described in claim 4 followed by just one round of a sequential reassignment as described in claim 5, the distribution is very close to an optimum. It will be appreciated that the condition for performing a next iteration may simply be a counter (like performing only one re-assignment step), but may also depend on, for instance, the quality of the distribution which has been obtained so-far in view of the maximum load and the average load. The measure as defined in the dependent claim 6 gives a simple criterion being whether the previous iteration step has resulted in an improvement or not.
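An initial sequential assignment followed by a round of sequential reassignment may be sketched as below. The wording of claim 5 is not reproduced in this section, so the reassignment rule used here is an assumption: each request is in turn taken off its storage unit and placed again on the candidate unit that is then least loaded. The name `assign_then_rebalance` and the `rounds` counter are hypothetical.

```python
def assign_then_rebalance(requests, num_units, rounds=1):
    """Sequential assignment followed by `rounds` reassignment passes."""
    load = [0] * num_units
    chosen = []
    # Initial pass: each request goes to its least-loaded candidate unit.
    for candidates in requests:
        unit = min(candidates, key=lambda u: load[u])
        load[unit] += 1
        chosen.append(unit)
    # Reassignment passes (assumed rule): withdraw a request, then place it
    # on the candidate that is least loaded once it has been withdrawn.
    for _ in range(rounds):
        for i, candidates in enumerate(requests):
            load[chosen[i]] -= 1
            unit = min(candidates, key=lambda u: load[u])
            load[unit] += 1
            chosen[i] = unit
    return chosen, load

group = [(0, 1), (0, 2), (0, 2)]
before = assign_then_rebalance(group, num_units=3, rounds=0)
after = assign_then_rebalance(group, num_units=3, rounds=1)
```

For this group the initial pass leaves two requests on unit 0, and a single reassignment pass spreads the three requests over the three units. Under the assumed rule a withdrawn request is never placed on a unit that ends up more heavily loaded than the unit it came from, so a pass cannot increase the maximum load.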
The measure as defined in the dependent claim 7 describes how based on an analogy with a maximum flow problem an optimum distribution of the load can be achieved.
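One common way to phrase such a balancing task as a flow problem is sketched below; claim 7 is not reproduced in this section, so the formulation is an assumption and not necessarily the claimed one. Each request may send one unit of flow to any storage unit holding its block, each storage unit may pass at most `cap` units on to the sink, and the smallest `cap` for which all requests can be routed is the optimum maximum load. The flow is found with simple augmenting paths; all names are hypothetical and initial loads are taken to be zero.

```python
def assign_with_cap(requests, num_units, cap):
    """Try to assign every request such that no unit gets more than `cap`."""
    assigned = [[] for _ in range(num_units)]   # requests placed per unit

    def try_place(r, seen):
        # Augmenting path search starting from request r.
        for u in requests[r]:
            if u in seen:
                continue
            seen.add(u)
            if len(assigned[u]) < cap:
                assigned[u].append(r)
                return True
            # Unit is full: try to move one of its requests elsewhere.
            for other in assigned[u]:
                if try_place(other, seen):
                    assigned[u].remove(other)
                    assigned[u].append(r)
                    return True
        return False

    for r in range(len(requests)):
        if not try_place(r, set()):
            return None
    chosen = [None] * len(requests)
    for u, placed in enumerate(assigned):
        for r in placed:
            chosen[r] = u
    return chosen

def balance_by_flow(requests, num_units):
    """Return the smallest feasible maximum load and one matching assignment."""
    for cap in range(1, len(requests) + 1):
        chosen = assign_with_cap(requests, num_units, cap)
        if chosen is not None:
            return cap, chosen
    return 0, []
```

For instance, `balance_by_flow([(0, 1), (1, 2), (0, 1)], 3)` finds a maximum load of 1, although assigning the same requests one by one with ties resolved towards the first unit would give a maximum load of 2.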
The measure defined in the dependent claim 8 provides an effective and simple way of achieving an optimal load.
The measure defined in the dependent claim 9 describes that titles are associated with a multiplication factor indicating a level of multiplication of the title. Advantageously, more than one different multiplication factor may be used in the system. For instance, a group of most popular titles is assigned a high multiplication factor (e.g. higher than 2, indicating that for these titles more than twice the number of blocks are stored than form part of the title), whereas for not regularly accessed films, like old favourites, a low multiplication factor is used (e.g. ranging from using hardly any more blocks for storing the title to using 1.5 times the number of blocks). Simulations have shown that for films with a similar popularity good results can already be achieved with a multiplication factor of approximately 1.75 (i.e. 75% of the blocks during playback are available twice). It will be appreciated that by multiplying the most popular titles an overall multiplication factor of 1.75 during playback (i.e. 75% of the blocks read being available twice) is already achieved by duplicating substantially less than 75% of the titles.
It will be appreciated that the title-specific multiplication may also be applied in other systems with some form of storing blocks multiple times. Such a system could be described as a system for retrieving blocks of data, such as audio and/or video data, from a storage medium and supplying the blocks in the form of a plurality of data streams to users; the storage medium comprising a plurality of storage units; a predetermined selection of blocks being stored multiple times in the storage medium by individual blocks of the selection being stored in at least two different and non-overlapping groups of at least one storage unit; the blocks relating to a plurality of titles, each title comprising a sequence of data blocks and being associated with a predetermined multiplication factor; the multiplication factor of a title relating to a ratio of the number of blocks stored in the storage medium for the title and the number of blocks of the title; the system comprising:
a scheduler for controlling reading of blocks for the data streams from the storage medium by, for each block, selecting from a corresponding set of all groups of at least one storage unit in which the block is stored one group from which the block is to be read and assigning a corresponding read request to the selected group; and
a reader for, in response to a block read request, reading the corresponding block from the assigned group of at least one storage unit for supply in the corresponding data stream to a user.
It will be appreciated that a block may be stored in one single storage unit or stored in a group of more than one storage unit (e.g. striped across a few storage units or the storage medium in order to increase bandwidth). If the block is stored multiple times, then each data element of the block is stored in at least two different storage units (no full overlap). In this way redundancy is achieved, which increases robustness but also allows balancing during playback. If, for instance, a block is striped over a group of two disks (i.e. stored as two sub-blocks SB1 and SB2) and is stored three times, then preferably each of the two different sub-blocks is stored in three different storage units. In this case, if one storage unit fails each sub-block is still available from at least two storage units. Preferably, the group of storage units is selected 'randomly', improving the load balancing during playback. However, the system can also offer advantages for highly non-random storage strategies. A frequently used non-random storage strategy is the so-called round-robin method. In such a system blocks of one title are stored on successive disks in the sequence of playback. As an example, for a system with seven main disks the first block of a title may be stored on disk 1, block 2 on disk 2, . . . , block 7 on disk 7, block 8 on disk 1, block 9 on disk 2, etc. To achieve redundancy in such a system, typically a parity disk is added (in the example disk 8), where the parity disk stores for each cycle (block 1 to block 7; block 8 to block 14) one parity block calculated over the blocks of the cycle. If all main disks function, a cycle requires 7 successive disk accesses (one access to each of the disks). If, however, one of the main disks fails, six normal accesses are still required, with seven additional accesses for the block which cannot be read directly (one access to the parity disk and six accesses to the still operational main disks).
Consequently, the performance of the system in such a situation almost halves. Alternatively, all six normally accessible blocks of a cycle may be kept in memory and used to reconstruct the block from the unavailable disk in combination with the parity block. This, however, requires a significant increase in memory requirements, typically resulting in the system being able to service considerably fewer users. By storing blocks multiple times, the robustness of the system is increased. Preferably, the same block is stored optimally 'out-of-phase' to reduce the worst case access delay. As an example, in the above described system, if the blocks are each stored twice, preferably block 1 is also stored on disk 3 (or disk 4), reducing the worst case delay for a new user to almost half the cycle duration instead of the duration of one cycle. If one of the main disks fails, all blocks are still available. Two of the seven blocks of a cycle will only be available on one disk. The normal round-robin reading scheme can no longer be used. Preferably, for each block it is decided based on the actual load from which storage unit the block should be read. As has been indicated above for a system with random storing of blocks, balancing is not too severely affected if not all blocks are available multiple times. As such the performance degradation will be slight compared to the traditional round-robin scheme. Storing a block multiple times may increase the storage costs. Such an increase may be (partly) offset by not using separate parity disks. For instance, in a round-robin style system the 20% most popular titles (usually covering at least 80% of the use) may be stored twice whereas all other titles are only stored once. No parity disk is used any more. If a disk fails, 80% of the users are not affected. The 80% least popular titles need to be retrieved from a back-up storage, such as a tape, possibly affecting 20% of the users.
In such a case it may not be possible to serve some or all of these titles until the failed disk has been replaced and rebuilt. It will be appreciated that the concept of multiplication relates to one coherent storage medium wherein the storage units are directly accessible and redundancy and load balancing occur exclusively for that group of storage units. An example of such a storage medium is a disk array with one group of disks, connected via one shared bus to a controller. It will be appreciated that a combination of such a disk array and a back-up storage, such as a tape, which also stores blocks of the titles, is not included. In a hierarchical disk array, wherein smaller disk arrays are grouped to form a large disk array, the storage medium relates to either the smaller disk array or the larger one, whichever is used to store all blocks of one title.
The measure defined in the dependent claim 10 illustrates that a title may be stored multiple times in its entirety (e.g. 2 or 3 times) and that alternatively or additionally a percentage of individual blocks of the title may be stored multiple times. As an example, a multiplication factor of 1.75 may represent that the title can be seen as a sequence of non-overlapping successive groups of four blocks, where three blocks of each group (e.g. the first three blocks) are stored twice and one block (e.g. the last block) is stored once. Similarly a multiplication factor of 2.5 may represent that every 'odd' block of the title is stored twice and every 'even' block is stored three times.
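The relation between such a repeating storage pattern and the multiplication factor can be made explicit with a small sketch. Blocks are assumed to be numbered from 1 in playback order, and the names `copies_for_block` and `multiplication_factor` are hypothetical.

```python
from fractions import Fraction

def copies_for_block(block_number, pattern):
    """Number of stored copies of a block (blocks numbered from 1)."""
    return pattern[(block_number - 1) % len(pattern)]

def multiplication_factor(pattern):
    """Stored blocks divided by blocks of the title, for a repeating pattern."""
    return Fraction(sum(pattern), len(pattern))

GROUPS_OF_FOUR = (2, 2, 2, 1)   # first three blocks twice, last block once
ODD_EVEN = (2, 3)               # odd blocks twice, even blocks three times
```

With these patterns, `multiplication_factor(GROUPS_OF_FOUR)` equals 7/4, i.e. 1.75, and `multiplication_factor(ODD_EVEN)` equals 5/2, i.e. 2.5.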