1. Technical Field
This invention relates to a method, system, and article for managing storage of a continuous stream of data where each data object has a value assigned thereto, wherein different data objects may have different data values. More specifically, the invention pertains to a heterogeneous storage system employed to store received data.
2. Description of the Prior Art
In a distributed computer system with shared persistent storage, one or more server nodes are in communication with one or more client nodes. FIG. 1 is a block diagram (10) illustrating one example of a distributed computer system. As shown, there are two server nodes (12) and (14), three client nodes (16), (18), and (20), and a storage area network (5) that includes one or more storage devices (not shown). Each of the client nodes (16), (18), and (20) may access an object or multiple objects stored on the file data space (27) of the storage area network (5), but may not access the metadata space (25). In opening the contents of an existing file object on the storage media of the storage device in the storage area network (5), a client contacts the server node to obtain metadata and locks. Metadata supplies the client with information about a file, such as its attributes and location on the storage devices. Locks supply the client with privileges it needs to open a file and read or write data. The server node performs a look-up of metadata information for the requested file within the metadata space (25) of the storage area network (5). The server nodes (12) or (14) communicate granted lock information and file metadata to the requesting client node, including the location of the data blocks making up the file. Once the client node holds a distributed lock and knows the data block location(s), the client can access the data for the file directly from a shared storage device attached to the storage area network.
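The open sequence described above can be illustrated with a minimal sketch. All names here (FileMetadata, MetadataServer, read_file) are hypothetical and are not drawn from any particular file system; the sketch also assumes a single exclusive lock per file, which is a simplification of the distributed locks described.

```python
from dataclasses import dataclass


@dataclass
class FileMetadata:
    attributes: dict        # e.g. size, owner
    block_locations: list   # addresses of the data blocks on shared storage


class MetadataServer:
    """Stands in for a server node; only the server reads the metadata space."""

    def __init__(self, metadata_space):
        self._metadata_space = metadata_space  # file name -> FileMetadata
        self._locks = {}                       # file name -> client holding the lock

    def open(self, client_id, name):
        """Look up metadata and grant a lock (assumed exclusive for simplicity)."""
        if name not in self._metadata_space:
            raise FileNotFoundError(name)
        holder = self._locks.get(name)
        if holder is not None and holder != client_id:
            raise PermissionError(f"{name} is locked by {holder}")
        self._locks[name] = client_id
        return self._metadata_space[name]

    def release(self, client_id, name):
        if self._locks.get(name) == client_id:
            del self._locks[name]


def read_file(server, client_id, name, file_data_space):
    """Client side: obtain metadata and a lock, then read blocks directly."""
    meta = server.open(client_id, name)
    try:
        # The client never touches the metadata space; it only uses the
        # block locations the server handed back.
        return b"".join(file_data_space[addr] for addr in meta.block_locations)
    finally:
        server.release(client_id, name)
```

In this sketch the file data space is modeled as a mapping from block address to bytes, so a client holding the lock reads the blocks without further involvement of the server.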
Storage subsystems of a conventional computer system, such as that shown in FIG. 1, are configured with sufficient capacity to handle a limited quantity of data. In general, data read and write requests are processed through locks granted by the server. Deletion of data from the storage subsystem is typically done manually. Data may be deleted for various reasons.
In a data streaming environment, massive amounts of data are constantly written to the storage subsystem. Data must be periodically removed from the data storage subsystem to free up space for a comparable amount of new data. It is known in the art that the data in the streams cannot be permanently stored in the storage subsystem due to capacity issues, and as such, a protocol must be employed to delete data from the storage subsystem. One known prior art solution is a first in first out (FIFO) approach, wherein the most recent data is retained and the oldest data is deleted from the storage subsystem. Another known prior art solution is a least recently used (LRU) approach, wherein the storage subsystem is treated as a cache and data is retained based on its last use. Accordingly, both of these prior art solutions automatically remove data from the storage subsystem based upon a preset protocol.
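The difference between the FIFO and LRU protocols can be shown with a short sketch. The class names FifoStore and LruStore are hypothetical, and capacity is counted in objects rather than bytes, which is an assumption made only to keep the example small.

```python
from collections import OrderedDict


class FifoStore:
    """Evicts the oldest-written object when the store is full."""

    def __init__(self, capacity):
        self.capacity = capacity        # assumed: capacity counted in objects
        self._objects = OrderedDict()   # insertion order = write order

    def write(self, key, value):
        if key in self._objects:
            self._objects[key] = value  # overwrite keeps the original position
            return
        while len(self._objects) >= self.capacity:
            self._objects.popitem(last=False)  # drop the oldest entry
        self._objects[key] = value

    def read(self, key):
        return self._objects[key]

    def keys(self):
        return list(self._objects)


class LruStore(FifoStore):
    """Same store, but any read or write marks the object as recently used."""

    def write(self, key, value):
        super().write(key, value)
        self._objects.move_to_end(key)

    def read(self, key):
        self._objects.move_to_end(key)  # a read refreshes the object's position
        return self._objects[key]
```

With a capacity of two, writing "a" and "b", reading "a", and then writing "c" leaves "b" and "c" in the FIFO store but "a" and "c" in the LRU store, since only the LRU store takes the read of "a" into account.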
Another known approach for removal of data from the storage subsystem treats data differently based on its current importance to the overall system, and clusters together data that is anticipated to be deleted. More specifically, for each data object that is written to disk, a retention value function describing the projected value of the object over time is assigned to the object. The storage subsystem deletes the data with the lowest current retention function value as space is needed to make room for new data.
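A retention value approach of this kind might be sketched as follows. The names RetentionStore and exponential_decay are hypothetical, the exponential shape of the retention function is only one illustrative choice, and capacity is again assumed to be counted in objects. The clustering of data anticipated to be deleted is not modeled.

```python
import math


def exponential_decay(initial, rate):
    """Illustrative retention function: value decays with the object's age."""
    return lambda age: initial * math.exp(-rate * age)


class RetentionStore:
    """Evicts the object whose retention function is currently lowest."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed: capacity counted in objects
        self._objects = {}         # key -> (written_at, retention_fn, value)

    def write(self, key, value, retention_fn, now):
        if key not in self._objects:
            while len(self._objects) >= self.capacity:
                self._evict_lowest(now)
        self._objects[key] = (now, retention_fn, value)

    def current_value(self, key, now):
        written_at, retention_fn, _ = self._objects[key]
        return retention_fn(now - written_at)

    def _evict_lowest(self, now):
        # Evaluate every object's retention function at the current time
        # and delete the one with the smallest value.
        victim = min(self._objects, key=lambda k: self.current_value(k, now))
        del self._objects[victim]
        return victim

    def keys(self):
        return sorted(self._objects)
```

Unlike FIFO or LRU, the object evicted here need not be the oldest or the least recently used: an old object with a constant retention value can outlive a newer object whose value has already decayed.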
As demonstrated, distributed computer systems designed to handle large-scale data stream processing are evolving. With respect to streaming of data, or storage of large quantities of data in general, there are concerns with how to address capacity issues. In other words, data that is written to storage is generally maintained, and the data at some point is removed to make room for new data. However, the prior art addresses data storage and retention with respect to a homogeneous storage system. It is known in the art for a storage subsystem to be non-homogeneous, i.e. heterogeneous, wherein different storage media in the storage subsystem are grouped based upon characteristics thereof. Examples of aspects that make the storage subsystem heterogeneous include differences in storage controllers, connectivity bandwidths, etc. One form of a heterogeneous storage subsystem is known as tiered storage, wherein the storage components are classified in a hierarchical manner. Tiered storage is a networked storage method in which data is stored on various types of media within a data storage environment consisting of two or more categories of storage delineated by differences between the categories. Such differences generally include price, performance, capacity, and function. Any significant difference in one or more of these four defining attributes can be sufficient to justify a separate storage tier.
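The notion of tiers delineated by price, performance, capacity, and function can be illustrated with a minimal sketch. The Device fields, the units, and the 50% relative-difference threshold used to decide what counts as a "significant" difference are all assumptions chosen for illustration; the text above does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    price_per_gb: float      # assumed unit: currency per gigabyte
    throughput_mb_s: float   # assumed performance measure
    capacity_gb: float
    function: str            # e.g. "primary", "archive"


def _relative_difference(a, b):
    largest = max(a, b)
    return 0.0 if largest == 0 else abs(a - b) / largest


def same_tier(a, b, threshold=0.5):
    """Two devices share a tier if none of the four attributes differs significantly."""
    return (
        a.function == b.function
        and _relative_difference(a.price_per_gb, b.price_per_gb) < threshold
        and _relative_difference(a.throughput_mb_s, b.throughput_mb_s) < threshold
        and _relative_difference(a.capacity_gb, b.capacity_gb) < threshold
    )


def group_into_tiers(devices, threshold=0.5):
    """Greedy grouping: compare each device with the first member of each tier."""
    tiers = []
    for device in devices:
        for tier in tiers:
            if same_tier(tier[0], device, threshold):
                tier.append(device)
                break
        else:
            tiers.append([device])  # a significant difference opens a new tier
    return tiers
```

Because each device is compared only with the first member of an existing tier, the grouping depends on input order; a production classifier would likely use fixed tier definitions instead.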
Although heterogeneous storage subsystems are known in the art, to date they have not been employed with a distributed computer system that processes streaming of large quantities of data. In order to accommodate a heterogeneous storage subsystem, the different tiers in the hierarchy must be accommodated efficiently, so that they may be properly employed with the streams of data received for writing to storage.