1. Field of the Invention
The present invention generally relates to a method and apparatus for optimizing storage in a stream-based distributed computer system, and more particularly to a method and apparatus for maximizing the value of retained data in a storage system incorporating retention function-based data detection.
2. Description of the Related Art
Computer storage systems for storing inputted data are commonly known. However, not all commonly known computer data storage systems are designed to handle streaming data applications. Distributed computer systems, which for purposes of the present application refer to storage systems including multiple storage units (e.g., “vats”) that are coupled together, have been specifically designed to handle streaming data applications. However, distributed computer systems designed to handle very large-scale streaming workloads (e.g., on the scale of hundreds of thousands of incoming streams of data) are in their infancy.
Highly scalable distributed computer systems that may handle complex applications involving large quantities of streaming data are possible. In particular, distributed computer systems, including tens of thousands of processing nodes, may have the capability of concurrently supporting hundreds of thousands of incoming and derived data streams and having storage subsystems with a capacity of multiple petabytes.
Even at these large sizes (e.g., a storage capacity of multiple petabytes), the distributed computer systems will not be able to handle all of the streaming data. That is, the processors will be fully utilized and still unable to process all of the streaming data. Additionally, the offered load will far exceed the processing capabilities of the systems, and the storage systems will be over capacity.
FIG. 1 depicts a conventional distributed storage subsystem 100 for a streaming data system as described above. Incoming and derived streams 102 of data are processed by interconnected applications on a distributed set 104 of processing nodes 106a-106c. These processing nodes 106a-106c are interconnected via a network 108, and connected, via a storage network, to a collection 110 of storage vats 112a-112c. Each vat 112a-112c may include an individual file system.
Storing streaming data presents a challenge that is qualitatively different from that of conventional systems (i.e., systems including non-streaming input data), because of the huge quantities of primal (incoming) and processed data that need to be written to disk. The storage subsystem of a conventional computer system is typically configured with sufficient capacity to handle the data, and deletion of data is typically done manually. In a streaming environment, however, massive amounts of data are being written constantly. No reasonable amount of storage will be able to keep up with the incoming and derived streaming data, and therefore very little of the data can be kept permanently. In fact, one can assume that in steady state, the storage subsystem will constantly be more or less fully allocated.
Thus, as new data arrives, an equivalent amount of old data must be flushed (deleted). Since the deletion operations will happen at great rates, they cannot be done manually (as is done in conventional systems). Given that a typical distributed computer system for streaming data applications will run continuously, there will be no ‘down’ time to fix problems. Any attempt to optimize the storage of the streaming data must therefore be done in real time. Consequently, conventional storage techniques, where data is deleted manually, are not suitable for a streaming data system.
Stored data objects in streaming systems are typically regarded as immutable once created. Thus, the storage subsystem has the roles of handling initial writes, potentially multiple reads, and, finally, deletion of the data.
One solution to the automatic deletion of data might be to keep the most recent data, displacing the oldest data first. This is commonly known as the first in, first out (FIFO) approach. Another idea is to retain data based on the time of its last usage (initial write or subsequent read). This is commonly known as the least recently used (LRU) approach, effectively treating the entire storage subsystem as though it were a huge cache. Each of these techniques is a conventional technique that has been used in non-streaming data systems.
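The two conventional policies described above can be contrasted in a minimal Python sketch. The `Entry` record and its field names are hypothetical and chosen only for illustration; they do not come from any particular storage system.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    written_at: float  # time of the initial write
    last_used: float   # time of the initial write or of the latest read


def fifo_victim(entries):
    """FIFO: choose the object that was written earliest."""
    return min(entries, key=lambda e: e.written_at)


def lru_victim(entries):
    """LRU: choose the object whose most recent use is oldest."""
    return min(entries, key=lambda e: e.last_used)


# Hypothetical sample data: "a" is the oldest object but was read recently.
entries = [
    Entry("a", written_at=1.0, last_used=9.0),
    Entry("b", written_at=2.0, last_used=2.0),
    Entry("c", written_at=3.0, last_used=5.0),
]
```

In this sketch FIFO selects "a" and LRU selects "b". Neither policy consults any measure of how valuable the data is, only when it was written or used.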
However, neither of these concepts will work well for streaming data applications, because these approaches do not optimize the value of data being retained.
Accordingly, there is a need for a more sophisticated approach. A conventional approach for handling streaming data has been developed that treats data differently based on its current importance to the overall system. For example, the headlines of news articles from CNN might be worth storing for longer periods of time than the actual body of the news articles.
The approach is to define for each data object to be written to disk a function describing its projected value over time (i.e., a so-called time value of information objects). This retention value function is typically non-increasing, within a range from 0 to 100, though neither of these properties is strictly required. The storage subsystem then deletes the data with the lowest current retention function values as space is needed. This design results in a relative rather than absolute notion of value. That is, the retention function value at a given time does not guarantee the amount of time the data object has left before being deleted. The overhead associated with such a deletion method is manageable, at least as long as the number of such functions is not too large.
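As a rough illustration of this deletion rule, the following Python sketch attaches a retention value function to each stored object and deletes the lowest-valued objects until a requested amount of space is free. The linear decay shape, the `StoredObject` record, and the `evict_for` helper are assumptions made for the example, not details of the described approach.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StoredObject:
    name: str
    size: int
    created_at: float
    retention: Callable[[float], float]  # maps object age to a value in [0, 100]

    def value(self, now: float) -> float:
        return self.retention(now - self.created_at)


def linear_decay(start: float, lifetime: float) -> Callable[[float], float]:
    """Assumed example shape: a non-increasing function that reaches 0 at `lifetime`."""
    def f(age: float) -> float:
        return max(0.0, start * (1.0 - age / lifetime))
    return f


def evict_for(objects: List[StoredObject], needed: int, capacity: int, now: float):
    """Delete lowest-valued objects until `needed` units are free; return their names."""
    used = sum(o.size for o in objects)
    deleted = []
    for obj in sorted(objects, key=lambda o: o.value(now)):
        if capacity - used >= needed:
            break
        objects.remove(obj)
        used -= obj.size
        deleted.append(obj.name)
    return deleted


# Hypothetical data echoing the headline/body example: the body decays faster.
store = [
    StoredObject("headline", size=1, created_at=0.0, retention=linear_decay(100, 100)),
    StoredObject("body", size=5, created_at=0.0, retention=linear_decay(100, 10)),
]
deleted = evict_for(store, needed=3, capacity=6, now=5.0)
```

At time 5 the headline is valued at 95 and the body at 50, so the body is deleted first. The value is relative: the body is removed because it ranks lowest when space is needed, not because its value crossed a fixed threshold.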
The creation of the retention value functions is generally the responsibility of the application, and the functions are defined at a much coarser level than that of the data objects themselves. Each data object belongs to a so-called retention class. All data objects in a particular retention class have retention values determined by the same retention value function. Thus, retention classes are the atomic unit on which retention value functions are defined. Different data objects within a retention class can have varying ages, and therefore have different values at any given time.
Occasionally, it may be useful to modify a particular retention value function, or to remove certain data objects from a retention class and add them to another, thus changing the retention value functions for those objects. Storage class retention function assignments and data object retention value function modifications are the job of analytics, and these are orthogonal to the present embodiment.
The above-described technique has been used in an environment including a single storage unit (e.g., vat). In an individual vat, space is essentially fluid, and deleting existing data frees up space for a comparable amount of new data. As a practical implementation, one can approximate this flow balance concept via a waterline. The waterline is defined for a given vat and time, so that data whose value is below this waterline will be deleted. Data whose value is at or above this waterline will be retained. The waterline rises and falls over time, depending on the amount of new data that must be added to the vat.
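One possible way to compute such a waterline for a single vat is sketched below in Python. The function name `compute_waterline` and the representation of stored data as `(value, size)` pairs are assumptions for illustration; the sketch picks the lowest waterline at which the retained data plus the incoming data fits in the vat.

```python
def compute_waterline(items, capacity, incoming):
    """Return the lowest waterline such that retained data plus `incoming` fits.

    `items` is a list of (current_value, size) pairs. Data valued at or above
    the waterline is retained; data valued below it is deleted.
    """
    for w in sorted({value for value, _ in items}):
        retained = sum(size for value, size in items if value >= w)
        if retained + incoming <= capacity:
            return w
    # Even the highest-valued data cannot stay: everything must be deleted.
    return float("inf")


# Hypothetical vat contents: (current retention value, size).
vat = [(90, 2), (50, 3), (10, 4)]
```

With a capacity of 10, an incoming load of 3 units yields a waterline of 50 in this sketch, so only the data valued at 10 is deleted. With no incoming load the waterline stays at 10 and nothing is deleted, which illustrates how the waterline rises and falls with the amount of new data.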
However, the notion of waterlines takes on a much different character when there are multiple vats (e.g., as in the distributed storage system 100 depicted in FIG. 1). Absent a global optimization strategy, the waterlines of the various vats may drift and become quite different over time. This may result in the deletion of higher valued data than would be removed in a scenario with one global vat with a single waterline. It would therefore clearly be useful if the waterlines of the various vats were identical, or, more precisely, as close as possible to equal, given the other constraints in the system.
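The cost of drifting waterlines can be seen in a small Python sketch. It assumes unit-size objects, two hypothetical vats with invented values, and a helper named `evicted_values`; none of these come from the system of FIG. 1.

```python
def evicted_values(values, capacity, incoming):
    """Evict lowest-valued unit-size objects until `incoming` new units fit."""
    keep = sorted(values, reverse=True)  # lowest value sits at the end
    evicted = []
    while keep and len(keep) + incoming > capacity:
        evicted.append(keep.pop())
    return evicted


# Hypothetical state: vat A holds high-valued data, vat B holds low-valued data.
vat_a = [90, 80]
vat_b = [20, 10]

# One new unit arrives at vat A, which manages its space independently.
local_loss = evicted_values(vat_a, capacity=2, incoming=1)

# The same arrival, if both vats behaved as one global vat with one waterline.
global_loss = evicted_values(vat_a + vat_b, capacity=4, incoming=1)
```

Under these assumptions, the independent vat deletes data valued at 80, while a single global vat would have deleted data valued at 10.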
Therefore, it is clear that a novel and very effective optimization method is necessary for a storage component of a distributed computer system to handle large scale stream processing applications.