Recent technological trends in flash media have made it an attractive alternative for data storage in a wide spectrum of computing devices such as PDAs, mobile phones, embedded sensors, and MP3 players. The success of flash media in these devices is due mainly to its advantages over disk drives: smaller size, lighter weight, better shock resistance, lower power consumption, less noise, and faster read performance. While flash memory has been the primary storage medium for embedded devices from the very beginning, it is increasingly entering the personal computer market segment as well. As its capacity increases and its price drops, flash media can overcome barriers to adoption when compared with lower-end, lower-capacity magnetic disk drives.
Current technology allows for running a full database system on flash-only computing platforms and running a light-weight database system on flash-based embedded computing devices. However, flash has fundamentally different read/write characteristics from other non-volatile media such as magnetic disks. In particular, flash writes are immutable: once written, a data page must be erased before it can be written again. Moreover, the unit of erase often spans multiple pages, further complicating storage management. With current practices, these unique characteristics can be hidden from applications via a software layer called the Flash Translation Layer (FTL), which enables mounting and using flash media like a disk drive. Using the FTL, conventional disk-based database algorithms and access methods will function correctly without any modification.
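The erase-before-write constraint described above can be illustrated with a toy model. The following is a minimal sketch, not an actual FTL or device interface: the class name FlashBlock, the four-page block size, and the method names are all assumptions introduced for illustration only.

```python
class FlashBlock:
    """Toy model of a single flash erase block (hypothetical, for illustration)."""

    def __init__(self, pages_per_block=4):
        # None marks an erased page, which can accept exactly one write.
        self.pages = [None] * pages_per_block
        self.erase_count = 0

    def write(self, page_no, data):
        # A page that already holds data cannot be overwritten in place.
        if self.pages[page_no] is not None:
            raise ValueError("page must be erased before it is rewritten")
        self.pages[page_no] = data

    def erase(self):
        # The unit of erase is the whole block, not a single page.
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def update_in_place(self, page_no, data):
        # Emulate an in-place update: keep a copy of the live pages,
        # erase the whole block, then write every live page back.
        saved = list(self.pages)
        saved[page_no] = data
        self.erase()
        writes = 0
        for i, content in enumerate(saved):
            if content is not None:
                self.write(i, content)
                writes += 1
        return writes
```

In this model, changing one page of a block that holds three live pages costs one block erase plus three page writes, which suggests why algorithms that overwrite data in place tend to be expensive on flash even when an FTL hides the details.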
However, since the FTL needs to internally deal with flash characteristics, many algorithms designed for magnetic disks are not likely to yield the best attainable performance. For example, algorithms that overwrite data in place may work well with magnetic disks, but will perform poorly with flash media. Thus, in order to make flash-based storage systems efficient, many algorithms need to be redesigned to take flash characteristics into account.
As a specific example, consider maintenance of a very large (e.g., several gigabytes) random sample of an evolving data stream. In this context, random sampling is an approximation technique used in many applications including data mining, statistics, and machine learning. In many scenarios, the sample needs to be very large to be effective. For example, when the underlying data has a high variance, a very large sample is required to provide accurate estimates with suitably high confidence. Moreover, variance in the data is often magnified by standard database operators like selections and joins, increasing the size of the sample required to ensure a target approximation accuracy.
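The dependence of sample size on variance can be made concrete with a standard bound. As an illustrative assumption (none of these symbols appear in the original text), suppose the sample consists of n independent draws from data with mean \mu and variance \sigma^2, and let \bar{X}_n denote the sample mean. Chebyshev's inequality then gives:

$$
\Pr\left(\left|\bar{X}_n - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n\,\varepsilon^2},
\qquad\text{so}\qquad
n \ge \frac{\sigma^2}{\varepsilon^2\,\delta}
\;\Longrightarrow\;
\Pr\left(\left|\bar{X}_n - \mu\right| \ge \varepsilon\right) \le \delta .
$$

Under this bound, the sample size that guarantees error at most \varepsilon with confidence 1 - \delta grows linearly with the variance, so any operator that magnifies the variance raises the required sample size in proportion.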
Another example is sensor networks, where each sensor collects too many readings to store all of them in its limited storage, and transmitting all its readings to a base station expends too much of its limited battery. In such a case, it is desirable for the sensor to maintain a random sample of its readings. Operatively, queries can be pushed to the sensor nodes, and answered using the sample points falling within a specified time window. Humans or data mules traveling next to a sensor node can be used to retrieve its sample for offline data mining or statistical analysis purposes; while such mules minimize the energy cost of retrieving data, they typically pass by a sensor node far too infrequently to collect more than a sample of its readings. It is desirable that the sample maintained on the sensor node is large (in many cases, as large as possible) because (i) scientists deploying the sensors usually want to collect as much data as possible, and (ii) a very large sample helps ensure that there will be a sufficient number of sample points within every time-window of interest.
However, currently deployed sampling algorithms fall short because they lack one or more of the following properties: (i) the algorithm is suitable for streaming data, or any similar environment where a large sample must be maintained online in a single pass through the dataset; (ii) the algorithm is efficient, in terms of latency or energy, on flash, i.e., it is flash-aware and avoids operations (e.g., in-place updates) that are expensive on flash; and (iii) the algorithm is tunable to both the amount of flash storage and the amount of standard memory (DRAM) available to it. With the third property, the algorithm can be tuned to a specified bounded sample size, DRAM-constrained embedded devices can use it, and less constrained devices can take advantage of their larger available DRAM.
For example, reservoir sampling and geometric file are two algorithms for maintaining a bounded-size sample. Both can be implemented to maintain a sample on flash media, but both require many in-place updates on flash and, hence, are very slow and energy-expensive on flash. Moreover, geometric file has a large DRAM footprint, and hence is not suitable for most embedded systems.
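To show where the in-place updates come from, the following is a minimal sketch of classic in-memory reservoir sampling (often called Algorithm R). It is the textbook version, not a flash-resident implementation. The function name reservoir_sample, the rng parameter, and the overwrites counter are assumptions added for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items in one pass.

    Returns the sample and the number of in-place overwrites performed.
    """
    rng = rng or random.Random()
    reservoir = []
    overwrites = 0
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir sequentially.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                # A randomly chosen slot is overwritten in place.
                reservoir[j] = item
                overwrites += 1
    return reservoir, overwrites
```

Each overwrite lands on a randomly chosen slot. If the reservoir were stored directly on flash, every such overwrite would become an erase-and-rewrite of the block holding that slot. For a stream of n items the expected number of overwrites is roughly k ln(n/k), which grows with both the sample size and the stream length.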
From the foregoing it is appreciated that there exists a need for systems and methods to ameliorate the shortcomings of existing practices.