Increased instrumentation of physical systems and computing processes has caused a substantial amount of data to be generated, collected, and analyzed. For example, applications for data center monitoring, environmental monitoring, scientific experiments, and mobile asset tracking, among others, produce massive time-series signals from multiple sensors. Some existing data analysis systems can execute certain queries in real time over received time-series signals. Conventional data analysis systems, however, are unable to efficiently archive and analyze time-series signals over long periods of time.
Particularly, archiving and query processing can be challenging for conventional data analysis systems due to the sheer volume of data that can be generated by sensors associated therewith. For example, a data center for an online service provider can include tens of thousands of servers, and one hundred performance counters can be collected from each server to monitor server utilization. Additionally, for each server, ten physical sensors can be used to monitor power consumption and operating environment (e.g., internal and external temperatures pertaining to a server). Thus, a data center with fifty thousand servers, at 110 signals per server, can be associated with 5.5 million concurrent data streams and, with a 30-second sampling interval, can have approximately fifteen billion records (about one terabyte) of data generated per day. While the most recent data are used in connection with real-time monitoring and control pertaining to the data center, historical data can be used in connection with capacity planning, workload placement, pattern discovery, and fault diagnostics. Many of these tasks require utilization of time-series signals over several months. Due to the sheer volume of the data, archiving such data in a raw form over several months can consume prohibitively large amounts of storage space, while executing queries over such data may be impractically slow.
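The stream and record counts in the data center example follow directly from the per-server figures, and the arithmetic can be checked with a short sketch. The server count, per-server signal counts, and 30-second sampling interval are taken from the example above. The 64-byte average record size is an assumption introduced here only to relate the record count to the "about one terabyte" figure; the text does not state a record size.

```python
# Back-of-the-envelope check of the data center example.
SERVERS = 50_000
PERF_COUNTERS_PER_SERVER = 100
PHYSICAL_SENSORS_PER_SERVER = 10
SAMPLING_INTERVAL_S = 30
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption (not from the text): average archived size of one record, in bytes.
ASSUMED_BYTES_PER_RECORD = 64

# One stream per performance counter or physical sensor on each server.
streams = SERVERS * (PERF_COUNTERS_PER_SERVER + PHYSICAL_SENSORS_PER_SERVER)

# Each stream yields one record per sampling interval.
samples_per_stream_per_day = SECONDS_PER_DAY // SAMPLING_INTERVAL_S
records_per_day = streams * samples_per_stream_per_day
bytes_per_day = records_per_day * ASSUMED_BYTES_PER_RECORD

print(f"concurrent streams: {streams:,}")
print(f"records per day:    {records_per_day:,}")
print(f"raw volume per day: {bytes_per_day / 1e12:.2f} TB")
```

Under these figures the example yields 5,500,000 concurrent streams and 15,840,000,000 records per day, which at the assumed record size is roughly one terabyte of raw data daily, or on the order of ninety terabytes over three months.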
Conventional data analysis and database applications address space-efficient archival and query processing separately. For example, many database systems compress data for space efficiency; however, the data must be decompressed before queries can be executed over it. For large amounts of data, such an approach may be infeasible, since decompression overhead would cause query latency to become too great for practical use.
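The decompress-before-query pattern described above can be illustrated with a minimal sketch. It is not drawn from any particular database system: the general-purpose `zlib` compressor, the synthetic signal, and the helper name `max_in_range` are all assumptions chosen for illustration.

```python
import struct
import zlib

# Hypothetical signal: one day of 30-second samples (2,880 values).
samples = [20.0 + (i % 60) * 0.1 for i in range(2880)]

# Archive step: pack as little-endian 8-byte floats, then compress for space.
raw = struct.pack(f"<{len(samples)}d", *samples)
archived = zlib.compress(raw)


def max_in_range(archived_blob: bytes, start: int, stop: int) -> float:
    """Answer a range query by first decompressing the entire archive."""
    restored = zlib.decompress(archived_blob)
    values = struct.unpack(f"<{len(restored) // 8}d", restored)
    # Only values[start:stop] are needed, yet every value was restored.
    return max(values[start:stop])


print(f"raw bytes: {len(raw)}, archived bytes: {len(archived)}")
print(f"max over first hour: {max_in_range(archived, 0, 120)}")
```

In this sketch the query touches only 120 of the 2,880 samples, but the full archive is decompressed to answer it. That cost grows with the size of the archive rather than the size of the query, which is the overhead the paragraph above identifies.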