This disclosure relates generally to efficiently storing and processing data in a database system, and in particular to storing and processing time series data in a partitioned database system.
Time-series data is generated and processed in several contexts: monitoring and developer operations (DevOps), sensor data and the Internet of Things (IoT), computer and hardware monitoring, fitness and health monitoring, environmental and farming data, manufacturing and industrial control system data, financial data, logistics data, application usage data, and so on. Often this data is high in volume, for example, individual data sources may generate high rates of data, or many different sources may contribute data. Furthermore, this data is complex in nature, for example, a source may provide multiple measurements and labels associated with a single time. The volume of this stored data often increases over time as data is continually collected. Analytical systems typically query this data to analyze the past, present, and future behavior of entities associated with the data. This analysis may be performed for varied reasons, including examining historical trends, monitoring current performance, identifying the root cause of current problems, and anticipating future problems such as for predictive maintenance. As a result, operators are not inclined to delete this potentially valuable data.
Conventional systems fail to support the high write rates that are typical of many of these applications, which span across industries. For example, in Internet of Things (IoT) settings including industrial, agricultural, consumer, urban, or facilities, high write rates result from large numbers of devices coupled with modest to high write rates per device. In logistics settings, both planning data and actuals comprise time series that can be associated with each tracked object. Monitoring applications, such as in development and operations, may track many metrics per system component. Many forms of financial applications, such as those based on stock or option market ticker data, also rely on time-series data. All these applications require a database that can scale to a high ingest rate.
Further, these applications often query their data in complex and arbitrary ways, beyond simply fetching or aggregating a single metric across a particular time period. Such query patterns may involve rich predicates (e.g., complex conjunctions in a WHERE clause), aggregations, statistical functions, windowed operations, JOINs against relational data, subqueries, common table expressions (CTEs), and so forth. Yet these queries need to be executed efficiently.
Therefore, storing time-series data demands both scale and efficient complex queries. Conventional techniques fail to achieve both of these properties in a single system. Users have typically been faced with the trade-off between the horizontal scalability of “NoSQL” databases versus the query power of relational database management systems (RDBMS). Existing solutions for time-series data require users to choose between either scalability or rich query support.
Traditional relational database systems that support database query languages such as SQL (structured query language) have difficulty handling high ingest rates: They have poor write performance for large tables, and this problem only becomes worse over time as data volume grows linearly in time. Further, any data deletion requires expensive “vacuuming” operations to defragment the disk storage associated with such tables. Also, out-of-the-box open-source solutions for scaling-out RDBMS across many servers are still lacking.
Existing NoSQL databases are typically key-value or column-oriented databases. These databases often lack a rich query language or secondary index support, however, and suffer high latency on complex queries. Further, they often lack the ability to join data between multiple tables, and lack the reliability, tooling, and ecosystem of more widely-used traditional RDBMS systems.
Distributed block or file systems avoid the need to predefine data models or schemas, and easily scale by adding more servers. However, they pay the cost for their use of simple storage models at query time, lacking the highly structured indexes needed for fast and resource-efficient queries.
Conventional techniques that also fail to support an existing, widely-used query language such as SQL and instead create a new query language, require both new training by developers and analysts, as well as new customer interfaces or connectors to integrate with other systems.