A distributed database is a database in which storage devices are not all attached to a common processing unit, such as a central processing unit. Instead, multiple computers are used to implement a distributed database management system. The multiple computers may be located in the same physical location, or they may be dispersed over a network of interconnected computers. There is typically a master node and a set of slave or worker nodes that store partitions of the distributed database.
An analytical view is a subset of data from a table or multiple tables. The analytical view may be computed by applying joins, unions, filters or other Structured Query Language (SQL) operations to the table or tables. The analytical view typically comprises dimensions and measures, although either dimensions or measures may be absent. The analytical view may comprise an attribute (e.g., a column name or dimension) and a measure (e.g., an aggregate, such as sum, min, max) that is defined prior to the receipt of a query and is maintained as a data unit separate from the table. An attribute can serve as a dimension or as a measure. When data is grouped along an attribute, the attribute becomes a dimension. When data is aggregated on an attribute, the attribute becomes a measure. For example, in the case of the request for ‘sum(amt) by product_id’, product_id and amt are both attributes in the table. Here, product_id is used as a dimension and amt is used as a measure. The analytical view exposes a dimension ‘product_id’ and an aggregate ‘sum(amt)’.
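The ‘sum(amt) by product_id’ example can be illustrated with a minimal Python sketch. The table contents and the names sales and build_sum_view are hypothetical and chosen only for illustration; the sketch shows the idea of a view kept as a data unit separate from the table, not any particular system's implementation.

```python
from collections import defaultdict

# Hypothetical base table: each row is (product_id, amt).
sales = [
    ("p1", 10.0),
    ("p2", 5.0),
    ("p1", 2.5),
    ("p3", 7.0),
    ("p2", 1.0),
]

def build_sum_view(rows):
    """Group rows along product_id (the dimension) and sum amt (the measure)."""
    view = defaultdict(float)
    for product_id, amt in rows:
        view[product_id] += amt
    return dict(view)

# The view is computed before any query arrives and is stored
# separately from the base table.
sum_amt_by_product = build_sum_view(sales)
```

A later request for ‘sum(amt) by product_id’ could then be served by reading sum_amt_by_product instead of rescanning the table.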
Database systems use analytical views to expedite query processing. Analytical views typically materialize (e.g., cache) data resulting from computations frequently needed by queries. When a database system can prove that, semantically, it is correct to answer the query using the data in an analytical view, the system uses the pre-aggregated data from the analytical view to save processor and input/output bandwidth. This results in expedited processing of the query.
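One simplified way to picture this check is sketched below in Python. It assumes a hypothetical view that stores sum(amt) grouped by product_id and region, and it treats a query as answerable when it asks for the same measure over a subset of the view's dimensions. The names VIEW_DIMS, view_rows, can_answer and answer_from_view, and the region attribute, are assumptions made for this illustration; real systems apply considerably richer semantic tests.

```python
from collections import defaultdict

# Hypothetical analytical view: sum(amt) grouped by (product_id, region).
VIEW_DIMS = ("product_id", "region")
VIEW_MEASURE = "sum(amt)"
view_rows = {
    ("p1", "east"): 10,
    ("p1", "west"): 5,
    ("p2", "east"): 7,
}

def can_answer(query_dims, query_measure):
    """Simplified test: same measure, and query dimensions are a subset of the view's."""
    return query_measure == VIEW_MEASURE and set(query_dims) <= set(VIEW_DIMS)

def answer_from_view(query_dims, query_measure):
    """Roll the pre-aggregated sums up to the query's dimensions, or return None."""
    if not can_answer(query_dims, query_measure):
        return None  # the caller would fall back to scanning the base table
    positions = [VIEW_DIMS.index(d) for d in query_dims]
    result = defaultdict(int)
    for key, partial_sum in view_rows.items():
        result[tuple(key[i] for i in positions)] += partial_sum
    return dict(result)

by_product = answer_from_view(("product_id",), "sum(amt)")
```

The roll-up step is valid here only because partial sums can be added together; the same shortcut would not be correct for every aggregate.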
Aggregate functions such as sum, count, min and max are easily decomposable and are therefore straightforward to embed and use in analytical views. However, other aggregate functions, such as number of distinct values, average, standard deviation and variance, are not so easily decomposable and therefore require special processing.
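The difference can be seen in a small Python sketch over two hypothetical partitions. It is an illustration under simple assumptions, not the technique of any particular system: partial sums combine by addition, whereas per-partition averages and distinct counts do not combine directly, so the sketch carries extra state (a sum and count pair, and the set of values seen). The names partial_state, merge and finalize are hypothetical.

```python
# Two hypothetical partitions of the same column, held on different nodes.
partition_a = [1, 2, 2, 3]
partition_b = [3, 4]

def partial_state(values):
    """Per-partition state that remains combinable across partitions."""
    return {"sum": sum(values), "count": len(values), "distinct": set(values)}

def merge(a, b):
    """Combine two partial states into the state of the union of their data."""
    return {
        "sum": a["sum"] + b["sum"],
        "count": a["count"] + b["count"],
        "distinct": a["distinct"] | b["distinct"],
    }

def finalize(state):
    """Derive the final aggregates from a merged state."""
    return {
        "sum": state["sum"],
        "avg": state["sum"] / state["count"],
        "count_distinct": len(state["distinct"]),
    }

state_a = partial_state(partition_a)
state_b = partial_state(partition_b)
combined = finalize(merge(state_a, state_b))

# Naive combinations of already-finalized values give wrong answers.
naive_avg = (sum(partition_a) / len(partition_a)
             + sum(partition_b) / len(partition_b)) / 2
naive_distinct = len(set(partition_a)) + len(set(partition_b))
```

In this data the value 3 appears in both partitions, so adding the per-partition distinct counts overcounts it, and averaging the two per-partition averages ignores that the partitions differ in size. Keeping a full set of values is the simplest exact state for distinct counts, though its size grows with the number of distinct values.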
Accordingly, there is a need to compute and cache complex aggregates in a distributed database, such that the cached aggregates can be combined to yield results when needed.