A distributed database is a database in which storage devices are not all attached to a common processing unit, such as a central processing unit. Instead, multiple computers are used to implement a distributed database management system. The multiple computers may be located in the same physical location, or they may be dispersed over a network of disaggregated interconnected computers. There is typically a master node and a set of slave or worker nodes that store partitions of the distributed database.
Distributed databases are commonly used to support Business Intelligence applications. Business Intelligence data is typically modeled using a star or a snowflake schema where fact table(s) keep track of facts, such as itemized sales. Dimensions describe attributes of the facts, e.g., a ‘customer’ dimension table describes the details of the customer, a ‘store’ dimension table describes stores etc. Fact tables are typically wide tables with a large number of columns and rows. Dimension tables are typically smaller in size.
The star schema gets its name from the physical model's resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star's points. FIG. 1 is an example of a star schema with a fact table in the center and eight different dimension tables surrounding the fact table. FIG. 2 illustrates a specific star schema with a fact table named “sales’ in the middle and dimension tables respectively named “Products”, “Customers”, “Promotions”, “Period” and “Stores” surrounding the fact table.
In order to create a normalized schema, dimension tables have a primary key which relates to a foreign key in a fact table. For example, a store dimension has a primary key ‘s_store_sk’ that relates to the foreign key ‘ss_store_sk’ in a sales fact table. This relationship is defined in a database using a primary-foreign key constraint (pk-fk constraint) or a referential integrity (RI) constraint. The pk-fk relationships are hard to maintain in distributed environments when the fact tables and dimension tables are periodically updated with data.
Some database systems use ‘informational’ constraints to allow the databases to perform certain optimizations assuming the pk-fk holds and leaving the integrity of the system to be enforced by the user.
In a Structured Query Language (SQL) query, the fact table foreign key equi-joins with the dimension's primary key. The pk-fk constraint or referential constraint guarantees that the ‘pk’ is unique in the dimension table and it uniquely determines other attributes in the dimension table. For example, product_id in a products table uniquely determines product name and product description. The existence of a foreign key guarantees that a row in the fact table with product id say, ‘pid1’, will necessarily find one exact matching row in the dimension table ‘product’ with product id of ‘pid1’. There may be multiple rows in the fact table with the same product id.
FIG. 3 is an example of a snowflake schema with a fact table in the center, surrounded by dimension tables, which are surrounded by additional tables. Reference herein to a star schema also implicates a snowflake schema.
The Transaction Processing Performance Council (TPC) has established a Decision Support (DS) benchmark known as TPC-DS. FIG. 4 illustrates a TPC-DS benchmark snowflake schema.
FIG. 5 illustrates a schema that supports fact table to fact table joins. A typical fact-to-fact join involves each fact table aggregating its own measures and joining with its sets of dimensions and then these aggregated fact tables are joined with each other, typically using common dimensions.
Queries based on star or snowflake schema are notoriously expensive to compute for two reasons. First, the amount of data in fact tables is typically huge. Second, since the queries join on the primary-foreign keys, the joins are not reducing. That is, the number of rows in the result of the query is the same as the number of rows in the fact table that survives local filtering on the fact table. This is unlike non-star-schema joins where in the majority of the cases, the number of rows in the result of the join is less than the number of the rows in the bigger of the tables.
Given the high concurrency that Business Intelligence applications demand—typically on the order of 100's to 1000s of concurrent sessions—it is not practical to pay the cost of computing these expensive joins upon every BI application (dashboard, visual) interaction. To alleviate this, BI systems rely on two core strategies. First, joins are pre-materialized to provide fast query response. The pre-materialization is sometimes called cubes or lenses or aggregate tables. Advantageously, pre-materialization pays the cost of the join once, so all BI dashboard-time queries can be executed against a single table. On the other hand, pre-materialization has a high initial cost of computation, which adds to a delay before the BI application can be used. This is typically called “cube building time”, which adds latency to the application. Another problem is latency due to maintenance. If new data is added to any underlying table, typically the cube needs a re-computation. If a user finds that a cube does not model the necessary dimensions or measures, there is a further cost to modifying the definition of the cube and materializing it.
Another prior art technique to avoid the expense of joins is the caching of query results. This has the advantage of fast query time during the BI interactions. However, this is severely limited since query result caching does not address BI interactions like drill down, roll ups, and filters. Without that functionality a query result tier is not sufficient for BI applications to be used in a production setting.
In view of the foregoing, there is a need for improved techniques for executing joins for star and snowflake schema in big data applications.