A datacenter, or a system of interconnected datacenters, may use various types of data storage to store large amounts of data. The data may be made accessible to a customer and/or internal systems using an interface that accepts queries from a query language, such as the Structured Query Language (“SQL”). To access the data, the customer and/or internal systems may provide an SQL query to the interface residing on an intermediate server that facilitates access to the data residing in the data storage. A typical SQL query that the customer may pass to the intermediate server is a SQL query that includes the JOIN operation. The SQL query may contain additional clauses that specify further processing of the result of the join operation and other clauses, such as “WHERE” clauses, “GROUP BY” clauses, “HAVING” clauses, “ORDERED BY” clauses, and other such clauses as described by the SQL standard.
The data storage may be organized as part of a relational database. More particularly, the data of the relational database may be organized as a hierarchy. For example, suppose there is a set of database tables A, B, C, and D, where A is the parent, or root, of the hierarchy, B and D are children database tables of A, and C is a child database table of B. More particularly, each record of B may be a child of exactly one record of A, each record of C may be a child of exactly one record of B, and each record of D may be a child of exactly one row of A. In other words, a data set that conforms to the aforementioned hierarchy may be imagined as a set of hierarchical trees, each of which have a record of A at the top.
A customer and/or internal system may have access to this hierarchical data structure via the intermediate server discussed above. More particularly, the customer may perform one or more SQL operations on this hierarchical data structure, such as a SQL query that includes the “JOIN” operation. One example of a query that the customer may pass to the intermediate server may be “SELECT * FROM A [LEFT OUTER] JOIN B ON <predicates B> [LEFT OUTER] JOIN C ON <predicates C>,” where, “[LEFT OUTER]” indicates that the “LEFT OUTER” operation is optional, and <predicates X> indicates a series of Boolean conjuncts on a table X and all preceding tables in the JOIN operation (e.g., tables A and B, or tables A, B, and C) similar to those described in the SQL standard. The query may also include a “WHERE” clause, where the predicates of the “WHERE” clause are a series of Boolean conjuncts on table A, table B, and/or table C. More specifically, the brackets around “LEFT OUTER” indicate that the “LEFT OUTER” instruction may be specified, but that it may be omitted.
As is understood in the art, a “JOIN” operation, e.g., “A JOIN B ON <predicates B> JOIN C ON <predicates C>,” combined with a “WHERE” clause, “WHERE <predicates WHERE>,” produces all combinations of records (record1, record2, record3), such that record1 is a record of table A, record2 is a record of table B, and record3 is a record of table C, and all Boolean predicates <predicates B>, <predicates C>, and <predicates WHERE> are satisfied. When specified for table B, the “LEFT OUTER JOIN” operation will produce a row for table A and table C regardless of whether there are rows in table B that satisfy “<predicates B>”. As understood in the art, when specified for table C, the “LEFT OUTER JOIN” operation will produce a row for table A and table B regardless of whether there are rows in table C that satisfy “<predicates C>”.
Normally, one customer and/or internal system issuing this a single query with a “JOIN” operation may not detract from the resources (e.g., available memory, available network bandwidth, available read/seek times, available processing cycles, etc.) of the datacenter. However, where a datacenter services thousands of customers, which may themselves serve thousands of clients, the resources of the datacenter quickly vanish when hundreds or thousands of these JOIN queries are being received and performed.