Distributed computing refers to hardware and software systems containing multiple processing elements and concurrent processes running under loose control. In particular, in distributed computing, a program is split into parts that run simultaneously on multiple computers communicating over a network. Shared nothing architecture distributed computing refers to a computing architecture where each node in the network is independent and self-sufficient. Such a system stands in contrast to one that keeps a large amount of information in central storage, such as a database or data warehouse.
A query processing task to be performed in a distributed environment is split into operators. An operator is a unit of work to complete a sub-task associated with the task. The unit of work may be an operation code (opcode) or a set of opcodes. An opcode is the portion of a machine language instruction that specifies an operation to be performed. The specification and format of an opcode are defined by the instruction set architecture of the underlying processor. A collection of operators forms a data processing operation that executes in a pipelined fashion.
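The pipelined execution of a collection of operators can be illustrated with a minimal sketch. The sketch assumes each operator is an ordinary Python generator function rather than an opcode, and it runs in a single process rather than across nodes. The names filter_op, project and run_pipeline are hypothetical and are used only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
# Hypothetical operator type: consumes a stream of rows, yields a stream of rows.
Operator = Callable[[Iterator[Row]], Iterator[Row]]


def filter_op(predicate: Callable[[Row], bool]) -> Operator:
    """Build an operator that passes through only rows matching the predicate."""
    def op(rows: Iterator[Row]) -> Iterator[Row]:
        for row in rows:
            if predicate(row):
                yield row
    return op


def project(columns: List[str]) -> Operator:
    """Build an operator that keeps only the named columns of each row."""
    def op(rows: Iterator[Row]) -> Iterator[Row]:
        for row in rows:
            yield {name: row[name] for name in columns}
    return op


def run_pipeline(source: Iterable[Row], operators: List[Operator]) -> Iterator[Row]:
    """Chain operators so each row flows through all of them one at a time."""
    stream = iter(source)
    for op in operators:
        stream = op(stream)
    return stream


rows = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 20},
    {"user": "c", "amount": 30},
]
result = list(run_pipeline(rows, [filter_op(lambda r: r["amount"] > 10),
                                  project(["user"])]))
```

Because each operator is a generator, a row leaves the filter and enters the projection before the next row is read, which is the sense of "pipelined" illustrated here. No intermediate result is materialized between operators.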
An operator works on objects. As used herein, an object refers to operands or data that are processed by an operator. In a distributed computing environment, objects are commonly processed as batches, partitions, keys and rows. A batch is a large collection of data (e.g., 1 billion rows). Partitions define the division of data within a batch. Keys correlate a set of data within a partition. Each key has an associated set of data, typically in one or more rows, also called tuples.
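The relationship among batches, partitions, keys and rows can be sketched as a nested data structure. The sketch below assumes hash partitioning on a key column, which is only one possible partitioning scheme and is not stated above. The function name partition_batch and the column names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]
# A partition maps each key to the rows (tuples) associated with that key.
Partition = Dict[object, List[Row]]


def partition_batch(batch: List[Row], key_column: str,
                    num_partitions: int) -> List[Partition]:
    """Divide a batch into partitions and group each partition's rows by key.

    Assumption: a CRC32 hash of the key's string form selects the partition,
    so every row with a given key lands in the same partition.
    """
    partitions: List[Partition] = [defaultdict(list) for _ in range(num_partitions)]
    for row in batch:
        key = row[key_column]
        index = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        partitions[index][key].append(row)
    return partitions


batch = [
    {"user": "alice", "amount": 5},
    {"user": "bob", "amount": 20},
    {"user": "alice", "amount": 7},
]
partitions = partition_batch(batch, key_column="user", num_partitions=4)
```

In this sketch the batch is the input list, each element of the returned list is a partition, and each partition correlates a key with its rows. A stable hash is used instead of Python's built-in hash for strings, which varies between processes and would therefore be unsuitable if separate nodes had to agree on key placement.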
Shared nothing architecture distributed computing holds great promise because of its scalability. However, the sizes of the batches of data handled in such environments create many challenges with respect to storing and accessing the data. Processing queries against data of this size is also challenging. Accordingly, it would be desirable to provide improved data storage, access and query processing in a shared nothing architecture distributed computing system.