With the proliferation of technology, we face exponential growth in the information and data that must be stored, analyzed, and acted upon by computers. It is generally believed that data volumes grow at a compounded annual growth rate (CAGR) of approximately sixty percent. At that rate, data volumes double roughly every eighteen months.
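The doubling period implied by a compounded annual growth rate r follows from ln 2 / ln(1 + r). The short sketch below checks that arithmetic; the function name `doubling_time_years` is illustrative and does not come from the text. At exactly sixty percent the result is about a year and a half, while a rate near forty percent corresponds to a doubling period of about two years.

```python
import math


def doubling_time_years(cagr: float) -> float:
    """Years needed for a quantity to double at a compounded annual growth rate.

    `cagr` is a fraction, e.g. 0.60 for sixty percent per year.
    """
    if cagr <= 0:
        raise ValueError("growth rate must be positive")
    # Solve (1 + cagr) ** t == 2 for t.
    return math.log(2) / math.log(1 + cagr)


if __name__ == "__main__":
    for rate in (0.40, 0.60):
        years = doubling_time_years(rate)
        print(f"CAGR {rate:.0%}: volume doubles in about {years:.2f} years")
```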
Computers and computer-related infrastructure have evolved to keep pace with this exponential data growth. For more than three decades, it has been shown that large collections of inexpensive computers can be assembled and their collective power brought to bear on large and complex problems.
These assemblages of computers are often based on the “Shared Nothing” (SN) architecture. In this architecture, a collection of individual computers (each called a node) is first assembled, with each node containing CPUs, disk storage, dynamic memory, network interface controller(s), and some software programs. The CPUs on each node, and any software programs that run on that node, have complete and direct access to all information on that node but no direct access to any information resident on another node.
It has been demonstrated that SN architectures can be efficiently scaled up to hundreds, thousands, and tens of thousands of nodes. For some kinds of data processing, these architectures can demonstrate linear or very close to linear scalability. In other words, if one system consists of M identical nodes and another consists of N identical nodes, with M > N, the system with M nodes can perform M/N times as much work in a given interval of time as the system with N nodes. In some cases this means that the system with M nodes can complete a given piece of work M/N times faster than the system with N nodes.
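The two readings of linear scalability described above (more work in the same time, or the same work in less time) can be sketched as follows. This is an idealized model only: the function names and the sample figures are assumptions chosen for illustration, and real systems typically fall somewhat short of the ideal.

```python
def ideal_work_ratio(m_nodes: int, n_nodes: int) -> float:
    """Work an M-node system can do relative to an N-node system
    in the same interval, assuming perfectly linear scalability."""
    if m_nodes <= 0 or n_nodes <= 0:
        raise ValueError("node counts must be positive")
    return m_nodes / n_nodes


def ideal_elapsed_seconds(baseline_seconds: float, n_nodes: int, m_nodes: int) -> float:
    """Time an M-node system needs for a job that takes `baseline_seconds`
    on an N-node system, assuming perfectly linear scalability."""
    return baseline_seconds / ideal_work_ratio(m_nodes, n_nodes)


if __name__ == "__main__":
    # Hypothetical figures: a job that takes 600 s on 10 nodes, rerun on 40 nodes.
    print(ideal_work_ratio(40, 10))
    print(ideal_elapsed_seconds(600.0, n_nodes=10, m_nodes=40))
```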
SN database systems, called “Parallel Database Management Systems” (PDBMS), achieve their scalability and performance by having a large number of nodes each perform a part of the processing, on a subset of the problem, in parallel.
In such systems, the tuples of each relation in the database are partitioned (declustered) across the disk storage units attached directly to each node. Partitioning allows multiple processors to scan large relations in parallel without needing any exotic I/O devices. Such architectures were pioneered by Teradata in the late 1970s, by Netezza in the 2000s, and by several research projects.
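Declustering and a parallel scan can be illustrated with a small simulation. The sketch below assumes hash partitioning on the first column of each tuple (other schemes, such as range or round-robin partitioning, are equally possible) and uses threads as stand-ins for nodes; in an actual SN system each partition would reside on a separate machine. All function names and the sample relation are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a partitioning key to a node number with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(tuples, num_nodes, key_index=0):
    """Decluster tuples into one list per node, keyed on one column."""
    partitions = [[] for _ in range(num_nodes)]
    for row in tuples:
        partitions[node_for_key(str(row[key_index]), num_nodes)].append(row)
    return partitions


def local_scan(rows, predicate):
    """Scan performed by one node over its own partition only."""
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    """Run every node's local scan concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(lambda rows: local_scan(rows, predicate), partitions))
    return [row for part in partial for row in part]


if __name__ == "__main__":
    # Hypothetical relation: (order_id, amount)
    orders = [(i, i * 10) for i in range(1, 101)]
    parts = partition(orders, num_nodes=4)
    big_orders = parallel_scan(parts, lambda row: row[1] > 900)
    print(sorted(big_orders))
```

Each node touches only its own partition, so no node needs access to another node's storage; the merge step is the only point at which results are combined.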
SN architectures minimize interference by minimizing resource sharing and contention. They also exploit commodity processors and memory without needing an exceptionally powerful interconnection network.