Massively parallel processing (MPP) is the coordinated processing of a program by multiple processors, with each processor working on a different part of the program. The processors communicate with one another to complete a task, and each of them uses its own operating system and memory resources.
An MPP database system is based on a shared-nothing architecture, with the tables of its databases partitioned into segments and distributed to different processing nodes. There is no data sharing among the processing nodes. When database queries arrive, the work of each query is divided, and each part is assigned to one of the processing nodes according to a data distribution plan and an optimized execution plan. The processing entities in each processing node manage only their own portion of the data. However, these processing entities may communicate with one another to exchange necessary information while executing their work. A query may be divided into multiple sub-queries, and the sub-queries may be executed in parallel or in some optimal order on some or all of the processing nodes. The results of the sub-queries may be aggregated and further processed, and more sub-queries may subsequently be executed according to those results.
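The partition, execute, and aggregate flow described above can be illustrated with a minimal sketch. It is not the design of any particular MPP product: the node count, the CRC32-based placement rule, the function names (node_for, distribute, local_sum, parallel_total), and the sample orders table are all assumptions made for illustration, and threads stand in for independent processing nodes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Map a distribution key to a node index with a stable hash (assumed rule)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes=NUM_NODES):
    """Partition a table into one segment per node; no row is stored twice."""
    segments = [[] for _ in range(num_nodes)]
    for row in rows:
        segments[node_for(row[key_column], num_nodes)].append(row)
    return segments


def local_sum(segment, column):
    """Sub-query run by one node over its own segment only."""
    return len(segment), sum(row[column] for row in segment)


def parallel_total(segments, column):
    """Run the sub-query on every segment in parallel, then aggregate the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda seg: local_sum(seg, column), segments))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return count, total


# Hypothetical sample table.
orders = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "carol", "amount": 50},
    {"customer": "alice", "amount": 10},
    {"customer": "dave", "amount": 40},
]

segments = distribute(orders, "customer")
count, total = parallel_total(segments, "amount")
print(count, total)  # 5 150
```

Each segment is read only by the sub-query assigned to it, which mirrors the shared-nothing property; the only cross-node step in this sketch is the final aggregation of partial results.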
One of the challenges in an MPP database system has always been setting up the distributed system and distributing the data. How the data is distributed, and how closely that distribution is aligned with the business logic, largely determines the overall performance of the system.
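One way to see why the choice of distribution matters is to compare how evenly two candidate distribution keys spread the same table across nodes. The sketch below is only illustrative: the hash rule, the skew metric (largest segment divided by mean segment size), the function names, and the synthetic table with its order_id and region columns are all assumptions, not part of any specific system.

```python
import zlib
from collections import Counter


def segment_sizes(rows, key_column, num_nodes):
    """Count how many rows each node would hold under a given distribution key."""
    sizes = Counter()
    for row in rows:
        node = zlib.crc32(str(row[key_column]).encode("utf-8")) % num_nodes
        sizes[node] += 1
    return [sizes.get(n, 0) for n in range(num_nodes)]


def skew(sizes):
    """Largest segment divided by the mean segment size (1.0 is perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical table: 1000 orders, each tagged with one of only two regions.
rows = [
    {"order_id": i, "region": "east" if i % 2 else "west"}
    for i in range(1000)
]

by_region = segment_sizes(rows, "region", 4)      # low-cardinality key
by_order = segment_sizes(rows, "order_id", 4)     # high-cardinality key

print("region  :", by_region, "skew", round(skew(by_region), 2))
print("order_id:", by_order, "skew", round(skew(by_order), 2))
```

With only two distinct region values, at most two of the four nodes can receive any rows, so the busiest node holds at least twice the average load and the remaining nodes sit idle for scans of this table. A key with many distinct values has the chance to spread rows more evenly, although whether it suits the workload still depends on which columns the queries join and filter on.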