One of the key advantages of storing large amounts of data in a database is that a specific subset of the stored data can be retrieved in an organized manner. Often, the subset of the stored data that is retrieved is analyzed to study various indications, such as economic trends, consumer reactions, and the like. To learn about customers, businesses are collecting various types of information about their customers, such as personal data, geographic/demographic data, purchasing habits, and so forth. Such customer data are stored in a database system, such as in a relational database management system (RDBMS), where the data can be processed and sorted into a format suitable for reporting or analysis. An example of a database system in which such information is collected is a data warehouse in which data is input from a variety of sources and organized into a format that is structured for query and analysis or reporting. The volume of data collected in most large data warehouses is at least several gigabytes and often exceeds tens or even hundreds of terabytes.
To handle the massive amount of data that is collected and processed in such data warehouses, sophisticated platforms are typically employed. The platforms include parallel processing systems, such as massively parallel processing (MPP) systems or symmetric multiprocessing (SMP) systems. An MPP system typically is a multi-node system having a plurality of physical nodes interconnected by a network. An SMP system typically is a single-node system having multiple processors. Collected data is stored in storage devices in such systems, which are accessible by the various nodes or processors to perform processing. In a parallel system, stored data portions are accessible in parallel to increase access speeds.
A user often interfaces with a database system to perform several tasks. These tasks include storing data, retrieving data, performing data queries, and the like. In order to utilize computing resources efficiently, these tasks can be performed in parallel. Control of the tasks described above can be exercised from within the database, from a database server system, or from a remote system such as a client system.
In conventional parallel processing environments, a typical application usually includes several tasks. Each of these tasks is generally responsible for a portion of the application's workload. Sometimes, an application can be parallelized based upon its functions; that is, each task can perform a different function. This approach is called functional parallelism. Another way of parallelizing an application is to divide its input, output, or intermediate data into multiple portions and to assign a task to each data portion. This approach is often called data parallelism. In either case, each such task is usually independent of the other tasks in the application in the sense that it does not need to share internal processing states or information with them. Consequently, each task can be executed independently of the other tasks, in a concurrent or simultaneous manner, the latter being the case in systems with multiple processors.
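The distinction between the two forms of parallelism can be illustrated with a minimal Python sketch. The function names `data_parallel_sum` and `functional_parallel_stats` are hypothetical and chosen only for illustration; a thread pool from the standard library stands in for whatever task mechanism a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel_sum(values, num_tasks):
    """Data parallelism: divide the input into portions, one task per portion."""
    # Stride slicing gives each task its own independent portion of the input.
    portions = [values[i::num_tasks] for i in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        # Every task runs the same function (sum) on a different portion.
        partial_sums = list(pool.map(sum, portions))
    return sum(partial_sums)


def functional_parallel_stats(values):
    """Functional parallelism: each task performs a different function."""
    functions = {"min": min, "max": max, "count": len}
    with ThreadPoolExecutor(max_workers=len(functions)) as pool:
        # Every task sees the same input but applies its own function.
        futures = {name: pool.submit(fn, values) for name, fn in functions.items()}
        return {name: future.result() for name, future in futures.items()}
```

In both functions the tasks share no internal state, which is what allows them to run concurrently without coordinating with one another.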
One of the difficulties frequently encountered in implementing a parallel application is the need to coordinate the processing of the individual parallel tasks. Currently, the most common approaches used to address this issue are broadcasting coordination requests from every task to all the other tasks and creating a central component that dictates and controls the processing and communication between the tasks. While the first approach is suitable for a certain class of parallel applications, it frequently leads to increased complexity in the design of the parallel application and also introduces increased communication overhead, which can impede the application's scalability.
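The communication overhead of the broadcast approach can be made concrete with a simple count. The sketch below assumes a simplified model that is not taken from any particular system: in one coordination round, every task sends one request to each of the other tasks under the broadcast approach, whereas under the central approach each task sends one message to the controller and receives one reply. The function names are hypothetical.

```python
def broadcast_messages(num_tasks):
    """Messages in one round when every task contacts every other task."""
    # Each of the n tasks sends to the remaining n - 1 tasks.
    return num_tasks * (num_tasks - 1)


def central_controller_messages(num_tasks):
    """Messages in one round when tasks talk only to a central controller."""
    # One request to the controller and one reply back, per task.
    return 2 * num_tasks


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, broadcast_messages(n), central_controller_messages(n))
```

Under this model the broadcast count grows quadratically with the number of tasks while the controller count grows linearly, which is one way to see why broadcasting can impede scalability.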
The second approach is frequently used in transactional and database systems (e.g., the two-phase commit protocol). It, too, restricts flexibility in the design of the individual parallel tasks, in the sense that the processing in each task is dictated by a statically defined protocol (such as a fixed number of steps or phases) implemented in a controller. This static protocol, which does not change from one application to another, lacks the application-specific semantics that are usually required by complex applications such as ETL (Extract, Transform, Load) applications used in a data warehousing environment.
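A statically defined protocol of this kind can be sketched as follows. This is a minimal, in-process illustration of two-phase commit, not an implementation from any particular database system; the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names, and failure handling, timeouts, and logging are omitted.

```python
class Participant:
    """A task whose processing is dictated by the controller's fixed phases."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical flag standing in for local checks
        self.state = "initial"

    def prepare(self):
        # Phase 1: vote on whether this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Controller: always the same two phases, regardless of the application."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

The controller fixes both the number of phases and what each phase means, so a participant has no way to express an application-specific step such as an intermediate checkpoint between extract and transform stages.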
Conventional parallel execution of data tasks employs synchronization functions, such as WAIT, POST, LOCK, UNLOCK, GROUP, BARRIER, and the like, which are generally platform-dependent. These platform-dependent synchronization functions require multiple implementations in order for applications to run in a heterogeneous environment. Maintaining multiple implementations of the synchronization functions is inefficient and consumes valuable computing resources.
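As an illustration of what one such synchronization function does, the following minimal sketch uses the barrier from Python's standard `threading` module. It shows only the general BARRIER behavior (no task proceeds to its second phase until all tasks have finished the first) and is not tied to any platform-specific implementation; the function name `run_with_barrier` and the phase labels are hypothetical.

```python
import threading


def run_with_barrier(num_tasks):
    """Run tasks in two phases separated by a barrier; return the event order."""
    barrier = threading.Barrier(num_tasks)
    events = []
    events_lock = threading.Lock()  # protects the shared event list

    def task(task_id):
        with events_lock:
            events.append(("phase1", task_id))
        # Block here until every task has finished phase 1.
        barrier.wait()
        with events_lock:
            events.append(("phase2", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Because every task records its phase 1 event before reaching the barrier, all phase 1 events appear in the returned list before any phase 2 event, although the order of tasks within a phase may vary from run to run.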