As technology advances, ever larger volumes of data are collected and stored for later extraction and processing. As data volumes grow, processing that data into useful information becomes increasingly difficult. Large-scale data processing in conventional parallel and distributed environments has been developed to distribute data and analysis among multiple storage devices and processors, providing aggregated storage and increased processing power. However, these systems suffer from a variety of drawbacks that prevent efficient processing.
Conventional architectures and computing environments for a distributed system include, for example, servers configured to collect data and store it locally in one or more databases. Processing the stored data may rely on, for example, a scheduler to manage and access the data stored in each database. As the number of servers collecting and storing data grows, coordinating communication between them becomes increasingly difficult, so various replication and synchronization processes must be performed. Once the data has been collected in a central location (e.g., a common data source, such as a database), it must be stored in a manner suitable for processing. Conventional processing techniques include centralized schedulers that distribute data loads among processing devices; "sticky" distribution, in which data is partitioned according to certain attributes and routed based on those data characteristics; master/slave architectures, in which data is distributed from a master to the slaves; and systems in which processing devices register with the system and are allocated data according to a work-load distribution policy.
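The "sticky" distribution technique described above can be sketched as follows. This is a minimal, hypothetical illustration, not an implementation from the source: it assumes records are dictionaries, that a single attribute (here, a made-up `customer_id` field) serves as the routing key, and that hashing that attribute modulo the worker count is an acceptable assignment rule.

```python
import hashlib

def sticky_worker(record: dict, key_attr: str, num_workers: int) -> int:
    """Deterministically map a record to a worker index by hashing one attribute.

    Because the mapping depends only on the attribute's value, all records
    sharing that value are routed to the same ("sticky") processing device.
    """
    value = str(record[key_attr]).encode("utf-8")
    digest = hashlib.md5(value).hexdigest()  # stable hash across runs/processes
    return int(digest, 16) % num_workers

# Hypothetical records keyed by a customer_id attribute.
records = [
    {"customer_id": "A17", "amount": 12.50},
    {"customer_id": "B42", "amount": 3.00},
    {"customer_id": "A17", "amount": 7.75},
]

assignments = [sticky_worker(r, "customer_id", 4) for r in records]

# Records with the same customer_id always land on the same worker.
assert assignments[0] == assignments[2]
```

The trade-off this sketch makes visible is the one the prior-art systems face: stickiness keeps related records together without a central coordinator, but skewed attribute values (one very common `customer_id`) translate directly into skewed worker load.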