After more than two decades of electronic data automation and an improved ability to capture data from a variety of communication channels and media, even small enterprises find that they regularly process terabytes of data. Moreover, mining, analysis, and processing of that data have become extremely complex. The average consumer expects electronic transactions to occur flawlessly and with near-instant speed. An enterprise that cannot meet the expectations of the consumer is quickly out of business in today's highly competitive environment.
Updating, mining, analyzing, reporting, and accessing enterprise information can still be problematic because of the sheer volume of this information and because the information is often dispersed over a variety of different file systems, databases, and applications.
In response, the industry has recently embraced a data platform referred to as Apache Hadoop™ (Hadoop™). Hadoop™ is an open-source software architecture that supports data-intensive distributed applications. It enables applications to work with thousands of network nodes and petabytes of data (one petabyte being 1000 terabytes). Hadoop™ provides interoperability between disparate file systems, fault tolerance, and High Availability (HA) for data processing. The architecture is modular and expandable, with the whole database development community supporting, enhancing, and dynamically growing the platform.
Big data analysis on the Hadoop™ Distributed File System (DFS) is usually divided into a number of worker tasks, which are executed in a distributed fashion on the nodes of the cluster.
In Hadoop™ MapReduce™, these worker tasks are map tasks and reduce tasks. There are typically a large number of worker tasks, far more than the cluster can execute in parallel. In a typical MapReduce™ workload, the number of map tasks may be orders of magnitude larger than the number of nodes; the number of reduce tasks is usually lower, but is still typically equal to the number of nodes or a small multiple of it. Each worker task is responsible for processing a part of the job's data: each map task processes one portion of the split input data, and each reduce task processes one partition of the intermediate data.
When worker tasks do not all take the same amount of time to execute, the scenario is referred to as skewed.
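The relationship between the number of map tasks and the parallel capacity of the cluster can be illustrated with a minimal sketch. The function name, the split size, the node count, and the number of task slots per node are hypothetical values chosen for illustration only; they are not taken from any particular Hadoop™ configuration.

```python
def plan_map_tasks(input_bytes, split_bytes, nodes, slots_per_node):
    """Estimate how many map tasks a job needs and how many 'waves'
    of execution the cluster must run to finish them.

    Assumes one map task per input split and a fixed number of
    task slots per node (both simplifying assumptions).
    """
    if min(input_bytes, split_bytes, nodes, slots_per_node) <= 0:
        raise ValueError("all arguments must be positive")
    # Integer ceiling division: -(-a // b) == ceil(a / b)
    num_tasks = -(-input_bytes // split_bytes)
    parallel_slots = nodes * slots_per_node
    waves = -(-num_tasks // parallel_slots)
    return num_tasks, waves


# Hypothetical job: 10 TiB of input, 128 MiB splits, 100 nodes, 8 slots each.
tasks, waves = plan_map_tasks(10 * 1024**4, 128 * 1024**2, 100, 8)
print(tasks, waves)  # 81920 103
```

Under these assumed figures the job has roughly 800 times as many map tasks as nodes, so the tasks must be run in successive waves rather than all at once.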
There are a number of reasons why skew can occur between worker tasks. Data skew means that not all tasks process the same amount of data. Those tasks that process more input data will likely take longer to execute. Data skew can occur due to the properties of the input data, but also, for example, due to a poor choice or use of partitioning functions. Processing skew occurs when not all records in the data take the same amount of time to process. So, even if the tasks process the same amount of data and the same number of records, there can still be a large discrepancy in their execution times. Such discrepancies may also arise because the computing resources in the cluster are actually heterogeneous, with some nodes having faster processors, more network bandwidth, more memory, or faster disks than others. These nodes are able to process data faster than the others and therefore run the same tasks faster.
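Data skew caused by a partitioning function can be illustrated with a minimal sketch. The sketch assumes a simple hash-modulo partitioner (CRC32 is used here only as a deterministic stand-in hash) and a hypothetical key distribution in which a single key accounts for most of the records; neither is taken from a specific Hadoop™ deployment.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Assign a key to a reduce task by hashing it (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def records_per_reducer(keys, num_reducers):
    """Count how many records each reduce task would receive."""
    counts = Counter(partition(k, num_reducers) for k in keys)
    return [counts.get(i, 0) for i in range(num_reducers)]


# Hypothetical input: one "hot" key dominates the intermediate data.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]
loads = records_per_reducer(keys, 4)

# Ratio of the busiest reducer's load to the average load.
skew_ratio = max(loads) / (sum(loads) / len(loads))
print(loads, round(skew_ratio, 2))
```

Because every record with the same key is routed to the same reduce task, the task that receives the hot key processes at least 9,000 of the 10,000 records regardless of how well the hash spreads the remaining keys.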
The end result of these factors is that there can be a large variation between the execution times of the worker tasks. When this occurs, some worker tasks may hold up the execution of the entire job, either because other worker tasks cannot proceed until they are finished or because they are simply the last worker tasks in the job.
Applications experience performance degradation due to skew on Hadoop™ DFS. Resources are not fully utilized, and big data analytics are delayed. This in turn impacts data warehouse system performance for unified data architectures, because tasks on other systems are delayed while waiting for data synchronization from Hadoop™.
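The effect of a slow worker task on overall job completion can be illustrated with a minimal simulation. The sketch assumes a simple greedy policy in which each task, in order, is given to whichever execution slot becomes free first; this policy, the task durations, and the slot count are hypothetical and do not represent the actual Hadoop™ scheduler.

```python
import heapq


def job_completion_time(task_durations, num_slots):
    """Simulate greedy assignment of tasks to the earliest-free slot
    and return the time at which the last task finishes."""
    slots = [0.0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for duration in task_durations:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + duration)
    return max(slots)


# Two hypothetical jobs with the same total work (80 time units).
balanced = [10.0] * 8
skewed = [5.0] * 7 + [45.0]  # one task is far slower than the rest

print(job_completion_time(balanced, 4))  # 20.0
print(job_completion_time(skewed, 4))    # 50.0
```

Although both jobs contain the same total amount of work, the skewed job finishes at 50 time units instead of 20, and three of the four slots sit idle while the single long task completes.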