Timely and cost-effective analytics over large datasets is now a crucial requirement in many businesses and scientific disciplines. Web search engines and social networks log every user action. The logs are often analyzed at frequent intervals to detect fraudulent activity, to find opportunities for personalized advertising, and to improve website structure and content. Scientific fields, such as biology and economics, have fast growing computational subareas.
The MapReduce framework—which includes a programming model and a scalable and fault-tolerant run-time system—is now a popular platform for data analytics. HADOOP®software, available from Apache Software Foundation, is an open source implementation of MapReduce used in production deployments. HADOOP® software is used for applications such as log file analysis, Web indexing, report generation, machine learning research, financial analysis, scientific simulation, and bioinformatics.
Cloud platforms make MapReduce an attractive proposition for small organizations that need to process large datasets, but they lack the computing and human resources of large companies or other organizations. Elastic MapReduce, for example, is a hosted service on the Amazon Web Services cloud platform where a user can instantly provision a HADOOP® cluster running on any number of Elastic Compute Cloud (EC2) nodes. The cluster can be used to run data-intensive MapReduce jobs, and then terminated after use. The user must pay for the nodes provisioned to the cluster for the duration of use.
It is desired to provide improved techniques for analyzing and managing large datasets. Further, it is desired to improve the performance and management of MapReduce jobs and distributed data processing systems.