Technical Field
The present teaching relates to methods and systems for data processing. Particularly, the present teaching is directed to methods, systems, and programming to maximize throughput of processing jobs.
Discussion of Technical Background
The advancement of the Internet has made it possible to make a tremendous amount of information accessible to users located anywhere in the world. It is estimated that hundreds of exabytes of information are stored in digital form. Content providers, such as businesses, government agencies, and individuals, generate large amounts of both structured and unstructured data which, in order to be accessible online, must be processed, analyzed, and stored. With the explosion of information, new issues have arisen. Much effort has been put into organizing the vast amount of information to facilitate the search for information in a more effective and systematic manner. However, due to the large volume of content that is presently available and is continually generated, traditional data computing techniques are inadequate for processing large volumes of data that may be terabytes or petabytes in size.
A number of large scale data processing and analysis tools have been developed to process large volumes of information. Many of these tools make use of cloud computing, which involves a number of computers connected through a real-time communication network, such as the Internet. Cloud computing allows computational jobs to be distributed over a network and allows a program to be concurrently run on many connected computers. The network resources may be shared by multiple users or may be dynamically re-allocated to accommodate network demand. As such, cloud computing solutions are often designed to maximize the computing power of the network and the efficiency of the network devices. This distributed processing configuration allows an entity to avoid upfront infrastructure costs associated with computing equipment.
Apache Hadoop is a Java-based programming framework and one of the most popular large scale data processing and analysis tools presently available. The Hadoop Distributed File System (HDFS) is a distributed file system designed to hold terabytes or even petabytes of data and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability against failure and high availability to highly parallel applications. Google's MapReduce is a programming model for processing large scale data sets that makes use of a parallel, distributed algorithm. Hadoop provides an open source implementation of the MapReduce model together with a distributed file system.
Existing large scale data processing and analysis tools offer users scalable, reliable services that are easy to use. For example, Yahoo! offers its users a large scale partner feed processing system that interfaces with various hosted services for enrichment of partner feeds. These hosted services typically provision a limited quota of their resources to a new user during on-boarding, and the number of machines involved depends on input size and cluster size. However, present technologies, including cloud services, may be overloaded by large-scale processing jobs. There is a need to adequately control and maximize throughput of network intensive processing jobs.
There is presently no solution that adequately increases throughput of processing jobs through concurrent utilization of multiple network resources. Within a large scale data processing platform, processing data from partner feeds relies heavily on resources provided by cloud based systems. Not only do cloud based systems provide necessary storage, but the systems may enrich processed data with, by way of example, geographic information, context analysis, or license management information. Processing tasks are provisioned among resources available on the network. In many cloud based systems, the allocation of resources to a particular user or job is based, in part, on peak usage. For example, in the case of Hadoop-based feed processing, peak usage is determined by input feed size. However, one drawback to provisioning for peak usage is the high cost associated with the necessary dedicated hardware. In addition, the sequential processing of data by multiple services in existing data processing systems is inherently limited in that only one service at a time may be utilized. For example, while running a processing stage for enriching data, it is not possible to utilize services that upload data to an ingestion buffer or content grid. In addition, a large input size will cause existing data processing systems to generate a large number of map tasks, which can overload the entire system.
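The relationship between input size and the number of map tasks can be illustrated with a minimal sketch. It assumes that roughly one map task is created per input split and that the split size is 128 MiB; both are assumptions made for illustration, and the actual count depends on the input format, the number of files, and the cluster configuration. The function name `estimated_map_tasks` is hypothetical.

```python
# Assumed split size (128 MiB); real deployments may configure a different value.
ASSUMED_SPLIT_BYTES = 128 * 1024 ** 2


def estimated_map_tasks(input_bytes: int, split_bytes: int = ASSUMED_SPLIT_BYTES) -> int:
    """Approximate the map task count as one task per input split (ceiling division)."""
    if input_bytes < 0 or split_bytes <= 0:
        raise ValueError("input_bytes must be >= 0 and split_bytes must be > 0")
    return -(-input_bytes // split_bytes)


GIB = 1024 ** 3
TIB = 1024 ** 4

small_feed_tasks = estimated_map_tasks(10 * GIB)  # a 10 GiB feed
large_feed_tasks = estimated_map_tasks(1 * TIB)   # a 1 TiB feed

print(small_feed_tasks, large_feed_tasks)
```

Under these assumptions, the task count grows linearly with input size, so if each map task issues requests to a hosted service, the request load on that service grows with the feed size as well.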
Although cloud based systems allow for large scale data processing to be distributed across multiple machines in a network, cloud services are largely underutilized when provisioned for daily peak capacity. By way of example, a client may overload cloud services beyond the allocated quota, resulting in underperformance or outage of the services. Overload poses a systemic risk to cloud services, and there has been significant investment in overload protection for these services. Although resource intensive processing jobs pose a serious risk of system overload, processing systems may remain largely unused the majority of the time they are online. For example, a batch processing system may be utilized for a period of less than three hours per day at an image upload rate of 360 uploads per second, yet the cloud resources provisioned to handle this request rate could remain unused for the rest of the day. One option to reduce the amount of network resources required would involve limiting the rate at which upload requests are made such that the system is utilized for a longer time period. However, no existing solution achieves this goal without adjusting or otherwise relying on input feed size.
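The arithmetic behind this example can be made concrete. The sketch below takes the figures from the text (360 uploads per second, with three hours used as the upper bound of the busy period) and computes the steady rate that would move the same number of uploads if the work were spread over a full day. The 24-hour window and the function name `sustained_rate` are assumptions chosen for illustration.

```python
def sustained_rate(peak_rate_per_s: float, busy_hours: float, window_hours: float = 24.0) -> float:
    """Return the constant rate that handles the same total uploads over window_hours."""
    total_uploads = peak_rate_per_s * busy_hours * 3600
    return total_uploads / (window_hours * 3600)


peak_rate = 360.0   # uploads per second, from the example in the text
busy_hours = 3.0    # upper bound of the "less than three hours" busy period

total_uploads = peak_rate * busy_hours * 3600      # uploads handled during the busy period
flat_rate = sustained_rate(peak_rate, busy_hours)  # rate if spread over 24 hours
utilization = busy_hours / 24.0                    # fraction of the day the peak capacity is in use

print(f"total uploads: {total_uploads:,.0f}")
print(f"flat rate over 24 h: {flat_rate:.1f} uploads/s")
print(f"peak-capacity utilization: {utilization:.1%}")
```

With these figures, capacity provisioned for 360 uploads per second is in use for at most one eighth of the day, whereas a rate-limited job running around the clock would need capacity for about 45 uploads per second.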
The use of large scale data processing platforms to process network intensive jobs poses a number of challenges. For example, these solutions may over-provision the cloud service due to scalability issues. By way of further example, a drastic increase in data input size could overload the service, resulting in slowed performance or outages. In addition, processing network intensive jobs may result in considerable waste of cluster capacity, as the rate of processing will be limited by the network service. These factors, among others, affect the overall throughput and the number of records processed per second by the platform. As many large scale data processing platforms are not optimized for such processing, a solution is needed that would allow existing platforms to perform network intensive processing jobs.
In addition, a solution is needed that would control the rate at which requests are made to particular processing services, while maximizing overall throughput. Existing solutions for controlling request rates to achieve high throughput include establishing a set number of reduce nodes and performing all network operations in a single reduce phase. However, existing reduce based solutions suffer from several drawbacks. For example, in the event that a web service does not accept batch requests, reduce based solutions must allocate a large number of reduce nodes, each of which must wait until all map tasks have been completed, resulting in underutilization of grid nodes. Further, reduce based solutions require additional data transfer from map nodes to reduce nodes. Existing solutions may also require overload protection services and must implement error handling, such as exponential back-off in the map process, resulting in inefficiency and costly throughput.
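The exponential back-off error handling mentioned above can be sketched as follows. This is a generic illustration of the technique, not the error handling of any particular platform. The names `OverloadedError`, `backoff_delays`, and `call_with_backoff`, as well as the default parameter values, are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class OverloadedError(Exception):
    """Hypothetical error raised when a hosted service rejects a request due to overload."""


def backoff_delays(base_s: float, factor: float, max_retries: int, cap_s: float) -> List[float]:
    """Delays before each retry: base, base*factor, base*factor^2, ..., capped at cap_s."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(max_retries)]


def call_with_backoff(
    request: Callable[[], T],
    base_s: float = 0.5,
    factor: float = 2.0,
    max_retries: int = 5,
    cap_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(); after each overload error, wait an increasing delay and retry."""
    for delay in backoff_delays(base_s, factor, max_retries, cap_s):
        try:
            return request()
        except OverloadedError:
            sleep(delay)
    # Final attempt after all retries; any error now propagates to the caller.
    return request()


# Usage: a stand-in service that rejects the first two calls, then succeeds.
attempts = {"count": 0}
recorded_sleeps: List[float] = []


def flaky_upload() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise OverloadedError("service busy")
    return "uploaded"


# recorded_sleeps.append replaces time.sleep so the example runs without waiting.
result = call_with_backoff(flaky_upload, sleep=recorded_sleeps.append)
print(result, recorded_sleeps)
```

The sketch also shows the inefficiency noted above: while a task waits out a back-off delay, the node running it performs no useful work, and the waiting time doubles with each consecutive rejection.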