In big data analytics, the amount of data to be processed is enormous, for example reaching petabyte (PB), exabyte (EB), or zettabyte (ZB) levels. To avoid huge data traffic, "move compute closer to data" has therefore become an important optimization rule in big data analytics.
Traditional big data analytics systems adopt a node-cluster system configuration. FIG. 1 shows the system configuration of a traditional big data analytics system. The configuration takes the form of a cluster 100 comprising a plurality of nodes 110-1, 110-2, 110-3 and 110-4 (collectively called nodes 110). Each node 110 in cluster 100 may be implemented as one physical or virtual server, and the big data analytics system stores data on commodity data storage (e.g., hard disks) built into the servers. Hence, each server (i.e., node 110) in cluster 100 acts not only as a computing node C but also as a storage node S (the set of computing nodes C constitutes a computing layer 120, while the set of storage nodes S constitutes a storage layer 130). Thereby, the system configuration shown in FIG. 1 fully exploits the "move compute closer to data" optimization rule.
In particular, this system configuration allows the big data analytics system to optimize its data placement and computing task scheduling strategies so as to maximize the chance that a computing task is executed on the local node where its input data resides (i.e., data co-locality). In other words, computing tasks are moved as close as possible to their corresponding data, thereby minimizing the amount of data transmitted over the network.
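The locality-aware scheduling described above can be illustrated with a minimal sketch (hypothetical and not taken from the source): each task's input block has replicas on some set of nodes, and the scheduler first tries to place each task on a free node that already holds a replica, falling back to any free node only when no local placement is possible.

```python
def schedule(tasks, free_nodes):
    """Assign tasks to nodes, preferring data co-locality.

    tasks: dict mapping task_id -> set of nodes holding that task's data replicas.
    free_nodes: list of nodes that each have one free compute slot.
    Returns: dict mapping task_id -> (node, "local" or "remote").
    (Illustrative sketch only; names and structure are assumptions.)
    """
    available = list(free_nodes)
    assignment = {}
    # First pass: place each task on a node that holds its data (local read,
    # no network transfer needed).
    for task, replica_nodes in tasks.items():
        for node in available:
            if node in replica_nodes:
                assignment[task] = (node, "local")
                available.remove(node)
                break
    # Second pass: remaining tasks go to any free node; their input data
    # must then be moved over the network.
    for task in tasks:
        if task not in assignment and available:
            assignment[task] = (available.pop(0), "remote")
    return assignment
```

For example, with tasks `{"t1": {"n1", "n2"}, "t2": {"n3"}, "t3": {"n1"}}` and free nodes `["n1", "n3", "n4"]`, tasks t1 and t2 are placed locally on n1 and n3, while t3 must run remotely on n4 because its only replica node is already occupied.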
As application deployment evolves, especially in enterprise-grade deployments, the commodity data storage built into servers is, in more and more big data analytics systems, replaced by enterprise-grade external storage systems (e.g., EMC Isilon) offering enhanced performance and enriched data services (e.g., snapshots, backup, deduplication, access control, etc.). However, in big data analytics systems using such external storage, the computing layer and the storage layer are separated, so the above-mentioned "move compute closer to data" optimization rule is no longer straightforwardly applicable.