Field
The present disclosure relates to graph analytics. More specifically, this disclosure relates to a method and system for pruning a graph representation to facilitate efficient processing of graph data.
Related Art
The exponential growth of computing power has made it possible to extract information of interest, such as shopping preferences and/or recommendations, social media activities, medical referrals, and e-mail traffic patterns, using efficient data analysis. Such data analysis requirements have brought with them an increasing demand for efficient computation. As a result, equipment vendors race to build larger and faster computing devices with versatile capabilities, such as graph analysis, to calculate information of interest efficiently. However, the computing capability of a computing device cannot grow indefinitely. It is limited by physical space, power consumption, and design complexity, to name a few factors. Furthermore, computing devices with higher capability are usually more complex and expensive. More importantly, because an overly large and complex computing device often does not provide economy of scale, simply increasing the capability of a computing device may prove economically unviable.
One way to meet this challenge is to increase the efficiency of the data analysis tools used for extracting information of interest from a large data set. Hipergraph is a high-performance graph analytics engine that performs very fast queries on graph data. Graph data is data that can be readily represented by a graph, that is, a set of vertices together with a set of edges, each of which connects a pair of vertices. Hipergraph requires its input to be in a very specific format, but formatting many real-world graph datasets is non-trivial because the formatting operations exceed the typical memory and disk capacities of a single machine.
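For illustration only, the notion of graph data described above can be sketched in Python as an edge list grouped into an adjacency mapping. The sample edges and the helper names `build_adjacency` and `vertices_of` are hypothetical. The sketch does not reflect the input format required by Hipergraph, which is not specified here.

```python
from collections import defaultdict

# Hypothetical sample data: each pair (u, v) is a directed edge
# from vertex u to vertex v.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]


def build_adjacency(edge_list):
    """Group edges by source vertex into a dict of destination lists."""
    adjacency = defaultdict(list)
    for src, dst in edge_list:
        adjacency[src].append(dst)
    return dict(adjacency)


def vertices_of(edge_list):
    """Return the sorted set of vertices that appear in any edge."""
    return sorted({vertex for edge in edge_list for vertex in edge})


adjacency = build_adjacency(edges)
vertices = vertices_of(edges)
```

Even this small example holds every edge in memory at once, which is the kind of assumption that may fail for datasets exceeding the capacity of a single machine.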
In one approach, one can perform automated compilation and formatting of data using scripts and UNIX utilities. This approach works relatively well when the input and output files and the intermediate computations fit on a modern workstation. However, when the input graph dataset is on the order of several hundred gigabytes, one cannot even sort the data on a standard machine because of the time, disk space, and memory space required.