Many data analytics applications (e.g., recommendation engines) require probabilistic computations through specialized probabilistic data structures (e.g., bloom filtering, linear counting, Log Log algorithms, element cardinality, count-min algorithms) based on very large amounts of data. This typically requires fetching large quantities of data from one or more storage servers and transmitting the data to a computational node for processing. Currently there are no viable solutions to offload and execute probabilistic based computations on distributed file storage servers (e.g., NFS/CIFS servers). Some approaches cause a large increase in network usage, NAS-side CPU usage, memory usage, and NIC usage for transferring raw data to the node for processing. Other drawbacks include data fetching latency and the consumption of application server resources during computation.
Very large data sets are common in the web domain or the data analytics domain. Many models are used to process large scale data such as MapReduce. An example of such large scale data processing is the Hadoop ecosystem which is based on querying large data sets for analytics. For such large scale data, memory resources are often a limiting factor. Various algorithms have been explored which attempt a compromise between the amount of memory used and the precision required. Because analytics require an estimate of the result of a query, a variance in the expected values is tolerable and safe if based on the computation model used to calculate the value.