At present, given the exponential growth in data storage requirements and the drastic reduction in storage cost per gigabyte, aggressive optimization of data written to storage media is not taken seriously. This leads to extremely large, unmanaged datasets distributed across multiple systems, cloud-based systems, and various other locations. Querying these large datasets is laborious because it is not known which data resides where. Further, no mechanisms are available for creating a common index for retrieving data from extremely large datasets spread across various systems, or for handling such data efficiently.
Currently, data is spread across multiple interconnected servers. Various techniques are being developed to leverage the collective power of all the interconnected servers. The main problem is how to efficiently use data resources spread across servers as a single pool of resources for data processing applications, i.e., how to deal with extremely large datasets (for example, archived official datasets for a company, video surveillance data, or web-crawled data for a search engine) which may consist only of unstructured data and which expand continuously over time. The main problems associated with such data are as follows:
Lack of proper centralized yet distributed storage space.
Lack of computing services available for the given data size.
No proper access mechanism.
Data is unstructured.
Access rates are very low.
In view of the above drawbacks, it would be desirable to have a mechanism for using large datasets spread across systems in an efficient and fault-tolerant manner in real time.