Query statements can be formed to obtain data from distributed storage and distributed processing resources. The distributed storage may be a distributed database or a distributed file system. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
The core of Apache Hadoop® consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce®). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code to the nodes, and each node runs that code in parallel against the blocks it holds. This approach takes advantage of data locality (nodes manipulating the data that they have) to allow the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are connected via high-speed networking.
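The split, map, and reduce flow described above can be illustrated with a minimal single-process sketch. It is not Hadoop's API: the function names (split_into_blocks, map_block, shuffle, reduce_counts, word_count) and the word-count task are assumptions chosen for illustration, and the "nodes" are simulated by iterating over in-memory blocks.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks (stand-in for file blocks spread over nodes)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: runs against a single block and emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values by key across all map outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run the whole pipeline; each block is mapped independently of the others."""
    blocks = split_into_blocks(lines, block_size)
    mapped = [map_block(block) for block in blocks]  # one map call per block
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "big data big clusters",
        "data locality",
        "clusters of commodity hardware",
    ]
    print(word_count(sample))
```

Because each call to map_block touches only its own block, the map calls could in principle run on whichever node already stores that block, which is the data-locality property the paragraph above describes.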
The Hadoop ecosystem has a variety of access methods. Apache Hive® is a data warehouse infrastructure built on top of Hadoop for data summarization, query and analysis. Apache Spark® is an open-source cluster computing framework that allows user programs to load data into a cluster's memory and query it repeatedly. Solr® is an open-source enterprise search platform that enables full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling.
Each access method has a query language associated with it to specify what data should be returned by the server and what operations should be done with the data. These data access query languages typically behave like set theory operations and are neither purely object-oriented nor purely procedural. As a result, they are not easily broken down into components that can be re-engineered.
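To make the difference between access methods concrete, the following sketch expresses one logical request in three syntaxes: a HiveQL statement, a Spark DataFrame expression, and Solr request parameters. The table name (employees), the field names (name, dept), the value (sales), and the helper functions are all hypothetical and exist only for illustration. The helpers build strings and do not contact any server, and the naive string interpolation would not be safe for untrusted input.

```python
from urllib.parse import urlencode


def hive_query(table, column, field, value):
    """Build a HiveQL SELECT statement as plain text (illustrative only)."""
    return f"SELECT {column} FROM {table} WHERE {field} = '{value}'"


def spark_expression(column, field, value):
    """Build the text of a Spark DataFrame call chain; 'df' is an assumed DataFrame name."""
    return f"df.filter(\"{field} = '{value}'\").select(\"{column}\")"


def solr_params(column, field, value):
    """Build a URL-encoded Solr parameter string using the q (query) and fl (field list) parameters."""
    return urlencode({"q": f"{field}:{value}", "fl": column})


if __name__ == "__main__":
    # The same logical request: return "name" for records whose "dept" is "sales".
    print(hive_query("employees", "name", "dept", "sales"))
    print(spark_expression("name", "dept", "sales"))
    print(solr_params("name", "dept", "sales"))
```

The three results share the same intent but no common grammar, which is one reason a parser written for one access method does not carry over to another.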
Accordingly, it would be desirable to identify techniques to parse queries associated with different access methods. Further, it would be desirable to provide techniques for reconstructing queries associated with different access methods to enforce a policy.