Embodiments of the present invention relate to the field of databases and, more specifically, to searching/querying a database.
With the constant development and improvement of database technology, the requirements placed on database search/query techniques have become increasingly demanding. In practice, the MapReduce framework has been widely recognized as an efficient approach to big data analytics in a large cluster system. MapReduce application development requires developers to code the application logic into the simple interfaces exposed by MapReduce (i.e. the map and reduce functions). Although such map and reduce interfaces offer extremely high programming flexibility, they are often challenging to implement, optimize and maintain, especially for the non-trivial and complex data analysis tasks found in production environments. As is evident from the success of relational database technology (e.g. SQL), program development and optimization would be more efficient and effective if data processing programs were written in a declarative query language that hides the implementation details and is amenable to optimization. In this case, users can directly write declarative queries, which are then translated into a sequence of MapReduce programs (jobs) to be executed by the MapReduce platform (e.g. Hadoop).
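By way of illustration only, the contrast between the map and reduce interfaces and a declarative query can be sketched as follows. The sketch is a single-process simulation written in Python, not code for Hadoop or any other MapReduce platform; the names `map_fn`, `reduce_fn`, `run_mapreduce` and `EMPLOYEES`, as well as the sample records, are assumptions introduced for this example.

```python
from collections import defaultdict

# Declarative form of the same task, for comparison:
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;

def map_fn(record):
    # Map interface: emit one (key, value) pair per input record.
    yield record["dept"], 1

def reduce_fn(key, values):
    # Reduce interface: combine all values that share one key.
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Shuffle step: group the intermediate values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce step: call the reducer once per distinct key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Hypothetical sample data.
EMPLOYEES = [
    {"name": "a", "dept": "eng"},
    {"name": "b", "dept": "ops"},
    {"name": "c", "dept": "eng"},
]

counts = run_mapreduce(EMPLOYEES, map_fn, reduce_fn)
```

Even for this simple grouping, the developer supplies two functions and relies on a grouping step in between, whereas the declarative form states only the desired result.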
In recent times, several declarative languages have been proposed and integrated into MapReduce-based systems, such as Pig Latin/Pig and HiveQL/Hive. These languages have significantly improved the productivity of MapReduce application developers. However, their effect and impact are limited by two major issues. On one hand, to date only a few specialized optimization techniques have been exploited during the query translation procedure. For example, existing language translators take the local one-operation-to-one-job approach, which simply replaces each operation in the query graph with a prepared MapReduce program. As a result, in practice it is observed that automatically translated MapReduce programs for many queries are often extremely inefficient compared to hand-optimized programs written by experienced programmers. On the other hand, existing MapReduce languages provide a limited syntax for operating on data collections, mainly in the form of well-known relational joins and group-bys. To compensate, these languages allow users to plug custom MapReduce scripts into their queries. This nullifies the benefits of using a declarative language and may result in sub-optimal, error-prone and hard-to-maintain code.
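The one-operation-to-one-job approach described above can be pictured with the following minimal sketch. It is not the translator of Pig, Hive or any other system; the catalogue `PREPARED_JOBS`, the function `translate_one_to_one` and the sample query plan are hypothetical names and data introduced for illustration.

```python
# Hypothetical catalogue of prepared job templates, one per operator type.
PREPARED_JOBS = {
    "join": "JoinJob",
    "group_by": "GroupByJob",
    "filter": "FilterJob",
}

def translate_one_to_one(query_ops):
    """Replace each operation in the query plan with its prepared job."""
    return [f"{PREPARED_JOBS[op]}({arg})" for op, arg in query_ops]

# Hypothetical query plan: a join followed by a group-by.
QUERY = [
    ("join", "orders.cust_id = customers.id"),
    ("group_by", "customers.region"),
]

jobs = translate_one_to_one(QUERY)
```

Under this scheme the number of jobs always equals the number of operations, because each operation is translated in isolation and no consideration is given to whether adjacent operations could share a job.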