Data warehousing applications process terabytes of data, and most of their queries involve complex joins. To handle data at this scale, Google introduced the MapReduce programming model and a distributed storage architecture called “Bigtable”. Apache released open-source implementations of both ideas (Hadoop and HBase, respectively). Hive, a data warehousing application, was also started as a subproject of Hadoop.
The existing technology mostly addresses traditional relational databases, which follow a row-store layout. Cloud databases that are relational in nature, however, follow a column-store layout. A few organizations, such as Amazon, provide APIs to interact with cloud data. Most of these APIs are specific to the provider's own cloud databases and do not offer a generic interface. For example, Hadoop provides its own syntax, which differs from the syntax used with Amazon EC2. Moreover, these APIs mostly focus on inserting data rather than providing an SQL-like interface for retrieving it.
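The row-store and column-store distinction drawn above can be illustrated with a minimal sketch. The table, its field names and its values are hypothetical and chosen only for illustration; real column stores add compression, indexing and on-disk layouts that are not modelled here.

```python
# Hypothetical sample table; names and values are illustrative only.
rows = [
    {"id": 1, "region": "east", "amount": 120},
    {"id": 2, "region": "west", "amount": 75},
    {"id": 3, "region": "east", "amount": 40},
]


def to_column_store(records):
    """Pivot a list of row dictionaries into a dictionary of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_column_store(rows)

# Row store: every whole record is visited in order to read one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column store: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
```

Both totals are 235. The difference lies in how much data is touched: an aggregate over a single attribute reads one list in the column layout, whereas the row layout walks every record.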
As cloud computing is a new area, several related works exist, most of them at an initial stage. Products, tools and software related to some extent to this invention include Hive, Pig and JAQL.
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, together with a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query the data. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language. Hive does not exploit the advantages of column-oriented data stores and does not have a cost-based query optimizer.
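The idea of plugging a custom mapper and reducer into a map/reduce job can be sketched in a few lines. This is an in-memory illustration only, not Hive's or Hadoop's actual API; `run_map_reduce`, `page_mapper`, `count_reducer` and the "user page" log format are all hypothetical names and assumptions.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def page_mapper(record):
    """Custom mapper: emit (page, 1) for each 'user page' log line (assumed format)."""
    _user, page = record.split()
    yield page, 1


def count_reducer(_key, values):
    """Custom reducer: add up the counts emitted for one key."""
    return sum(values)


log_lines = ["alice /home", "bob /home", "alice /cart"]
page_views = run_map_reduce(log_lines, page_mapper, count_reducer)
```

Here `page_views` is `{"/home": 2, "/cart": 1}`, which corresponds roughly to an SQL-style `SELECT page, COUNT(*) ... GROUP BY page`. A real job would distribute the map and reduce phases across machines rather than run them in one process.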
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: ease of programming, optimization opportunities and extensibility.
Ease of programming: It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.
JAQL is a query language for JavaScript Object Notation (JSON). Although JAQL has been designed specifically for JSON, its designers have tried to borrow some of the best features of SQL, XQuery, LISP and Pig Latin. JAQL is a functional query language that provides users with a simple, declarative syntax to filter, join and group JSON data. JAQL also allows user-defined functions to be written and used in expressions. Its high-level design objectives are semi-structured analytics, parallelism and extensibility.
Semi-structured analytics: easy manipulation and analysis of JSON data.
Parallelism: JAQL queries that process large amounts of data must be able to take advantage of scaled-out architectures.
Extensibility: users must be able to easily extend JAQL.
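The filter, join and group operations that JAQL expresses declaratively over JSON data can be sketched as follows. This is plain Python over the standard `json` module, not JAQL syntax; the two JSON documents and their field names are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical JSON inputs; field names are illustrative only.
orders_json = '[{"user": 1, "total": 30}, {"user": 2, "total": 5}, {"user": 1, "total": 12}]'
users_json = '[{"id": 1, "name": "ann"}, {"id": 2, "name": "raj"}]'

orders = json.loads(orders_json)
users = json.loads(users_json)

# Filter: keep orders with a total of at least 10.
large_orders = [order for order in orders if order["total"] >= 10]

# Join: attach the user name to each remaining order via the user id.
names_by_id = {user["id"]: user["name"] for user in users}
joined = [
    {"name": names_by_id[order["user"]], "total": order["total"]}
    for order in large_orders
]

# Group: sum the order totals per user name.
totals_by_name = defaultdict(int)
for row in joined:
    totals_by_name[row["name"]] += row["total"]

result = dict(totals_by_name)
```

With the sample data, `result` is `{"ann": 42}`: the order of 5 is filtered out, and the two remaining orders both belong to the same user.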
The main limitation of the existing technology is that there is no open standard for cloud data interfaces. If the cloud infrastructure uses differing libraries for interaction, code needs to be rewritten for each library separately. Hosting an existing enterprise application on a cloud therefore requires going through a complete software development life cycle again. Similar challenges arise when migrating data from one cloud to another. This task is not only time-consuming and costly, but it also introduces new bugs and problems, as the developer may not be well equipped to write against the APIs of each cloud that the user needs to access.
Thus, there is a need to overcome the problems of the existing technology. Therefore, the present inventors have developed computer-implemented methods, systems and computer-readable media for providing a query layer for cloud databases. The query layer uses one common implementation interface, a modified version of Structured Query Language (SQL), to interact with cloud databases in a platform-independent manner, thus making the interaction more generic in nature. It also builds an efficient cost-based optimizer that reduces the number of selection and join operations over column-store databases.
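The notion of one common interface over differing cloud stores can be illustrated with a small adapter sketch. It is an assumption-laden illustration, not the inventors' implementation: `CloudBackend`, `RowBackend`, `ColumnBackend` and `select` are hypothetical names, the data is invented, only a single equality predicate is supported, and the cost-based optimizer is not modelled.

```python
from abc import ABC, abstractmethod


class CloudBackend(ABC):
    """Hypothetical adapter interface; each cloud store supplies its own subclass."""

    @abstractmethod
    def read_column(self, table, column):
        """Return all values of one column, in row order."""


class RowBackend(CloudBackend):
    """Adapter over a store that keeps each table as a list of row dictionaries."""

    def __init__(self, tables):
        self.tables = tables

    def read_column(self, table, column):
        return [row[column] for row in self.tables[table]]


class ColumnBackend(CloudBackend):
    """Adapter over a store that keeps each table as a dictionary of column lists."""

    def __init__(self, tables):
        self.tables = tables

    def read_column(self, table, column):
        return list(self.tables[table][column])


def select(backend, table, columns, where_column, where_value):
    """Evaluate SELECT columns FROM table WHERE where_column = where_value."""
    keys = backend.read_column(table, where_column)
    matches = [i for i, key in enumerate(keys) if key == where_value]
    data = {name: backend.read_column(table, name) for name in columns}
    return [tuple(data[name][i] for name in columns) for i in matches]


row_store = RowBackend({"sales": [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 75},
    {"region": "east", "amount": 40},
]})
column_store = ColumnBackend({"sales": {
    "region": ["east", "west", "east"],
    "amount": [120, 75, 40],
}})

east_from_rows = select(row_store, "sales", ["amount"], "region", "east")
east_from_columns = select(column_store, "sales", ["amount"], "region", "east")
```

Both calls return `[(120,), (40,)]`. The calling code is identical for the two stores; only the adapter differs, which is the property a platform-independent query layer aims for.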