Embodiments of the invention relate generally to joining data across computer systems with disparate file systems, and more specifically, to joining data across a parallel database and a distributed processing system.
Parallel databases have long been used by enterprises to manage and analyze their important data. In recent years, with the advent of the big data movement, Hadoop and related systems have increasingly been used for big data analytics in distributed clusters. In particular, the Hadoop Distributed File System (HDFS) serves as the core storage system on which other distributed processing systems, such as MapReduce, Spark, Impala and Giraph, access and operate on large volumes of data.
In general, parallel databases and Hadoop are two very different processing environments. First, parallel databases excel at SQL processing, with decades of research and development in query optimization, whereas big data environments excel at scalable and more flexible data processing but do little query optimization. In addition, while parallel databases use high-end or specialized hardware, Hadoop clusters usually consist of commodity hardware. Finally, parallel databases store and process critical structured data, such as transactions, whereas Hadoop clusters are more suitable for semi-structured log data or unstructured text data.
Although different, the two environments are highly complementary. In fact, there has recently been interest in combining data across both environments to create more business value for enterprises. One example is combining transaction data from a parallel database with user click log data from Hadoop to correlate customers' online behavior with sales data for retailers.