This invention relates generally to massively parallel processing (MPP) data storage systems and methods for big data applications, and more particularly to new and improved MPP system architectures comprising large clusters of commodity servers, and associated query execution models for accessing data in such systems.
Most successful companies use data to their advantage. The data are no longer limited to easily quantifiable facts, such as point of sale transaction data. Rather, companies retain, explore, analyze, and manipulate all the available information in their purview. Ultimately, they may analyze the data to search for evidence of facts and for insights that lead to new business opportunities or that leverage their existing strengths. This is the business value behind what is often referred to as “Big Data”.
Big data is “big” because it comprises massive quantities, frequently hundreds of terabytes or more, of both structured and unstructured data. Among the problems associated with such big data is the difficulty of quickly and efficiently analyzing the data to obtain relevant information. Conventional relational databases store structured data and have the advantage of being compatible with the structured query language (SQL), a widely used, powerful, and expressive data analysis language. Increasingly, however, much of big data is unstructured or multi-structured data for which conventional relational database architectures are unsuited, and for which SQL is unavailable. This has prompted interest in other types of data processing platforms.
The Apache Software Foundation's open source Hadoop distributed file system (HDFS) has rapidly emerged as one of the preferred solutions for big data analytics applications that grapple with vast repositories of unstructured or multi-structured data. It is flexible, scalable, inexpensive, and fault-tolerant, and is well suited for textual pattern matching and batch processing, which has prompted its rapid rate of adoption for big data applications. HDFS is a simple but extremely powerful distributed file system that can be implemented on a large cluster of commodity servers with thousands of nodes storing hundreds of petabytes of data, which makes it attractive for storing big data. However, Hadoop is not SQL compliant and, as such, does not have available to it the richness of expression and analytic capabilities of SQL systems. SQL based platforms are better suited to near real-time numerical analysis and interactive data processing, whereas HDFS is better suited to batch processing of large unstructured or multi-structured data sets.
A problem with such distinctly different data processing platforms is how to combine the advantages of the two platforms by making data resident in one data store available to the platform with the best processing model. The attractiveness of Hadoop in being able to handle large volumes of multi-structured data on commodity servers has led to its integration with MapReduce, a parallel programming framework that works with HDFS and allows users to express data analysis algorithms in terms of a limited number of functions and operators. It has also led to the development of SQL-like query engines, e.g., Hive, which compile a limited SQL dialect to interface with MapReduce. While this addresses some of the expressiveness shortcomings by affording some query functionality, it is slow and lacks the richness and analytical power of true SQL systems.
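By way of illustration only, the MapReduce programming style referred to above may be sketched as follows. This is a minimal single-process sketch in Python, not the Hadoop API and not part of the invention; the function names map_words, reduce_counts, and run_map_reduce are hypothetical and are introduced solely for this example. It shows an analysis (here, counting words) expressed in terms of just two user-supplied functions, a map function and a reduce function, with a grouping step between them.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(record: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical): emit a (word, 1) pair for each word in a record."""
    for word in record.lower().split():
        yield word, 1


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step (hypothetical): sum all values emitted for one key."""
    return key, sum(values)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data is big", "data is distributed"]
    print(run_map_reduce(lines, map_words, reduce_counts))
```

In a distributed implementation the map and reduce steps would run in parallel on many nodes and the grouping step would move data between them. The sketch only conveys why the model is restrictive compared with SQL: every analysis must be cast as a map function and a reduce function.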
One reason for the slowness of HDFS with MapReduce is the need to access the metadata required for executing queries. In a distributed file system architecture such as HDFS, the data is distributed evenly across the multiple nodes. If the metadata required for queries is also distributed among many individual metadata stores on the multiple distributed nodes, it is quite difficult and time-consuming to maintain consistency in the metadata. An alternative approach is to use a single central metadata store that can be centrally maintained. Although a single metadata store can be used to address the metadata consistency problem, it has been impractical in MPP database systems. A single central metadata store is subject to large numbers of concurrent accesses from multiple nodes running parallel queries, such as is the case with HDFS, and this approach does not scale well. The system slows rapidly as the number of concurrent accesses to the central store increases. Thus, while HDFS has many advantages for big data applications, it also has serious performance disadvantages. A similar problem exists in using a central metadata store in conventional MPP relational databases that require large numbers of concurrent accesses. What is needed is a different execution model and approach for executing queries in such distributed big data stores.
It is desirable to provide systems and methods that afford execution models and approaches for massively parallel query processing in distributed file systems that address the foregoing and other problems of MPP distributed data storage systems and methods, and it is to these ends that the present invention is directed.