After more than two decades of electronic data automation and an improved ability to capture data from a variety of communication channels and media, even small enterprises find that they regularly process terabytes of data. Moreover, mining, analysis, and processing of that data have become extremely complex. The average consumer expects electronic transactions to occur flawlessly and with near-instant speed. An enterprise that cannot meet consumer expectations is quickly out of business in today's highly competitive environment.
Consumers have a plethora of choices for nearly every product and service, and enterprises can be created and up and running in the industry in mere days. The competition and the expectations are breathtaking compared with what existed just a few short years ago.
The industry infrastructure and applications have generally answered the call, providing virtualized data centers that give an enterprise an ever-present data center in which to run and process the enterprise's data. Applications and hardware to support an enterprise can be outsourced and made available to the enterprise twenty-four hours a day, seven days a week, and three hundred sixty-five days a year.
As a result, the most important asset of the enterprise has become its data, that is, information gathered about the enterprise's customers, competitors, products, services, financials, business processes, business assets, personnel, service providers, transactions, and the like.
Updating, mining, analyzing, reporting, and accessing the enterprise information can still become problematic because of the sheer volume of this information and because often the information is dispersed over a variety of different file systems, databases, and applications.
In response, the industry has recently embraced a data platform referred to as Apache Hadoop™ (Hadoop™). Hadoop™ is an Open Source software architecture that supports data-intensive distributed applications. It enables applications to work with thousands of network nodes and petabytes (1000 terabytes) of data. Hadoop™ provides interoperability between disparate file systems, fault tolerance, and High Availability (HA) for data processing. The architecture is modular and expandable with the whole database development community supporting, enhancing, and dynamically growing the platform.
However, because of Hadoop's™ success in the industry, enterprises now have or depend on a large volume of data that is stored external to their core in-house database management system (DBMS). This data can be in a variety of formats and types, such as: web logs; call details with customers; sensor data; Radio Frequency Identification (RFID) data; historical data maintained for government or industry compliance reasons; and the like. Enterprises have embraced Hadoop™ for data types such as those listed above because Hadoop™ is scalable, cost efficient, and reliable.
Furthermore, in-database analytics is becoming popular because data computation is being moved closer to the data. As a result, there are increasing customer demands to export data warehouse (parallel DBMS) data to external servers where complicated data analysis, such as graph analysis, can be performed. A popular trend is to use customized Hadoop™ MapReduce™ modules to perform data analysis on exported data. A particular problem in exporting data to a parallel computing platform such as Hadoop™ is that existing DBMSs do not currently have the functionality or infrastructure to support application-directed data partitioning in the exporting process. For example, a transaction table in a DBMS might be physically partitioned by transaction identifier. A Hadoop™ application might want to start up multiple tasks to analyze the transaction history by area (zip code), such that each MapReduce™ task receives the complete set of transactions for any zip code it sees and then performs some application-specific analysis.
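The mismatch in the transaction-table example can be illustrated with a minimal Python sketch. The rows, the three-partition count, and the modulo placement rule are all hypothetical stand-ins chosen for illustration; an actual DBMS would use its own placement scheme. The sketch only shows that partitioning by transaction identifier tends to scatter the transactions of a single zip code across several partitions.

```python
from collections import defaultdict

# Hypothetical rows: (transaction_id, zip_code, amount).
TRANSACTIONS = [
    (1, "92127", 20.0),
    (2, "10001", 35.5),
    (3, "92127", 12.25),
    (4, "10001", 8.0),
    (5, "92127", 99.0),
    (6, "60614", 15.0),
]


def partition_by_transaction_id(rows, num_partitions):
    """Place each row in a partition chosen from its transaction identifier.

    The modulo rule is an assumption standing in for the DBMS's real
    physical partitioning function.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0] % num_partitions].append(row)
    return dict(partitions)


def partitions_per_zip_code(partitions):
    """Return, for each zip code, the set of partitions holding its rows."""
    placement = defaultdict(set)
    for partition_id, rows in partitions.items():
        for _txn_id, zip_code, _amount in rows:
            placement[zip_code].add(partition_id)
    return dict(placement)


placement = partitions_per_zip_code(partition_by_transaction_id(TRANSACTIONS, 3))
for zip_code in sorted(placement):
    print(zip_code, sorted(placement[zip_code]))
```

A task that needs every transaction for one zip code would therefore have to read from several physical partitions, which is the gap that application-directed partitioning is meant to close.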
The following describes two solutions currently used in the industry, neither of which is efficient.
A first approach is to export the transaction table to the Hadoop™ system as an HDFS (Hadoop™ Distributed File System (DFS)) file and then run a Hadoop™ job, which manually partitions the data by zip code to perform a desired analysis. Basically, mappers read the data and partition it by zip code, and reducers perform the analysis. This approach requires physical data movement in the Hadoop™ system and often is not what customers want from a DBMS solution, since they want simplified application logic.
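The data flow of this first approach can be sketched in plain Python, without a Hadoop™ cluster. The function names, the sample rows, and the count-and-total "analysis" are hypothetical placeholders; the point is only that the repartitioning by zip code happens in the shuffle between mappers and reducers, after the export.

```python
from collections import defaultdict

# Hypothetical exported rows: (transaction_id, zip_code, amount).
EXPORTED_ROWS = [
    (1, "92127", 20.0),
    (2, "10001", 35.5),
    (3, "92127", 12.25),
    (4, "10001", 8.0),
    (5, "92127", 99.0),
    (6, "60614", 15.0),
]


def mapper(record):
    """Emit (zip_code, amount) so the shuffle can group rows by zip code."""
    _txn_id, zip_code, amount = record
    yield zip_code, amount


def shuffle(pairs):
    """Group mapper output by key; this is the physical data movement step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(zip_code, amounts):
    """Placeholder analysis: transaction count and total value per zip code."""
    return zip_code, {"count": len(amounts), "total": sum(amounts)}


def run_job(records):
    """Run the map, shuffle, and reduce phases in sequence."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(zip_code, amounts)
                for zip_code, amounts in shuffle(pairs).items())


result = run_job(EXPORTED_ROWS)
for zip_code in sorted(result):
    print(zip_code, result[zip_code])
```

In a real job the shuffle moves records between nodes over the network, which is the cost this approach incurs.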
The second approach is really a streamlined version of the first approach. It uses mappers to talk directly to the DBMS and retrieve data. In the ideal case, where the transaction table is already a Partitioned Primary Index (PPI) table partitioned by zip code, each mapper can directly send Structured Query Language (SQL) queries to the DBMS to retrieve some partitions. Therefore, no data redistribution is needed on the Hadoop™ side, and the mappers themselves can perform the same analysis as is done by the reducers in the first approach. However, this still uses the horizontal partitioning approach and does not scale as well as a vertical-partitioning-based approach. Furthermore, when the transaction table is not a PPI table, or when it is a PPI table not partitioned by zip code, each mapper still needs to either retrieve some portion of the transaction data and then redistribute the data by zip code to reducers, or request that the DBMS create a new PPI table partitioned by zip code to avoid data redistribution in the Hadoop™ system. Either way, the processing is not efficient.
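The ideal case of this second approach, in which each mapper queries the DBMS for the zip codes assigned to it and analyzes the rows locally, might look roughly like the following. This is a sketch under stated assumptions: an in-memory SQLite database stands in for the parallel DBMS, the `transactions` table and its columns are hypothetical, and the round-robin assignment of zip codes to mappers is one arbitrary choice. SQLite has no PPI tables, so the sketch shows only the query pattern, not the partition-level retrieval.

```python
import sqlite3


def build_demo_dbms():
    """Create an in-memory SQLite database as a stand-in for the DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "txn_id INTEGER PRIMARY KEY, zip_code TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, "92127", 20.0), (2, "10001", 35.5), (3, "92127", 12.25),
         (4, "10001", 8.0), (5, "92127", 99.0), (6, "60614", 15.0)],
    )
    return conn


def assign_zip_codes(zip_codes, num_mappers):
    """Round-robin assignment of zip codes to mappers (an assumption)."""
    return [zip_codes[i::num_mappers] for i in range(num_mappers)]


def mapper_task(conn, zip_codes):
    """Fetch only the assigned zip codes and analyze them in the mapper."""
    if not zip_codes:
        return {}
    placeholders = ", ".join("?" for _ in zip_codes)
    query = (
        "SELECT zip_code, amount FROM transactions "
        f"WHERE zip_code IN ({placeholders})"
    )
    stats = {}
    for zip_code, amount in conn.execute(query, zip_codes):
        entry = stats.setdefault(zip_code, {"count": 0, "total": 0.0})
        entry["count"] += 1
        entry["total"] += amount
    return stats


conn = build_demo_dbms()
assignments = assign_zip_codes(["10001", "60614", "92127"], 2)
results = [mapper_task(conn, assigned) for assigned in assignments]
print(results)
```

Because each mapper already receives every row for its zip codes, no reducer or shuffle step is needed here. When the table is not partitioned by zip code, the DBMS must scan or reorganize data to answer each such query.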