After more than two decades of electronic data automation and improved ability to capture data from a variety of communication channels and media, even small enterprises find that they are regularly processing terabytes of data. Moreover, mining, analysis, and processing of that data have become extremely complex. The average consumer expects electronic transactions to occur flawlessly and with near-instant speed. An enterprise that cannot meet consumer expectations is quickly out of business in today's highly competitive environment.
Consumers have a plethora of choices for nearly every product and service, and enterprises can be created and up and running in an industry in mere days. The competition and the expectations are breathtaking compared with what existed just a few short years ago.
Industry infrastructure and applications have generally answered the call, providing virtualized data centers that give an enterprise an ever-present data center to run and process the enterprise's data. Applications and hardware to support an enterprise can be outsourced and available to the enterprise twenty-four hours a day, seven days a week, and three hundred sixty-five days a year.
As a result, the most important asset of the enterprise has become its data, that is, the information gathered about the enterprise's customers, competitors, products, services, financials, business processes, business assets, personnel, service providers, transactions, and the like.
Updating, mining, analyzing, reporting, and accessing the enterprise information can still become problematic because of the sheer volume of this information and because often the information is dispersed over a variety of different file systems, databases, and applications.
In response, the industry has recently embraced a data platform referred to as Apache Hadoop™ (Hadoop™). Hadoop™ is an Open Source software architecture that supports data-intensive distributed applications. It enables applications to work with thousands of network nodes and petabytes (1000 terabytes) of data. Hadoop™ provides interoperability between disparate file systems, fault tolerance, and High Availability (HA) for data processing. The architecture is modular and expandable with the whole database development community supporting, enhancing, and dynamically growing the platform.
However, because of the success of Hadoop™ in the industry, enterprises now have or depend on a large volume of data that is stored external to their core in-house database management system (DBMS). This data can be in a variety of formats and types, such as: web logs; call details with customers; sensor data; Radio Frequency Identification (RFID) data; historical data maintained for government or industry compliance reasons; and the like. Enterprises have embraced Hadoop™ for data types such as those referenced above because Hadoop™ is scalable, cost efficient, and reliable.
One challenge in integrating Hadoop™ architecture with an enterprise DBMS is selectively acquiring data from Hadoop™ and importing and using that data within the enterprise DBMS.
Recently, a table-based User-Defined Function (UDF) approach was introduced to allow DBMS users to have direct access to Hadoop™ files in Structured Query Language (SQL) queries. The basic idea is that a customized table UDF pulls data from the Hadoop™ distributed file system (HDFS) into the enterprise data warehouse for manipulation. Each table UDF instance runs on a particular Access Module Processor of the parallel DBMS and is responsible for retrieving a portion of the HDFS file defined in the table UDF.
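The division of one HDFS file among several table UDF instances can be illustrated with a minimal sketch. The text above does not say how the portions are assigned, so the even, contiguous byte-range split below is an assumption, and the function name `byte_range_for_instance` and its parameters are hypothetical. The sketch uses plain Python and does not call any Hadoop™ or DBMS API.

```python
def byte_range_for_instance(file_size: int, num_instances: int, instance_id: int) -> tuple[int, int]:
    """Return the half-open byte range [start, end) assumed for one UDF instance.

    Assumption: the file is split into contiguous ranges of near-equal size,
    with any remainder bytes given to the lowest-numbered instances.
    """
    if num_instances <= 0:
        raise ValueError("num_instances must be positive")
    if not 0 <= instance_id < num_instances:
        raise ValueError("instance_id out of range")
    base, extra = divmod(file_size, num_instances)
    # Instances before this one hold `base` bytes each, plus one extra byte
    # for each of the first `extra` instances.
    start = instance_id * base + min(instance_id, extra)
    length = base + (1 if instance_id < extra else 0)
    return start, start + length


# Example: a 10-byte file shared by 3 instances.
ranges = [byte_range_for_instance(10, 3, i) for i in range(3)]
```

In a real deployment the split would more likely follow record or block boundaries than raw byte offsets; the sketch only shows that each instance can derive its own portion from its identifier and the total instance count.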
However, this approach has several problems, including the three listed below.
Firstly, the access code to HDFS is hard-coded in the table UDF, and the mapping from HDFS files to the DBMS relational data is also hard-coded. This means that if a user needs to access different columns in the same HDFS file, or to convert the same columns in the HDFS file to different types, a different table UDF has to be programmed and used. This approach is not productive when users frequently need to access different HDFS files or need access to the same HDFS file in different ways.
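The cost of a hard-coded mapping can be seen in a small sketch. The record layout, the sample line, and the function names `udf_v1_parse` and `udf_v2_parse` are hypothetical; plain Python string handling stands in for a table UDF reading from HDFS.

```python
# Hypothetical comma-delimited record standing in for one line of an HDFS file:
# date, user, item count, amount.
LINE = "2012-03-01,alice,42,19.99"


def udf_v1_parse(line: str) -> tuple[str, int]:
    """Mapping fixed in code: field 1 as text, field 2 as an integer."""
    fields = line.split(",")
    return fields[1], int(fields[2])


def udf_v2_parse(line: str) -> tuple[str, float]:
    """A different projection and type needs a second, separately written function."""
    fields = line.split(",")
    return fields[0], float(fields[3])


row_v1 = udf_v1_parse(LINE)
row_v2 = udf_v2_parse(LINE)
```

Both functions read the same record, yet each new choice of columns or types requires new code, which is the productivity problem described above.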
Secondly, although data filtering and transformation can be done by the UDF as the rows are delivered by the UDF to the SQL processing, the HDFS files are first transferred in full to the parallel DBMS, even when some of the data in the imported file is later discarded during processing on the DBMS. This can be an unfortunate waste of network bandwidth.
Thirdly, users have to develop and write the customized UDFs, which requires knowledge of the Hadoop™ system and its file system's Application Programming Interface (API), and the users also need knowledge of the DBMS's table UDF infrastructure.
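The bandwidth concern can be made concrete with a rough sketch that compares the bytes moved with the bytes that survive a filter applied only after the transfer. The sample rows, the predicate, and the function name `transferred_vs_used` are hypothetical, and byte counts of in-memory strings stand in for network traffic.

```python
from typing import Callable


def transferred_vs_used(rows: list[str], keep: Callable[[str], bool]) -> tuple[int, int]:
    """Return (bytes transferred, bytes kept) when filtering happens after transfer."""
    # Every row crosses the network, regardless of the filter.
    transferred = sum(len(r.encode("utf-8")) for r in rows)
    # Only rows passing the filter are actually used by the query.
    used = sum(len(r.encode("utf-8")) for r in rows if keep(r))
    return transferred, used


rows = ["us,10", "de,20", "us,30", "fr,40"]
moved, used = transferred_vs_used(rows, lambda r: r.startswith("us,"))
```

In this toy case half of the transferred bytes are discarded after arrival; the fraction in practice depends on how selective the query is.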