After more than two decades of electronic data automation and an improved ability to capture data from a variety of communication channels and media, even small enterprises find that they are regularly processing terabytes of data. Moreover, mining, analysis, and processing of that data have become extremely complex. The average consumer expects electronic transactions to occur flawlessly and with near-instant speed. The enterprise that cannot meet the expectations of the consumer is quickly out of business in today's highly competitive environment.
Consumers have a plethora of choices for nearly every product and service, and enterprises can be created and up-and-running in the industry in mere days. The competition and the expectations are breathtaking compared with what existed just a few short years ago.
The industry infrastructure and applications have generally answered the call, providing virtualized data centers that give an enterprise an ever-present data center in which to run and process the enterprise's data. Applications and hardware to support an enterprise can be outsourced and available to the enterprise twenty-four hours a day, seven days a week, and three hundred sixty-five days a year.
As a result, the most important asset of the enterprise has become its data, that is, the information gathered about the enterprise's customers, competitors, products, services, financials, business processes, business assets, personnel, service providers, transactions, and the like.
Updating, mining, analyzing, reporting, and accessing the enterprise information can still become problematic because of the sheer volume of this information and because the information is often dispersed over a variety of different file systems, databases, and applications. In fact, the data and processing can be geographically dispersed over the entire globe. When processing against the data, communication may need to reach every node, or it may involve only select nodes that are dispersed over the network.
Finding the shortest communication path between nodes is referred to as the shortest path problem, which is associated with graph analysis. The single source shortest path problem is the problem of finding a shortest path between a single vertex (node) and every other vertex (node) in the graph (network). Again, this problem is complex because with large scale processing, the data spans nodes across the globe, and processing of the data cannot be handled on a single node. Moreover, to improve throughput, multiple nodes often process different portions of the data in parallel.
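For illustration only, the single source shortest path problem described above can be sketched on a single machine with Dijkstra's algorithm, a standard technique that assumes non-negative edge weights. This is not the distributed approach that large scale processing requires; the function name `single_source_shortest_paths`, the adjacency-list layout, and the sample `network` are hypothetical and chosen only for this example.

```python
import heapq


def single_source_shortest_paths(graph, source):
    """Return the shortest distance from source to every reachable vertex.

    graph maps each vertex to a list of (neighbor, weight) pairs.
    Assumption: all weights are non-negative, as Dijkstra's algorithm requires.
    """
    dist = {source: 0}
    heap = [(0, source)]  # priority queue of (distance so far, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical four-node network; edge weights stand in for communication cost.
network = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}

distances = single_source_shortest_paths(network, "A")
print(distances)
```

In this sample network the cheapest route from A to B runs through C (cost 3) rather than over the direct edge (cost 4). The sketch holds the whole graph in one process, which is exactly the limitation noted above for data that spans many nodes.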
In response, the industry has recently embraced a data platform referred to as Apache Hadoop™ (Hadoop™). Hadoop™ is an Open Source software architecture that supports data-intensive distributed applications. It enables applications to work with thousands of network nodes and petabytes of data (one petabyte being 1,000 terabytes). Hadoop™ provides interoperability between disparate file systems, fault tolerance, and High Availability (HA) for data processing. The architecture is modular and expandable, with the whole database development community supporting, enhancing, and dynamically growing the platform.
However, because of the success of Hadoop™ in the industry, enterprises now have or depend on a large volume of data that is stored external to their core in-house database management system (DBMS). This data can be in a variety of formats and types, such as: web logs; call details with customers; sensor data; Radio Frequency Identification (RFID) data; historical data maintained for government or industry compliance reasons; and the like. Enterprises have embraced Hadoop™ for data types such as those referenced above because Hadoop™ is scalable, cost efficient, and reliable.
Enterprises want a cost-effective solution to access relational data from Hadoop™ using a MapReduce™ solution, which heretofore has been elusive and spotty at best in the industry. However, some companies have sought to develop their own MapReduce™ features to improve on the Hadoop™ approach. One such advancement has occurred with Aster Data™, which extends Structured Query Language (SQL) by embedding Map Reduce (MR) processing in standard SQL as enhancements, referred to as SQL/MR.
That is, enterprises want the ability to access their internally maintained DBMSs via Hadoop™ MapReduce™ implementations to improve information integration, scalability, maintenance, and support.