After over two-decades of electronic data automation and the improved ability for capturing data from a variety of communication channels and media, even small enterprises find that the enterprise is processing terabytes of data with regularity. Moreover, mining, analysis, and processing of that data have become extremely complex. The average consumer expects electronic transactions to occur flawlessly and with near instant speed. The enterprise that cannot meet expectations of the consumer is quickly out of business in today's highly competitive environment.
Consumers have a plethora of choices for nearly every product and service, and enterprises can be created and up-and-running in the industry it mere days. The competition and the expectations are breathtaking from what existed just a few short years ago.
The industry infrastructure and applications have generally answered the call providing virtualized data centers that give an enterprise an ever-present data center to run and process the enterprise's data. Applications and hardware to support an enterprise can be outsourced and available to the enterprise twenty-four hours a day, seven days a week, and three hundred sixty-five days a year.
As a result, the most important asset of the enterprise has become its data. That is, information gathered about the enterprise's customers, competitors, products, services, financials, business processes, business assets, personnel, service providers, transactions, and the like.
Updating, mining, analyzing, reporting, and accessing the enterprise information can still become problematic because of the sheer volume of this information and because often the information is dispersed over a variety of different file systems, databases, and applications.
In response, the industry has recently embraced a data platform referred to as Apache Hadoop™ (Hadoop™). Hadoop™ is an Open Source software architecture that supports data-intensive distributed applications. It enables applications to work with thousands of network nodes and petabytes (1000 terabytes) of data. Hadoop™ provides interoperability between disparate file systems, fault tolerance, and High Availability (HA) for data processing. The architecture is modular and expandable with the whole database development community supporting, enhancing, and dynamically growing the platform.
However, because of Hadoop's™ success in the industry, enterprises now have or depend on a large volume of their data, which is stored external to their core in-house database management system (DBMS). This data can be in a variety of formats and types, such as: web logs; call details with customers; sensor data, Radio Frequency Identification (RFID) data; historical data maintained for government or industry compliance reasons; and the like. Enterprises have embraced Hadoop™ for data types such as the above referenced because Hadoop™ is scalable, cost efficient, and reliable.
One challenge in integrating Hadoop™ architecture with an enterprise DBMS is loading data from Hadoop™ and importing and using that data within the enterprise DBMS.
Moreover, loading data quickly into data warehouses (parallel DBMS) is an important task in building and maintaining enterprise data warehouses. Historically, data warehouse loading tools have been mainly focused on single client loading data to data warehouse though through multiple concurrent sessions. In a shared parallel DBMS (PDBMS), multiple nodes communicate via high-speed interconnect network and each node has its own private memory and disk(s). In current systems, there are usually multiple virtual processors (collections of software processes) running on each node to take advantage of the multiple CPUs and disks available on each node for further parallelism. These virtual processors, responsible for doing the scans, joins, locking, transaction management, and other data management work, are called Parallel Units or Access Module Processors (AMPs) in Teradata DBMS.
Tables are usually horizontally partitioned across all PUs which allows the system to exploit the I/O bandwidth of multiple disks by reading and writing them in parallel. Hash partitioning is commonly used to partition relations across all PUs. Rows of a table are assigned to an AMP by applying a hash function to Primary Index Column (PI).
The Primary Index Column is one or more attributes from the table, specified by the user or automatically chosen by the system.
Loading data to a table in parallel DBMS with defined Primary Index column (PI) requires PDBMS to load data to the right parallel computing unit (AMP) based on the hash value of the table's PI column and further requires PDBMS to sort the data received by each parallel computing unit based on the hash values of the PI column for the purpose of being able to quickly look up qualified rows based on the PI column value later. Some recent data warehouse benchmarks show that a DBMS's No Primary Index (NoPI) table feature is about 30% faster than loading to a Primary Index (PI) table. A NoPI table is a table without defined Primary Index column. Thus when a row is inserted to the table, the row can be stored on any AMP. Consequently there is a no quick or efficient way to quickly find qualified rows based on Primary Index column since no Primary Index column is defined with a NoPI table. Loading to a NoPI table does not require sorting loaded data on each DBMS Access Module Processor (AMP), which is the main reason why it is faster to load data to a NoPI table than to a PI table.