Transferring table data between a Relational Database Management System (“RDBMS”) and a big data platform can be time consuming because of the large size of the table data. Big data refers to a massive volume of both structured and unstructured data that is too large to process using traditional database techniques, e.g., executing queries serially. Big data platforms, such as Hadoop and Spark, usually adopt a distributed storage architecture (or a distributed file system) and a distributed processing architecture. Tools are usually available for transferring big data from the RDBMS to the file system of the big data platform, e.g., Sqoop is provided for Hadoop. However, these tools do not solve the problem of long transfer times.
Big data platforms attempt to achieve performance and scalability by partitioning the table data into chunks that are handled by parallel tasks. One mechanism for such parallel transfer is to submit Structured Query Language (“SQL”) statements that query the RDBMS via the Java Database Connectivity (“JDBC”) interface. Each SQL statement maps to one partition of the table data. By submitting the partitioned SQL statements concurrently from parallel tasks, high-throughput data transfer can be achieved.
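The partitioned-query mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular tool: Python's built-in sqlite3 module stands in for an RDBMS reached through JDBC, a thread pool stands in for the platform's parallel tasks, and the table name `orders`, the key column `id`, and all function names are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_partition_queries(table, key, lo, hi, num_tasks):
    """Return one SELECT per task, each covering a slice of the key range [lo, hi]."""
    span = hi - lo + 1
    queries = []
    for i in range(num_tasks):
        start = lo + span * i // num_tasks
        end = lo + span * (i + 1) // num_tasks  # exclusive upper bound
        queries.append(
            f"SELECT * FROM {table} WHERE {key} >= {start} AND {key} < {end}"
        )
    return queries


def run_partition(db_path, sql):
    """Run one partitioned query on its own connection, as a parallel task would."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def parallel_transfer(db_path, queries):
    """Submit all partitioned queries concurrently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(lambda q: run_partition(db_path, q), queries))


def demo():
    """Create a hypothetical table with keys 1..100 and fetch it with 4 tasks."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)", [(i, i * 10) for i in range(1, 101)]
        )
        conn.commit()
        conn.close()
        queries = build_partition_queries("orders", "id", 1, 100, 4)
        return queries, parallel_transfer(db_path, queries)
    finally:
        os.remove(db_path)
```

Calling `demo()` returns the four SQL statements together with the rows fetched by each task. Because the keys in this example are uniformly distributed, every task receives the same number of rows.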
Currently available parallel data transfer tools use one of two approaches to partition a transfer query. One approach, used by Sqoop, is to divide the key value range evenly among the partitioned queries. This approach can cause straggling among the parallel tasks when the table data is not evenly distributed within the key ranges, a condition commonly known as “skew”. A second approach, which handles straggling better, pre-executes a query before the data transfer to count the number of rows. This approach determines the size, in number of rows, of each parallel task and uses nonstandard SQL syntax to locate a logical starting point for each parallel task. However, the pre-executed query is often expensive because of the “sort operation” needed to guarantee a consistent result.
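The difference between the two approaches can be illustrated with a small sketch. The first function counts how many rows each task would receive under even key-range partitioning; the second builds row-count-based queries. All names are hypothetical, and `LIMIT`/`OFFSET` is used only as one example of nonstandard SQL syntax for locating a starting point; no specific construct is named above.

```python
from bisect import bisect_left


def even_range_counts(keys, num_tasks):
    """Rows per task when the key range [min, max] is split into equal-width slices."""
    keys = sorted(keys)
    lo, hi = keys[0], keys[-1]
    span = hi - lo + 1
    counts = []
    for i in range(num_tasks):
        start = lo + span * i // num_tasks
        end = lo + span * (i + 1) // num_tasks  # exclusive upper bound
        counts.append(bisect_left(keys, end) - bisect_left(keys, start))
    return counts


def row_count_queries(table, key, total_rows, num_tasks):
    """One query per task with near-equal row counts.

    total_rows is assumed to come from a pre-executed counting query.
    ORDER BY is what makes the offsets consistent across tasks, and it is
    also the sort that makes this approach expensive.
    """
    queries = []
    for i in range(num_tasks):
        offset = total_rows * i // num_tasks
        limit = total_rows * (i + 1) // num_tasks - offset
        queries.append(
            f"SELECT * FROM {table} ORDER BY {key} LIMIT {limit} OFFSET {offset}"
        )
    return queries


# Skewed keys: 90 rows clustered at 1..90 plus a single outlier at 1000.
skewed_keys = list(range(1, 91)) + [1000]
range_counts = even_range_counts(skewed_keys, 4)
count_queries = row_count_queries("orders", "id", len(skewed_keys), 4)
```

With these skewed keys, `range_counts` is `[90, 0, 0, 1]`: one task does nearly all the work while the others sit idle, which is the straggling described above. The row-count-based queries instead assign 22, 23, 23, and 23 rows to the four tasks, at the cost of a sort in every query.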
In view of the foregoing, there is a need for systems and methods for transferring table data from the RDBMS to the big data platform without straggling and without incurring an expensive query.