Apache Sqoop Performance Tuning Benchmark

Problem 1: Reduce the Time Taken to Perform a Sqoop Job

Solution 1: Increase the number of parallel map tasks by specifying an appropriate value for the -m (--num-mappers) parameter.

Q: How do you determine the right value for -m?

Ans: Hadoop 2.x (HDI 3.x) uses YARN, and each YARN task is assigned a container with a memory limit; in other words, each mapper requires a container to run. So if you can get a rough estimate of the maximum number of containers available in your cluster, you can use that number as a starting point for -m, assuming no other job is running in the cluster and you do not want multiple waves of mappers to run in serial.

Get the following configuration values from the cluster:

mapreduce.map.memory.mb = 768

mapreduce.reduce.memory.mb = 1536

yarn.app.mapreduce.am.resource.mb = 768
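One way to look these values up (the /etc/hadoop/conf path below is an assumption based on a typical Linux Hadoop client layout; on HDInsight or other distributions the files may live elsewhere, or you can read them from Ambari) is to grep the site files directly:

# path is an assumed default; adjust to wherever your client configuration lives
grep -A 1 "mapreduce.map.memory.mb" /etc/hadoop/conf/mapred-site.xml

grep -A 1 "yarn.app.mapreduce.am.resource.mb" /etc/hadoop/conf/mapred-site.xml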

For example, if a worker node has 7 GB of RAM, of which roughly 6 GB is available to YARN (the rest being reserved for the OS and other services), then you can fit (6 * 1024) / 768 = 8 mapper containers per worker node.

So with 4 worker nodes in the cluster, we can specify 8 * 4 = 32 mappers.
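Plugging that estimate into an import command looks like this (the connection string and table are the same sample ones used in the examples below; the split column id is a placeholder for a numeric column in your table):

# --split-by id is a placeholder; if the table has a primary key, Sqoop splits on it automatically and --split-by can be omitted
sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --split-by id -m 32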

Solution 2: Use Direct Mode (Sqoop version 1.4.5)

The --direct argument specifies the direct import fast path when you import data from a database for which Sqoop has a direct connector, such as MySQL and PostgreSQL (and, as of Sqoop 1.4.5, Oracle).

Rather than using the JDBC interface for transferring data, direct mode delegates the transfer to the native utilities provided by the database vendor. In the case of MySQL, mysqldump and mysqlimport are used for retrieving data from the database server or moving data back; in the case of PostgreSQL, Sqoop takes advantage of the pg_dump utility to import data. Native utilities can greatly improve performance, as they are optimized to provide the best possible transfer speed while putting less load on the database server. Note, however, that several limitations come with this faster import (for example, large objects such as BLOB and CLOB columns are not supported).

sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --direct


Solution 3: Importing Data Using Fetch Size

The --fetch-size argument specifies the number of entries Sqoop reads from the database at a time.

Use the following syntax: --fetch-size=<n>
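For example (the fetch size of 10000 is only an illustrative value; tune it based on row width and available memory):

# 10000 is an example value, not a recommendation
sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size=10000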

Solution 4: Inserting Data in Batches

The --batch argument groups related SQL statements into a batch when you export data.

The JDBC interface exposes an API for executing a prepared statement with multiple sets of values in a single batch, and Sqoop can take advantage of it. This API is present in all JDBC drivers because it is required by the JDBC specification. Enable JDBC batching with the --batch parameter.
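For example (the export directory is a hypothetical HDFS path; point it at the location of your data):

# /user/hadoop/stocks is a placeholder for the HDFS directory holding the data to export
sqoop export --connect jdbc:mysql://localhost/sqoop_test --table stocks --export-dir /user/hadoop/stocks --batch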

Problem 2: Create a Table in Hive Automatically, with Data Types Matching the RDBMS Table

Solution: Yes, we can create the table in Hive automatically using the command below, and the column data types will be mapped to match the RDBMS table.

sqoop create-hive-table --connect jdbc:mysql://localhost/test1 --table mytable3 --hive-table tab3 --username test -P
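Note that create-hive-table only creates the table definition. If the goal is to create the table and load the data in one step, a regular import with --hive-import does both (shown here with the same sample connection details):

sqoop import --connect jdbc:mysql://localhost/test1 --table mytable3 --hive-import --hive-table tab3 --username test -P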

Problem 3: Export a Hadoop File as CSV and Download It Locally

Hue does not support a full download of the data; it limits the number of records returned.
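A common workaround (the HDFS and local paths below are placeholders, and it assumes the files under the source directory are already delimited text) is to merge the output files on HDFS and copy the result to the local filesystem:

# placeholder paths; the source directory must contain delimited text files
hdfs dfs -getmerge /user/hive/warehouse/stocks /tmp/stocks.csv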
