Avoiding Data Skew in Sqoop

Data skew in Sqoop is very common in today's large-scale big data implementations. However, there isn't much material on troubleshooting the problem, and searching online doesn't help much either. I tried the approaches below, but none of them helped.

a) Identify another column in the composite key to use as the split-by column. In my use case, however, the other columns in the composite key did not have high cardinality, so the number of effective mappers was still relatively low.

b) Sqoop currently doesn't support splitting by multiple columns, so this was a limitation of Sqoop itself.

c) Increasing the number of mappers wouldn't help, since 99% of the data was going to a single mapper, which was the bottleneck.

d) Using a custom boundary query didn't fit our use case, since I was importing the data incrementally based on a date column: the Sqoop incremental unload ran daily, keyed on that date.

e) Finally, I tried using Oracle virtual columns such as row_num. With this approach, Sqoop wasn't able to detect the virtual column in the table schema.
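For context, the kind of import that exhibits the skew looks roughly like the sketch below (all names are hypothetical: the connection string, MYDB.ORDERS, CUSTOMER_ID, ORDER_DATE). With a low-cardinality split-by column, Sqoop's min/max range split sends almost all rows to one mapper:

```shell
# Sketch of a skewed baseline import; hypothetical names throughout.
# The command is echoed rather than executed, since running it needs a
# live Oracle database and Hadoop cluster.
build_import_cmd() {
  local last_date="$1"
  echo "sqoop import \
    --connect jdbc:oracle:thin:@db-host:1521/ORCL \
    --table MYDB.ORDERS \
    --split-by CUSTOMER_ID \
    --num-mappers 8 \
    --incremental append --check-column ORDER_DATE --last-value ${last_date}"
}

# Print the command that a daily incremental run would issue.
build_import_cmd 2019-01-31
```

If CUSTOMER_ID has only a handful of distinct values, most of the eight ranges Sqoop computes between its min and max are empty, and one mapper does nearly all the work.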

However, after thinking through multiple scenarios, I came up with a solution: adding row_num to the existing table schema by creating a materialized view. This improved performance fivefold, and the data was now evenly distributed among the mappers.
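A minimal sketch of the idea follows; the object names (MV_ORDERS, MYDB.ORDERS) and connection details are invented for illustration. The one-time Oracle DDL, shown here in comments, materializes the table with a dense row_num column, and the import then splits on that column instead of the skewed key:

```shell
# Hypothetical sketch of the materialized-view fix.
# One-time DDL on the Oracle side (roughly):
#
#   CREATE MATERIALIZED VIEW MV_ORDERS AS
#   SELECT ROWNUM AS row_num, o.* FROM MYDB.ORDERS o;
#
# Because row_num is dense and monotonically increasing, Sqoop's range
# split gives each mapper an equal share of rows. The command is echoed
# rather than executed, since running it needs a live cluster.
echo "sqoop import \
  --connect jdbc:oracle:thin:@db-host:1521/ORCL \
  --table MV_ORDERS \
  --split-by ROW_NUM \
  --num-mappers 8"
```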

I hope this approach helps others troubleshoot data skew in Sqoop.


  2. I have used the row_num approach for some Oracle migrations as well. In addition, I use a technique that combines an array with xargs to avoid the skew, parallelize the ingest, and throttle the concurrent work.

    I blogged about the approach here...

    Essentially the solution I use looks something like this...

    # pseudo code: wrap the import in a function so xargs can invoke it
    do_work() {
      sqoop import \
      ... \
      --query "SELECT * FROM myDb.myTable WHERE order_date = $1 AND \$CONDITIONS"
    }
    export -f do_work

    # note: bash array elements are separated by whitespace, not commas
    declare -a order_dates=(20190101 20190102 ... 20190131 20190201 ...)

    printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "{}"'
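A self-contained version of the same pattern, with the sqoop call stubbed out so it runs anywhere (the dates are placeholders; a real do_work would invoke sqoop import with $1 as the date filter):

```shell
# Stub do_work so the xargs throttling pattern can run without a cluster;
# it just echoes the partition it would have imported.
do_work() {
  echo "imported partition order_date=$1"
}
export -f do_work

# Placeholder dates standing in for the real ingest range.
order_dates=(20190101 20190102 20190103 20190104)

# -I {} substitutes each date into the command; --max-procs=3 caps the
# number of concurrent invocations at 3 (GNU xargs).
printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "{}"'
```

Because the jobs run in parallel, the output lines may appear in any order, but there is exactly one line per date.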

