Handling Null Values in Spark
Handling null values in Spark can be a real pain if one is not well versed in the built-in Spark functions. As part of our ETL logic, we were massaging data on S3 that was stored in Parquet format. However, loading the data into Redshift failed because null values could not be inserted into the char and number fields of the Redshift schema. Hence, we had to massage the data again to first convert the char null values to blank values.

Below is a snapshot of the original data, which had null values in the V_DQ_SEVERITY column (char datatype) and the N_DEFAULT_COUNT column (number datatype):

V_DQ_SEVERITY  N_DEFAULT_COUNT
null           0
null           0
E              null
E              null
E              null

Below is the code snippet that first converts all null values to blank spaces. First, read the Parquet file using Spark:

val de
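The snippet above is cut off, so here is a minimal sketch of the overall approach using Spark's built-in DataFrameNaFunctions (`na.fill`). The S3 path and application name are placeholders; the column names come from the snapshot above. Filling the numeric column with 0 is an assumption based on the Redshift error mentioning number fields as well:

```scala
import org.apache.spark.sql.SparkSession

// Assumes a SparkSession configured with S3 access; adjust to your environment.
val spark = SparkSession.builder()
  .appName("NullToBlank") // placeholder app name
  .getOrCreate()

// Read the Parquet data from S3 (path is illustrative)
val df = spark.read.parquet("s3://your-bucket/path/to/data")

val cleaned = df
  // Replace nulls in the char column with a blank string
  .na.fill("", Seq("V_DQ_SEVERITY"))
  // Replace nulls in the number column with 0 (assumed default)
  .na.fill(0, Seq("N_DEFAULT_COUNT"))

cleaned.show()
```

`na.fill` only touches columns whose type matches the fill value, so the string fill and the numeric fill are applied independently and the rest of the schema is left untouched.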