Avoiding Data Skew in Sqoop

Data skew in Sqoop is very common in today's large-scale big data implementations. However, there isn't much material on troubleshooting the problem, and searching online doesn't help much either. I tried the approaches below, but none of them helped.

a) Identify another column in the composite key to use as the split-by column. In my use case, however, the other columns in the composite key did not have high cardinality, so the number of effective mappers was still relatively low.

b) Sqoop currently doesn't support splitting by multiple columns, so this was a limitation of Sqoop itself.

c) Increasing the number of mappers wouldn't help, since 99% of the data was going to a single mapper, which was the bottleneck.

d) Using a custom boundary query didn't fit our use case, since I was importing the data incrementally based on a date column: the Sqoop incremental unload ran daily, keyed on that date.

e) Finally, I tried using Oracle virtual columns such as row_num. With this approach, Sqoop wasn't able to detect the virtual column in the table schema.
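For context, the kind of import that exhibits the skew looks roughly like the sketch below (all names are hypothetical: the connection string, MYDB.ORDERS, CUSTOMER_ID, ORDER_DATE). With a low-cardinality split-by column, Sqoop's min/max range split sends almost all rows to one mapper:

```shell
# Sketch of a skewed baseline import; hypothetical names throughout.
# The command is echoed rather than executed, since running it needs a
# live Oracle database and Hadoop cluster.
build_import_cmd() {
  local last_date="$1"
  echo "sqoop import \
    --connect jdbc:oracle:thin:@db-host:1521/ORCL \
    --table MYDB.ORDERS \
    --split-by CUSTOMER_ID \
    --num-mappers 8 \
    --incremental append --check-column ORDER_DATE --last-value ${last_date}"
}

# Print the command that a daily incremental run would issue.
build_import_cmd 2019-01-31
```

If CUSTOMER_ID has only a handful of distinct values, most of the eight ranges Sqoop computes between its min and max are empty, and one mapper does nearly all the work.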

However, after thinking through multiple scenarios, I came up with a solution: adding row_num to the existing table schema by creating a materialized view. This improved performance fivefold, and the data was now evenly distributed among the mappers.
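A minimal sketch of the idea follows; the object names (MV_ORDERS, MYDB.ORDERS) and connection details are invented for illustration. The one-time Oracle DDL, shown here in comments, materializes the table with a dense row_num column, and the import then splits on that column instead of the skewed key:

```shell
# Hypothetical sketch of the materialized-view fix.
# One-time DDL on the Oracle side (roughly):
#
#   CREATE MATERIALIZED VIEW MV_ORDERS AS
#   SELECT ROWNUM AS row_num, o.* FROM MYDB.ORDERS o;
#
# Because row_num is dense and monotonically increasing, Sqoop's range
# split gives each mapper an equal share of rows. The command is echoed
# rather than executed, since running it needs a live cluster.
echo "sqoop import \
  --connect jdbc:oracle:thin:@db-host:1521/ORCL \
  --table MV_ORDERS \
  --split-by ROW_NUM \
  --num-mappers 8"
```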

I hope this approach helps others troubleshoot data skew in Sqoop.


  2. I have used the row_num approach for some Oracle migrations as well. In addition, I use a technique that combines an array with xargs to avoid the skew, parallelize the ingest, and throttle the concurrent work.

    I blogged about the approach here...

    Essentially the solution I use looks something like this...

    # pseudo code: wrap the import in a function so xargs can invoke it
    do_work() {
      sqoop import \
      ... \
      --query "SELECT * FROM myDb.myTable WHERE order_date = $1 AND \$CONDITIONS"
    }
    export -f do_work

    # note: bash array elements are separated by whitespace, not commas
    declare -a order_dates=(20190101 20190102 ... 20190131 20190201 ...)

    printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "{}"'
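A self-contained version of the same pattern, with the sqoop call stubbed out so it runs anywhere (the dates are placeholders; a real do_work would invoke sqoop import with $1 as the date filter):

```shell
# Stub do_work so the xargs throttling pattern can run without a cluster;
# it just echoes the partition it would have imported.
do_work() {
  echo "imported partition order_date=$1"
}
export -f do_work

# Placeholder dates standing in for the real ingest range.
order_dates=(20190101 20190102 20190103 20190104)

# -I {} substitutes each date into the command; --max-procs=3 caps the
# number of concurrent invocations at 3 (GNU xargs).
printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "{}"'
```

Because the jobs run in parallel, the output lines may appear in any order, but there is exactly one line per date.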

