Avoiding Data Skew in Sqoop
Data Skew in Sqoop is very common in today's big data large scale implementations. However, there aren't many sources/material for troubleshooting the problem and googling doesn't help either. I have tried the below approach but none of them helped.
a) Identify one more column in the composite key for the split by column. However, in my use case, other columns in the composite key columns did not have high cardinality and thus the number of mappers were still relatively low
b) Sqoop currently doesn't support splitting by multiple columns so this was a limitation in sqoop
c) Increasing the number of mappers would'nt help since 99% of the data was going into only one mapper which was causing a bottleneck.
d) Using boundary conditions for Sqoop didn't align with our use case since I was sqooping the data in an incremental fashion using the date column. The Sqoop incremental unload was based on the date column on a daily basis.
e) Finally tried using Oracle virtual columns like row_num. Using this approach Sqoop was'nt able to detect the virtual column in the table schema.
However, After thinking through multiple scenarios, I came up with a unique solution by adding row_num to the existing table schema and creating a Materialized View. This significantly improved the performance by 5 times and data was now evenly distributed among the mappers.
I hope this approach will be helpful for troubleshooting the data skew in Sqoop.
a) Identify one more column in the composite key for the split by column. However, in my use case, other columns in the composite key columns did not have high cardinality and thus the number of mappers were still relatively low
b) Sqoop currently doesn't support splitting by multiple columns so this was a limitation in sqoop
c) Increasing the number of mappers would'nt help since 99% of the data was going into only one mapper which was causing a bottleneck.
d) Using boundary conditions for Sqoop didn't align with our use case since I was sqooping the data in an incremental fashion using the date column. The Sqoop incremental unload was based on the date column on a daily basis.
e) Finally tried using Oracle virtual columns like row_num. Using this approach Sqoop was'nt able to detect the virtual column in the table schema.
However, After thinking through multiple scenarios, I came up with a unique solution by adding row_num to the existing table schema and creating a Materialized View. This significantly improved the performance by 5 times and data was now evenly distributed among the mappers.
I hope this approach will be helpful for troubleshooting the data skew in Sqoop.
Good analysis Abhinay!
ReplyDeleteThank you.Well it was nice post and very helpful information on Big Data Hadoop Online Course Bangalore
ReplyDeleteI have used the row_num approach as well for some Oracle migrations. In addition, I also use a technique that uses an array with xargs to avoid the skew, parallelize the ingest, and throttle the concurrent work.
ReplyDeleteI blogged about the approach here...
http://idledeveloper.com/use-xargs-to-handle-split-by-skew-in-sqoop/
Essentially the solution I use looks something like this...
#pseudo code...
do_work(){
sqoop import \
... \
--query "SELECT * FROM myDb.myTable WHERE order_date = $1 AND \$CONDITIONS"
}
export -f do_work
declare -a order_dates=(20190101, 20190102, ... 20190131, 20190201, ...)
printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "{}"'
perde modelleri
ReplyDeletesms onay
TÜRK TELEKOM MOBİL ÖDEME BOZDURMA
Nftnasilalinir
Ankara Evden Eve Nakliyat
trafik sigortası
dedektör
WEBSİTE KURMA
ASK ROMANLARİ
Smm Panel
ReplyDeletesmm panel
iş ilanları
instagram takipçi satın al
hirdavatciburada.com
beyazesyateknikservisi.com.tr
Servis
tiktok jeton hilesi