Posts

Showing posts from August, 2017

Handling Null Values in Spark

Handling null values in Spark can be a real pain if you are not well versed in Spark's built-in functions. As part of our ETL logic, we were massaging data stored on S3 in Parquet format. However, loading that data into Redshift kept failing because the null values could not be inserted into the char and number fields of the Redshift schema. Hence, we had to massage the data again, first converting the char null values to blank values. Below is a snapshot of the original data, which had null values in the V_DQ_SEVERITY column (char datatype) and the N_DEFAULT_COUNT column (number datatype).

V_DQ_SEVERITY  N_DEFAULT_COUNT
null           0
null           0
E              null
E              null
E         ...
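As a rough illustration of the kind of null replacement described above, here is a minimal PySpark-style sketch. It relies on `DataFrame.dtypes` and `DataFrame.na.fill`, which accepts a per-column dictionary. The helper names `fill_values_for` and `fill_nulls` are hypothetical, and replacing numeric nulls with 0 is an assumption: the excerpt only states that char nulls were converted to blanks.

```python
# Prefixes of the type strings that DataFrame.dtypes reports for text
# and numeric columns (e.g. "string", "bigint", "decimal(10,0)").
STRING_PREFIXES = ("string", "char", "varchar")
NUMERIC_PREFIXES = ("tinyint", "smallint", "int", "bigint",
                    "float", "double", "decimal")


def fill_values_for(dtypes, blank="", zero=0):
    """Build a {column: replacement} dict from a list of (name, type) pairs.

    Text columns get `blank`; numeric columns get `zero` (an assumption,
    not something the original post confirms). Other types are left alone.
    """
    values = {}
    for name, type_name in dtypes:
        if type_name.startswith(STRING_PREFIXES):
            values[name] = blank
        elif type_name.startswith(NUMERIC_PREFIXES):
            values[name] = zero
    return values


def fill_nulls(df):
    """Return a copy of a Spark DataFrame with nulls replaced per column."""
    return df.na.fill(fill_values_for(df.dtypes))


def demo():
    """Hypothetical usage; needs a working pyspark installation."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    rows = [(None, 0), (None, 0), ("E", None), ("E", None)]
    df = spark.createDataFrame(
        rows, "V_DQ_SEVERITY string, N_DEFAULT_COUNT int")
    fill_nulls(df).show()
```

Deriving the replacement map from the schema means new columns do not need to be listed by hand. If only a couple of columns matter, calling `df.na.fill` with an explicit dictionary for those columns is simpler.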

Avoiding Data Skew in Sqoop

Data skew in Sqoop is very common in today's large-scale big data implementations. However, there isn't much material on troubleshooting the problem, and googling doesn't help much either. I tried the approaches below, but none of them helped.

a) Identify one more column in the composite key to use as the split-by column. However, in my use case, the other columns in the composite key did not have high cardinality, so the number of mappers was still relatively low.
b) Split by multiple columns. Sqoop currently doesn't support this, so it was a limitation in Sqoop itself.
c) Increase the number of mappers. This wouldn't help, since 99% of the data was going into a single mapper, which was the bottleneck.
d) Use boundary conditions for Sqoop. This didn't align with our use case, since I was sqooping the data incrementally, on a daily basis, using the date column.
e) Finally tried using Orac...
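To see why adding mappers (point c) does little for a skewed key, here is a small pure-Python sketch. It is a simplified model, not Sqoop's actual code: it assumes the split-by column's min-to-max range is cut into equal-width intervals, one per mapper, which is how range-based splitting on a numeric column behaves in general. The function name `rows_per_mapper` and the sample keys are hypothetical.

```python
def rows_per_mapper(values, num_mappers):
    """Count how many rows land in each equal-width split of [min, max]."""
    lo, hi = min(values), max(values)
    counts = [0] * num_mappers
    if hi == lo:
        # Every row has the same key, so one mapper gets everything.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / num_mappers
    for v in values:
        # Clamp so the maximum value falls into the last split.
        idx = min(int((v - lo) / width), num_mappers - 1)
        counts[idx] += 1
    return counts


# Hypothetical skewed key: 990 rows clustered at ids 1..990,
# plus 10 outliers spread up to 1,000,000.
skewed_keys = list(range(1, 991)) + [100_000 * i for i in range(1, 11)]

for mappers in (4, 16):
    counts = rows_per_mapper(skewed_keys, mappers)
    print(mappers, "mappers -> busiest mapper gets", max(counts), "rows")
```

In this model the first split still receives about 99% of the rows whether there are 4 or 16 mappers. The splits are defined by the range of the key, not by how many rows share each value.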