Handling Null Values in Spark


Handling null values in Spark can be a real pain if you are not well versed with the built-in Spark functions. As part of the ETL logic, we were massaging data on S3 that was stored in Parquet format. However, loading the data into Redshift was failing because null values could not be inserted into the char and number fields of the Redshift schema. Hence, we had to massage the data again, first converting the null char values to blank values.

Below is a snapshot of the original data, which had null values in the V_DQ_SEVERITY column (char datatype) and the N_DEFAULT_COUNT column (number datatype).

V_DQ_SEVERITY    N_DEFAULT_COUNT
null             0
null             0
E                null
E                null
E                null

Below is the code snippet that first converts all null string values to blank values.

First, read the Parquet file using Spark:

val deltaVltdDF = sqlContext.parquetFile("<Path to S3 Location>")
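
(Side note: sqlContext.parquetFile is the older Spark 1.x API. On Spark 2.x and later, the equivalent read would look roughly like the sketch below; the SparkSession setup shown here is just an illustrative assumption.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NullHandling").getOrCreate()  // hypothetical session setup
val deltaVltdDF = spark.read.parquet("<Path to S3 Location>")             // same read using the newer API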

val deltaVltdDF1 = deltaVltdDF.na.fill("") // replaces nulls in all string columns with blank strings


Below is the data after this step; the null values in the string column are now blanks:

V_DQ_SEVERITY    N_DEFAULT_COUNT
                 0
                 0
E                null
E                null
E                null


After transforming the null string values to blanks, I next converted the null numeric values to 0 for further processing and for the final load into Redshift.

Below is the code snippet for transforming null numeric values to 0.

val map = deltaVltdDF1.columns.map((_, "0")).toMap // builds a map from each column name to the replacement value "0"

val deltaVltdDF2 = deltaVltdDF1.na.fill(map) // fills the remaining nulls using the column-to-value map


Below is the final result; both the string and numeric null values are gone:

V_DQ_SEVERITY    N_DEFAULT_COUNT
                 0
                 0
E                0
E                0
E                0


After the above transformations, I was able to successfully load the data into Redshift without worrying about null values.
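
As an aside, the two fill steps can likely be collapsed into a single chained call, because na.fill with a numeric argument only touches numeric columns while na.fill with a string argument only touches string columns. A minimal sketch (cleanedDF is just an illustrative name):

val cleanedDF = deltaVltdDF.na.fill(0).na.fill("") // fill(0) handles numeric columns, fill("") handles string columns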

I had to spend a lot of time handling the above data transformations, so I hope this information proves useful.
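
In case it helps, below is a rough sketch of how the cleaned DataFrame could then be written to Redshift with the spark-redshift connector. The connector choice, JDBC URL, table name, and temp directory are assumptions and placeholders, since this post does not show the load step itself.

// Sketch only: writing the cleaned DataFrame to Redshift via the spark-redshift connector
deltaVltdDF2.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<target_table>")
  .option("tempdir", "<S3 temp dir>")
  .mode("append")
  .save()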





