Handling Null Values in Spark
Handling null values in Spark can be a real pain if you are not well versed with the built-in Spark functions. As part of our ETL logic, we were massaging data on S3 that was stored in Parquet format. However, the load into Redshift kept failing because null values could not be inserted into the char and number fields of the Redshift schema. We therefore had to massage the data again, first converting the char null values to blank values.
Below is a snapshot of the original data, which had null values in the V_DQ_SEVERITY column (char datatype) and the N_DEFAULT_COUNT column (number datatype).
V_DQ_SEVERITY | N_DEFAULT_COUNT
null          | 0
null          | 0
E             | null
E             | null
E             | null
Below is the code snippet that first converts all null string values to blank values.
First, read the Parquet file using Spark:
val deltaVltdDF = sqlContext.parquetFile("<Path to S3 Location>") // older Spark 1.x API; newer versions use spark.read.parquet
val deltaVltdDF1 = deltaVltdDF.na.fill("") // Replace all null string values with blank strings (fill("") only affects string columns)
V_DQ_SEVERITY | N_DEFAULT_COUNT
              | 0
              | 0
E             | null
E             | null
E             | null
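For anyone who wants to try this step without a Parquet file on S3, below is a minimal, self-contained sketch of the same na.fill("") call on a toy DataFrame. The SparkSession setup and the object/main wrapper are added here only to make the sketch runnable and are not part of the original job (which used the older sqlContext API); the column names mirror the tables above. The point it illustrates is that na.fill("") only touches string columns and leaves numeric nulls untouched.

import org.apache.spark.sql.SparkSession

object NullFillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("null-fill-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data mirroring the post: a char-like string column and a numeric column, both with nulls.
    val df = Seq(
      (None: Option[String], Some(0L)),
      (None: Option[String], Some(0L)),
      (Some("E"), None: Option[Long])
    ).toDF("V_DQ_SEVERITY", "N_DEFAULT_COUNT")

    // na.fill("") replaces nulls only in string columns; the numeric nulls stay null.
    val strFilled = df.na.fill("")
    strFilled.show()

    spark.stop()
  }
}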
After transforming the null strings to blank values, I next converted the null numeric values to 0 for further processing before finally loading the data into Redshift.
Below is the code snippet for transforming null numeric values to 0.
val map = deltaVltdDF1.columns.map((_, "0")).toMap // Map every column name to the replacement value "0"
val deltaVltdDF2 = deltaVltdDF1.na.fill(map)
V_DQ_SEVERITY | N_DEFAULT_COUNT
              | 0
              | 0
E             | 0
E             | 0
E             | 0
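As a side note, the same result can be achieved in a single pass by handing na.fill a map of per-column replacement values. This is just an alternative sketch of the same idea, not the approach used above; the column names are the ones from the tables in this post.

// Alternative (single pass): fill string and numeric nulls together with per-column defaults.
val deltaVltdAlt = deltaVltdDF.na.fill(Map(
  "V_DQ_SEVERITY"   -> "",  // char column: null -> blank
  "N_DEFAULT_COUNT" -> 0    // number column: null -> 0
))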
After the above transformations, I was able to successfully load the data into Redshift without worrying about null values.
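For completeness, a Spark-to-Redshift load typically goes through a connector such as the spark-redshift library. The snippet below is only a sketch of what that write might look like; the connector choice, connection URL, table name, and temp directory are placeholders and assumptions, not details from the pipeline described above.

// Sketch only: write the cleaned DataFrame to Redshift via the spark-redshift connector.
// The URL, table name, and tempdir are placeholders, not values from the original job.
deltaVltdDF2.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<target_table>")
  .option("tempdir", "<Path to S3 temp location>")
  .mode("append")
  .save()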
I had to spend a lot of time handling this data transformation, so I hope the above information is useful.