fillna() Method in PySpark

The fillna() method in PySpark is used to replace null or NaN values in a DataFrame with specified values.

Syntax: pyspark.sql.DataFrame.fillna(value, subset=None)

• value: The value used to replace nulls or NaNs. It can be an int, float, string, or bool, or a dictionary mapping column names to replacement values.
• subset: An optional list of column names to restrict the replacement to. If not specified, all columns are considered. When value is a dictionary, subset is ignored.

In this exercise, we use the data source employees.csv. You can download it and use it for the transformations below.

Example: First, create the SparkSession and read the data from the CSV file.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Import the Data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()  

Running this code displays the DataFrame, which contains several null values.

Let's replace all the null values in the DataFrame with 0. Because 0 is a numeric value, fillna() replaces nulls only in the numeric columns; string columns are left unchanged.

Python

newdf = df.fillna(0)
newdf.show()

The output now shows 0 in place of the nulls in the numeric columns.

Similarly, if we specify a string value, only the null values in string columns are replaced.

Python

newdf = newdf.fillna("No value Present")
newdf.show()

The output now shows "No value Present" in place of the nulls in the string columns.

We can specify different replacement values for different columns by passing a dictionary that maps each column name to its replacement value.

Python

# Replace null values with different values for different columns
df = df.fillna({'Salary': 500, 'Name': 'Not available', 'Company': 'Not found'})
df.show() 

The output shows each column's nulls replaced with the corresponding value from the dictionary.