Drop Nulls from the DataFrame in PySpark

DataFrame.dropna() removes rows containing null values and returns a new DataFrame; the original DataFrame is left unchanged.

Syntax a) df.dropna()
b) df = df.dropna(subset=['column1'])

In this exercise, we use the data source employees.csv. You can download it and use it for the transformation.

Example: First, create the SparkSession and read the data from the CSV file.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Import the Data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()  

The output of the above code shows the contents of the employees DataFrame.

Let's get the count of rows in the DataFrame.

Python

df.count()  

Here, we drop the rows that have null values in any of the columns of the DataFrame.

Python

# Drop rows with NULL/NA values in any column
newdf = df.dropna()
newdf.show()

Let's again get the count of rows in the DataFrame.

Python

newdf.count()  

The output of the above code shows the reduced row count after the null rows have been dropped.