Drop Nulls from a DataFrame in PySpark
The DataFrame.dropna() method removes rows that contain null values and returns a new DataFrame; the original DataFrame is left unchanged.
Syntax:
a) df.dropna()
b) df = df.dropna(subset=['column1'])
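The full PySpark signature is dropna(how='any', thresh=None, subset=None): how='any' drops a row if any of the selected columns is null, how='all' only if all of them are. To illustrate the row-dropping semantics without a running Spark session, here is a plain-Python sketch over a list of dicts (the data is made up for illustration, not taken from employees.csv):

```python
# Plain-Python illustration of DataFrame.dropna() semantics.
# Rows are dicts; None stands in for a SQL NULL.
rows = [
    {"name": "Alice", "dept": "HR", "salary": 50000},
    {"name": "Bob",   "dept": None, "salary": 60000},
    {"name": None,    "dept": None, "salary": None},
]

def dropna(rows, how="any", subset=None):
    """Keep a row unless its null pattern matches `how` over `subset`."""
    def keep(row):
        cols = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in cols]
        if how == "any":   # drop if ANY selected column is null
            return not any(nulls)
        else:              # how == "all": drop only if ALL are null
            return not all(nulls)
    return [r for r in rows if keep(r)]

print(len(dropna(rows)))                     # 1 -> only Alice's row is complete
print(len(dropna(rows, how="all")))          # 2 -> only the all-null row goes
print(len(dropna(rows, subset=["salary"])))  # 2 -> Bob's row survives
```

In real PySpark the same calls are df.dropna(), df.dropna(how="all"), and df.dropna(subset=["salary"]).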
In this exercise, we use the data source employees.csv. You can download the file and use it for the transformation.
Example: First, create the SparkSession and read the data from the CSV file.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Read the data from the CSV file
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()
Running df.show() prints the contents of employees.csv as a table.
Let's get the count of rows in the DataFrame.
Python
df.count()
Here, we drop the rows that have null values in any of the columns of the DataFrame.
Python
# Drop rows with NULL/NA values in any column
newdf = df.dropna()
newdf.show()
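Besides how and subset, dropna() also accepts thresh: keep a row only if it has at least thresh non-null values (when thresh is given, it takes precedence over how). A plain-Python sketch of that behavior, again with made-up rows:

```python
# Sketch of the `thresh` option: keep rows having at least `thresh`
# non-null values, regardless of which columns those values fall in.
rows = [
    ("Alice", "HR", 50000),
    ("Bob", None, 60000),
    (None, None, None),
]

def dropna_thresh(rows, thresh):
    return [r for r in rows if sum(v is not None for v in r) >= thresh]

print([r[0] for r in dropna_thresh(rows, 2)])  # ['Alice', 'Bob']
print([r[0] for r in dropna_thresh(rows, 3)])  # ['Alice']
```

The equivalent PySpark call would be df.dropna(thresh=2).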
Let's again get the count of rows in the DataFrame.
Python
newdf.count()
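Comparing the two counts tells you how many rows were dropped; in PySpark that is simply df.count() - newdf.count(). A quick sketch of the arithmetic with hypothetical counts (employees.csv may produce different numbers):

```python
# Hypothetical values standing in for df.count() and newdf.count()
total_rows = 110      # rows before dropna()
remaining_rows = 100  # rows after dropna()

dropped = total_rows - remaining_rows
print(dropped)  # 10 rows contained at least one null
```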
newdf.count() returns the number of rows remaining after the null rows were removed.