Drop Duplicates from a DataFrame in PySpark

In this exercise, we use the data source data.csv. You can download it and use it for the transformation.

The dataframe.dropDuplicates() method returns a new DataFrame with duplicate rows removed, optionally considering only certain columns.

Syntax 1: dataframe.dropDuplicates(). A row counts as a duplicate only if it matches another row in every column of the DataFrame.

Syntax 2: dataframe.dropDuplicates(["Column1", "Column2"]). Only the specified column(s) are compared when checking for duplicates.

Example: First, create a SparkSession.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()   

Read the data from the CSV file and show the data after reading.

Python

# Import the Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()  

The output of the above code is shown below:

[Screenshot: output of df.show(), the original DataFrame contents]

Next, apply dropDuplicates() with no arguments. Because no two rows in this dataset match in every column, no row is deleted:

Python

df.dropDuplicates().show()  
[Screenshot: output of df.dropDuplicates().show()]

When we instead specify a column name as the duplicate criterion, one row is deleted:

Python

df.dropDuplicates(["Name"]).show() 

The output of the above code is shown below:

[Screenshot: output of df.dropDuplicates(["Name"]).show()]