Drop Duplicates from a DataFrame in PySpark
In this exercise, we use the data source data.csv. You can download the file and use it for the transformations that follow.
The DataFrame.dropDuplicates() method returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns.
Syntax 1: dataframe.dropDuplicates()
With no arguments, all columns in the DataFrame are considered when deciding whether a row is a duplicate.
Syntax 2: dataframe.dropDuplicates(["Column1", "Column2"])
Here, only the specified column(s) are checked against the duplicate criterion.
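Before the CSV walkthrough below, here is a minimal, self-contained sketch of both syntaxes. The sample rows are assumptions made purely for illustration; they are not the contents of data.csv.
Python
# Minimal sketch of both syntaxes; the sample rows below are
# illustrative assumptions, not the contents of data.csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropDuplicatesSketch").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25), ("Alice", 30), ("Alice", 35)],
    ["Name", "Age"],
)

# Syntax 1: a row is a duplicate only if every column matches,
# so just the exact repeat ("Alice", 30) is removed (3 rows remain)
people.dropDuplicates().show()

# Syntax 2: only "Name" is compared, so one row is kept per
# distinct Name (2 rows remain)
people.dropDuplicates(["Name"]).show()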
Example: First, create a SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
Read the data from the CSV file and display it.
Python
# Read the data from the CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()
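Optionally, you can confirm what inferSchema produced and how many rows were read before deduplicating. This short sketch uses only the df defined above:
Python
# Inspect the inferred column types and count the rows before deduplicating
df.printSchema()
print("Rows before dropDuplicates:", df.count())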
The output of the above code is shown below:

Next, call dropDuplicates() with no arguments. A row is removed only when it matches another row across every column; since this dataset contains no such exact duplicate, no row is deleted:
Python
# Remove rows that are duplicated across all columns
df.dropDuplicates().show()

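As an aside, dropDuplicates() with no arguments behaves the same as distinct(), which likewise removes rows duplicated across every column:
Python
# distinct() is an equivalent way to remove full-row duplicates
df.distinct().show()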
When we instead specify a column name as the duplicate criterion, one row is deleted, because two rows share the same value in the Name column:
Python
# Remove rows that share the same value in the "Name" column
df.dropDuplicates(["Name"]).show()
The output of the above code is shown below:

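Finally, you can pass more than one column to tighten the duplicate criterion. This is a sketch only: the Age column here is an assumption about data.csv, made for illustration.
Python
# Keep one row per (Name, Age) pair; "Age" is an assumed column in data.csv
df.dropDuplicates(["Name", "Age"]).show()

# Comparing counts makes the effect of each criterion explicit
print("All columns:", df.dropDuplicates().count())
print("Name only  :", df.dropDuplicates(["Name"]).count())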