Replace the value in the DataFrame in PySpark

The DataFrame.replace() method in PySpark returns a new DataFrame in which a given value has been replaced with another value.

Syntax: dataframe.replace(oldvalue, newvalue, ["columnname1", "columnname2"])

In the above syntax, the list of column names is optional. If it is omitted, the old value is replaced in every column whose data type matches the replacement value (for example, a string replacement only affects string columns).

This exercise uses the data source data.csv. You can download the data source and use it for the transformation.

Example: First, create the SparkSession.

Python

# Import the SparkSession class
from pyspark.sql import SparkSession

# Initialize a Spark session (or reuse an existing one)
spark = SparkSession.builder.appName("App Name").getOrCreate()

Read the data from the CSV file and show the data after reading.

Python

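# Read the CSV file; header=True uses the first row as column names,
# and inferSchema=True lets Spark infer each column's data type
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()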

The output of the above code is shown below:

[Image: output of df.show() for the DataFrame read from data.csv]

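Let’s replace the value “TCS” with the value “Tata” in all the columns of the DataFrame. Only cells whose entire value equals “TCS” are replaced; the method does not replace substrings.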

Python

# Replace every cell equal to "TCS" with "Tata", then show the result
df = df.replace("TCS", "Tata")
df.show()

The output of the above code is shown below:

[Image: output of df.show() after replacing “TCS” with “Tata”]